Writing · March 29, 2026

Great AI Products Need More Than a Model

Building Cadence taught me that models create capability, but products are won in the orchestration around them.

I built a local, keyboard-first dictation app for macOS called Cadence.

Wispr makes an absolutely phenomenal dictation product. But for my workflow, I couldn't justify paying for another monthly subscription, so I decided to build a local-only app instead.

At first, the project sounded straightforward: capture audio, run Whisper locally, and insert text wherever I was typing. Whisper existed. Local inference existed. Coding agents existed. The ingredients were all there.

What I learned is that those ingredients were enough to make a prototype, not a product.

That is the core claim: the model created the capability, but the product emerged from the system around it.

Whisper was necessary. It was not sufficient.

What made Cadence usable had much less to do with "pick the right model" than I expected. The hard part was specifying the behavior around the model:

  • whether silence gets trimmed before inference
  • whether the first syllable gets clipped without a preroll buffer
  • whether live preview and final transcription should share a path
  • whether greedy decoding feels better than beam search for short dictation
  • whether the HUD clearly distinguishes recording from transcribing
  • whether keyboard shortcuts feel like gestures instead of configuration
  • whether macOS permissions feel product-like or sketchy

Those are easy details to dismiss when you say "build a local Whisper app." They are also the difference between something that technically works and something I actually want to use.
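One way to see how many of these are product decisions is to write them down as explicit settings instead of leaving them implicit in the code. This is only a sketch; the names and defaults are illustrative, not Cadence's actual configuration:

```python
from dataclasses import dataclass

# Hypothetical config capturing the behavioral decisions listed above.
# Every field here is a product decision, not a model decision.
@dataclass
class DictationConfig:
    trim_silence: bool = True           # drop dead air before inference
    preroll_ms: int = 300               # audio kept from before speech onset
    shared_preview_path: bool = False   # do live preview and final pass share a decoder?
    decode_strategy: str = "greedy"     # "greedy" or "beam" for short dictation
    beam_size: int = 5                  # only used when decode_strategy == "beam"

cfg = DictationConfig()
```

Writing the decisions out this way makes it obvious that none of them are answered by "pick the right model."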

One of the biggest surprises was that many of the choices I thought of as inference settings were really UX decisions in disguise.

Take greedy decoding versus beam search. In theory, that sounds like a model-quality question. In practice, for a dictation app, it is a latency question. The user does not experience "decoding quality" in the abstract. They experience the gap between releasing a shortcut and seeing text appear. That makes the setting part of the product, not just part of the model.
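A toy sketch makes the latency argument concrete. This is not Whisper's actual decoder; it just counts scoring calls over a fake vocabulary, since in a real decoder the model forward pass is the expensive step, so the number of calls is a rough proxy for latency:

```python
import heapq

VOCAB = {"a": 0.5, "b": 0.3, "c": 0.2}  # fake per-token probabilities

def step_score(prefix, token):
    # Stand-in for a model forward pass (the expensive part in practice).
    return VOCAB[token]

def greedy_decode(steps):
    # Greedy keeps one hypothesis and scores the vocabulary once per step.
    prefix, calls = [], 0
    for _ in range(steps):
        scores = {}
        for t in VOCAB:
            scores[t] = step_score(prefix, t)
            calls += 1
        prefix.append(max(scores, key=scores.get))
    return prefix, calls

def beam_decode(steps, beam_size):
    # Beam search keeps `beam_size` hypotheses alive, multiplying the work.
    beams, calls = [((), 1.0)], 0
    for _ in range(steps):
        candidates = []
        for prefix, p in beams:
            for t in VOCAB:
                candidates.append((prefix + (t,), p * step_score(prefix, t)))
                calls += 1
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
    return list(beams[0][0]), calls

_, greedy_calls = greedy_decode(steps=10)
_, beam_calls = beam_decode(steps=10, beam_size=5)
# Beam search does roughly beam_size times the scoring work per step,
# and for a two-second dictation that multiplier is pure perceived lag.
```

For long-form transcription the extra search can pay for itself; for short dictation, the user feels every one of those extra calls between key-up and text.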

The same thing was true for silence trimming and preroll. If the app captures dead air, latency gets worse. If it starts listening too late, the first word gets clipped. Neither problem is solved by saying "use a bigger model." They are pipeline problems, which means they are product problems.
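Both fixes fit in a few lines once you know what should be true. Here is a hedged sketch over a plain list of samples (a real app works on mic buffers, and the threshold and preroll length are illustrative values, not Cadence's):

```python
def trim_with_preroll(samples, threshold=0.05, preroll=3):
    """Drop leading silence, but keep `preroll` samples from before the
    first loud sample so the first syllable is not clipped."""
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return samples[max(0, i - preroll):]
    return []  # nothing above the threshold: all silence

audio = [0.0, 0.01, 0.0, 0.02, 0.0, 0.6, 0.5, 0.4]
trimmed = trim_with_preroll(audio)
# Dead air is gone, but a little context before the speech onset survives.
```

Dropping `preroll` to zero clips onsets; dropping `trim_with_preroll` entirely feeds dead air to the model. The model is untouched either way, which is the point.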

That was the pattern across the whole build. The coding agent could implement changes quickly once I knew what was wrong. It could refactor the HUD, change defaults, wire up transcript history, and patch hotkey behavior. What it could not do reliably was notice why the experience felt off. I still had to diagnose the failure mode, decide what should be true, and rewrite the spec.

That is why I think the most defensible lesson from this project is broader than dictation:

A model gives you capability. A product gives you reliable value.

The gap between those two things is where product work lives.

That is also why I think so many AI products break through later than the underlying model breakthrough suggests they should. The model makes a new behavior possible. The product only arrives once someone builds the workflow, defaults, feedback loops, state management, and trust scaffolding that let a normal person use that behavior without thinking about the machinery underneath.

That was true in my much smaller way with Cadence. Whisper made local dictation possible. Cadence only became usable once the orchestration around Whisper got good enough.

I think the same pattern explains a lot of AI product adoption more generally. Transformers existed before ChatGPT mattered to normal users. Codex existed before coding tools became daily workflows for a much wider set of builders. The breakthrough is not just raw model capability. It is the surrounding system that makes the capability legible, reliable, and worth integrating into real work.

That is not a claim that models do not matter. They obviously do. A bad model limits the ceiling of the product.

It is a narrower claim: once the model clears a capability threshold, the bottleneck often shifts to orchestration, interface, and judgment.

That is what Cadence taught me.

I did not build a useful dictation app by discovering a better model. I built a better product by learning that the model was only one component in a much larger system.

And as code generation gets cheaper, I think that distinction matters more, not less.

The scarce thing is no longer just implementation.

It is knowing what needs to be true for the experience to work.