Voice — Mayva docs

pnpm voice opens a hold-to-talk overlay, transcribes locally, classifies the intent, and runs an action. No wake word. No continuous listening. No cloud STT.

How it works

[mic capture]  ──▶  [local STT]  ──▶  [intent classifier]  ──▶  [action]  ──▶  [say back]
   sox                whisper.cpp        Haiku · Zod schema       sqlite       say

Capture — sox records mono 16kHz WAV while you hold the button (or the global shortcut).
Transcribe — whisper-cpp runs small.en against the WAV. Sub-1s on Apple Silicon.

Classify — A short Haiku call with a Zod schema that pins the output shape:

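import { z } from 'zod';

// Each variant is tagged with a literal `kind`, which the union discriminates on.
const IntentSchema = z.discriminatedUnion('kind', [
  z.object({ kind: z.literal('add_task'), title: z.string(), dueAt: z.string().optional() }),
  z.object({ kind: z.literal('dismiss_suggestion'), id: z.string() }),
  z.object({ kind: z.literal('recall'), query: z.string() }),
  z.object({ kind: z.literal('unknown') }),
]);
type Intent = z.infer<typeof IntentSchema>;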

Act — Run the action against the local DB. Conflict detection, side effects, the works.
Confirm — say reads back a one-sentence confirmation. “Added: deep work block, 4 to 4:30 pm tomorrow. Conflicts with finance review.”

Total round trip, from button release to spoken confirmation: ~2 seconds on a clean cache.
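As a rough illustration of the capture and transcribe steps above, here is a minimal sketch that builds the two command lines. The helper names, the `Command` type, and the file paths are hypothetical and not taken from Mayva's source. The sox flags (`-d`, `-c`, `-r`, `-b`) and the whisper.cpp flags (`-m`, `-f`, `--no-timestamps`) are standard options of those tools. The whisper.cpp binary name varies between installs, so it is passed in as a parameter.

```typescript
// Hypothetical helpers: build argv arrays for the capture and transcribe steps.
// Names and paths are placeholders, not Mayva's actual code.

interface Command {
  bin: string;
  args: string[];
}

// sox: read from the default input device (-d) and write a
// mono (-c 1), 16 kHz (-r 16000), 16-bit (-b 16) WAV file.
function captureCommand(wavPath: string): Command {
  return { bin: 'sox', args: ['-d', '-c', '1', '-r', '16000', '-b', '16', wavPath] };
}

// whisper.cpp CLI: -m selects the model file, -f the input WAV,
// and --no-timestamps prints plain text without time ranges.
function transcribeCommand(bin: string, modelPath: string, wavPath: string): Command {
  return { bin, args: ['-m', modelPath, '-f', wavPath, '--no-timestamps'] };
}
```

Keeping command construction separate from process spawning makes the flags easy to inspect and test without touching the microphone.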

Backends

pnpm voice checks for transcription backends in this order:

Backend	Notes
`whisper-cpp`	Fast. The default. `brew install whisper-cpp`.
`openai-whisper`	Python. `pip install openai-whisper`. Fallback if no `whisper-cpp` on PATH.
`mlx-whisper`	Apple Silicon optimized. Fastest on M-series. `pip install mlx-whisper`.

You can pin the backend with VOICE_BACKEND=mlx pnpm voice.
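The lookup order and the pin can be sketched as a pure function. This is an assumption-laden illustration, not Mayva's implementation: the `Backend` type, `PROBE_ORDER`, and `pickBackend` are hypothetical names, and the rule that a pin such as `mlx` matches `mlx-whisper` by prefix is a guess based on the example above.

```typescript
// Hypothetical sketch of backend selection; not Mayva's actual code.
type Backend = 'whisper-cpp' | 'openai-whisper' | 'mlx-whisper';

// Probe order as listed in the table above.
const PROBE_ORDER: Backend[] = ['whisper-cpp', 'openai-whisper', 'mlx-whisper'];

function pickBackend(installed: ReadonlySet<Backend>, pinned?: string): Backend | undefined {
  if (pinned) {
    // Assumption: a pin matches a backend by full name or by prefix ("mlx" -> "mlx-whisper").
    const match = PROBE_ORDER.find((b) => b === pinned || b.startsWith(pinned + '-'));
    // A pinned backend that is not installed yields no backend rather than a silent fallback.
    return match !== undefined && installed.has(match) ? match : undefined;
  }
  // No pin: take the first installed backend in probe order.
  return PROBE_ORDER.find((b) => installed.has(b));
}
```

Returning `undefined` for a pinned but missing backend is one possible design choice. It surfaces a misconfiguration instead of quietly using a different engine.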

Why single-shot

Continuous listening is great for products that want to be ambient. Mayva isn’t. Two problems with always-on:

Privacy gradient — once you’re listening continuously, every conversation that drifts past the mic gets evaluated by a model. We don’t want the agent to know about conversations you didn’t address to it.
Cost gradient — every transcribed second is a tokenized call. Hold-to-talk caps cost at “what you intended to say.”

The trade-off: you have to push a button. We’re fine with that.

Customizing the intents

packages/desktop/src/voice/intents.ts is one file. The schema is at the top; the action map is at the bottom:

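// Each handler receives only the variant that matches its key,
// so `i.id` and `i.query` type-check without narrowing.
type Handlers = {
  [K in Intent['kind']]: (i: Extract<Intent, { kind: K }>) => Promise<string>;
};

const ACTIONS: Handlers = {
  add_task: async (i) => addTask(i).then(confirm),
  dismiss_suggestion: async (i) => dismissSuggestion(i.id).then(() => 'dismissed'),
  recall: async (i) => searchVector(i.query).then(toSentence),
  unknown: async () => "sorry, I didn't catch that",
};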

Add a new intent: add a variant to the union, add a row to the action map, ship.
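To make those two steps concrete, here is a dependency-free sketch that adds a hypothetical `complete_task` intent. It uses a plain TypeScript union as a stand-in for the Zod schema so it runs on its own. The handler bodies are placeholders, and `complete_task`, `Handlers`, and `run` are illustrative names that do not come from Mayva's source.

```typescript
// Plain TypeScript stand-in for the Zod union, plus one hypothetical new variant.
type Intent =
  | { kind: 'add_task'; title: string; dueAt?: string }
  | { kind: 'dismiss_suggestion'; id: string }
  | { kind: 'recall'; query: string }
  | { kind: 'complete_task'; id: string } // step 1: the new variant (hypothetical)
  | { kind: 'unknown' };

// A mapped type forces one handler per kind; omitting a row is a compile error.
type Handlers = {
  [K in Intent['kind']]: (i: Extract<Intent, { kind: K }>) => Promise<string>;
};

// Placeholder handlers that only build the confirmation sentence.
const ACTIONS: Handlers = {
  add_task: async (i) => `added: ${i.title}`,
  dismiss_suggestion: async () => 'dismissed',
  recall: async (i) => `results for ${i.query}`,
  complete_task: async (i) => `completed ${i.id}`, // step 2: the new row
  unknown: async () => "sorry, I didn't catch that",
};

function run(intent: Intent): Promise<string> {
  // The cast is needed because TypeScript cannot correlate the key with the union member.
  const handler = ACTIONS[intent.kind] as (i: Intent) => Promise<string>;
  return handler(intent);
}
```

With the mapped type in place, adding a variant to the union without a matching row fails to compile, which keeps the two edits in sync.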

Limitations

macOS only today. Everything else works on Linux, but the audio capture script depends on say, which is Apple-only. PRs welcome.
English only with whisper-cpp small.en. Switch to a multilingual model for other languages.
No interruption handling: if you start recording while say is still speaking, the recording gets garbled. We may add a quick mute in the next iteration.

Cost

Per voice call:

STT: 0 tokens (local).
Intent: ~600 input + ~50 output Haiku tokens. ~$0.0002/call.
Action: usually 0 tokens (direct DB write). Some actions (recall) embed locally; that’s also free.

Even at a hundred voice calls a day, you’re at roughly $0.02, well under $0.05.
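The arithmetic behind those figures can be checked with a short sketch. The per-token prices below are an assumption: they are the Claude 3 Haiku list prices ($0.25 per million input tokens, $1.25 per million output tokens), and the document does not say which Haiku version or price sheet applies. The constant and function names are hypothetical.

```typescript
// Assumed prices in USD per million tokens (Claude 3 Haiku list prices).
// Swap in the prices for the model you actually call.
const INPUT_USD_PER_MTOK = 0.25;
const OUTPUT_USD_PER_MTOK = 1.25;

// Cost of one intent-classification call, in USD.
function costPerCall(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) / 1e6;
}

// ~600 input + ~50 output tokens, as estimated above.
const perCall = costPerCall(600, 50); // about 0.0002 USD
const perDay = perCall * 100;         // about 0.02 USD at a hundred calls a day
```

Under these assumed prices, input tokens account for roughly 70% of the per-call cost, so trimming the prompt is the main lever if the cost ever matters.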