Voice

Local STT, single-shot intents, no wake word, no cloud transcription.

Updated · 2026-05-28

pnpm voice opens a hold-to-talk overlay, transcribes locally, classifies the intent, and runs an action. No wake word. No continuous listening. No cloud STT.

How it works

[mic capture]  ──▶  [local STT]  ──▶  [intent classifier]  ──▶  [action]  ──▶  [say back]
   sox                whisper.cpp        Haiku · Zod schema       sqlite       say
  1. Capture: sox records mono 16 kHz WAV while you hold the button (or the global shortcut).
  2. Transcribe: whisper-cpp runs small.en against the WAV. Sub-1s on Apple Silicon.
  3. Classify: a short Haiku call with a Zod schema that pins the output shape:
    // `kind` is the discriminator: each variant pins it with z.literal.
    z.discriminatedUnion('kind', [
      z.object({ kind: z.literal('add_task'), title: z.string(), dueAt: z.string().optional() }),
      z.object({ kind: z.literal('dismiss_suggestion'), id: z.string() }),
      z.object({ kind: z.literal('recall'), query: z.string() }),
      z.object({ kind: z.literal('unknown') }),
    ])
    
  4. Act: run the action against the local DB. Conflict detection, side effects, the works.
  5. Confirm: say reads back a one-sentence confirmation. “Added: deep work block, 4 to 4:30 pm tomorrow. Conflicts with finance review.”
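
The schema in step 3 matters because model output is untrusted text. Below is a dependency-free sketch of what the union enforces. The real code presumably calls the Zod schema's safeParse instead; parseIntent and its fall-back-to-unknown behaviour are assumptions for illustration, not taken from the Mayva source.

```typescript
// Hand-written mirror of the union from step 3, for illustration only.
type Intent =
  | { kind: 'add_task'; title: string; dueAt?: string }
  | { kind: 'dismiss_suggestion'; id: string }
  | { kind: 'recall'; query: string }
  | { kind: 'unknown' };

const isString = (v: unknown): v is string => typeof v === 'string';

// Hypothetical helper: turn raw classifier output into a valid Intent,
// falling back to { kind: 'unknown' } instead of throwing.
function parseIntent(raw: string): Intent {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { kind: 'unknown' }; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return { kind: 'unknown' };
  const o = data as Record<string, unknown>;

  switch (o.kind) {
    case 'add_task': {
      const title = o.title;
      const dueAt = o.dueAt;
      if (isString(title) && (dueAt === undefined || isString(dueAt))) {
        return dueAt === undefined
          ? { kind: 'add_task', title }
          : { kind: 'add_task', title, dueAt };
      }
      break;
    }
    case 'dismiss_suggestion': {
      const id = o.id;
      if (isString(id)) return { kind: 'dismiss_suggestion', id };
      break;
    }
    case 'recall': {
      const query = o.query;
      if (isString(query)) return { kind: 'recall', query };
      break;
    }
  }
  // Unrecognized kind, or a known kind with missing or mistyped fields.
  return { kind: 'unknown' };
}
```

A reply that names a known kind but omits a required field (say, dismiss_suggestion without an id) ends up as unknown, so step 4 never runs with a half-filled intent.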

Total round trip from release to confirm: ~2 seconds on a clean cache.
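
One way to picture the five stages is a single function that awaits each in turn. This is a sketch, not the actual implementation: VoiceStages, runVoiceOnce, and every method name are hypothetical, and the stages are injected so the example stays independent of sox, whisper-cpp, the Haiku client, the database, and say.

```typescript
// Hypothetical stage interface; I is whatever intent type the classifier returns.
interface VoiceStages<I> {
  capture(): Promise<string>;                     // resolves to a WAV path on button release
  transcribe(wavPath: string): Promise<string>;   // local STT
  classify(transcript: string): Promise<I>;       // intent classifier
  act(intent: I): Promise<string>;                // returns the confirmation sentence
  speak(sentence: string): Promise<void>;         // reads the confirmation back
}

interface VoiceResult {
  confirmation: string;
  elapsedMs: number; // release-to-confirm time
}

async function runVoiceOnce<I>(stages: VoiceStages<I>): Promise<VoiceResult> {
  const wavPath = await stages.capture();
  const releasedAt = Date.now(); // the clock starts when the button is released
  const transcript = await stages.transcribe(wavPath);
  const intent = await stages.classify(transcript);
  const confirmation = await stages.act(intent);
  await stages.speak(confirmation);
  return { confirmation, elapsedMs: Date.now() - releasedAt };
}
```

Because each stage is awaited before the next one starts, the round-trip time is simply the sum of the stage latencies after release.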

Backends

pnpm voice checks for transcription backends in this order:

Backend         Notes
whisper-cpp     Fastest. The default. brew install whisper-cpp.
openai-whisper  Python. pip install openai-whisper. Fallback if no whisper-cpp on PATH.
mlx-whisper     Apple Silicon optimized. Fastest on M-series. pip install mlx-whisper.

You can pin the backend with VOICE_BACKEND=mlx pnpm voice.
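
The selection logic might look roughly like the sketch below. The probe order comes from the table above; pickBackend, PIN_ALIASES, and the isInstalled callback are hypothetical, and only the pin value mlx is documented (the other two alias keys are guesses).

```typescript
type Backend = 'whisper-cpp' | 'openai-whisper' | 'mlx-whisper';

// Probe order as listed in the backend table.
const PROBE_ORDER: Backend[] = ['whisper-cpp', 'openai-whisper', 'mlx-whisper'];

// Assumed mapping from VOICE_BACKEND values to backends. Only "mlx" is documented.
const PIN_ALIASES: Record<string, Backend> = {
  'whisper-cpp': 'whisper-cpp',
  'openai-whisper': 'openai-whisper',
  mlx: 'mlx-whisper',
};

function pickBackend(
  pin: string | undefined,                 // e.g. process.env.VOICE_BACKEND
  isInstalled: (b: Backend) => boolean,    // e.g. a PATH lookup
): Backend {
  if (pin !== undefined && pin !== '') {
    const pinned = PIN_ALIASES[pin];
    if (pinned === undefined) throw new Error(`Unknown VOICE_BACKEND: ${pin}`);
    if (!isInstalled(pinned)) throw new Error(`${pinned} is pinned but not installed`);
    return pinned; // a pin skips probing entirely
  }
  const found = PROBE_ORDER.find(isInstalled); // first installed backend wins
  if (found === undefined) throw new Error('No transcription backend found');
  return found;
}
```

Failing loudly on an unknown or missing pinned backend, rather than silently falling back, is a design choice of this sketch; the real script may behave differently.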

Why single-shot

Continuous listening is great for products that want to be ambient. Mayva isn’t. Two problems with always-on:

  • Privacy gradient: once you’re listening continuously, every conversation that drifts past the mic gets evaluated by a model. We don’t want the agent to know about conversations you didn’t address to it.
  • Cost gradient: every transcribed second ends up as tokens in a classifier call. Hold-to-talk caps cost at “what you intended to say.”

The trade-off: you have to push a button. We’re fine with that.

Customizing the intents

packages/desktop/src/voice/intents.ts is one file. The schema is at the top; the action map is at the bottom:

// Each handler receives only its own variant, so i.id and i.query type-check.
const ACTIONS: { [K in Intent['kind']]: (i: Extract<Intent, { kind: K }>) => Promise<string> } = {
  add_task: async (i) => addTask(i).then(confirm),
  dismiss_suggestion: async (i) => dismissSuggestion(i.id).then(() => 'dismissed'),
  recall: async (i) => searchVector(i.query).then(toSentence),
  unknown: async () => 'sorry, I didn\'t catch that',
};

Add a new intent: add a variant to the union, add a row to the action map, ship.
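
As an illustration, here is what adding an intent could look like. The snooze_task variant, snoozeTask, and dispatch are invented for this example, the helpers are stubs, and a plain TypeScript union stands in for the type inferred from the Zod schema so the sketch has no dependencies. In intents.ts the matching schema variant would be a z.object with kind: z.literal('snooze_task').

```typescript
// Trimmed stand-in for the inferred Intent type, plus one hypothetical variant.
type Intent =
  | { kind: 'recall'; query: string }
  | { kind: 'unknown' }
  | { kind: 'snooze_task'; id: string; minutes: number }; // new

// Each handler sees only its own variant; a missing row fails to compile.
type ActionMap = {
  [K in Intent['kind']]: (i: Extract<Intent, { kind: K }>) => Promise<string>;
};

// Stubs standing in for the real vector search and DB write.
const searchVector = async (query: string): Promise<string[]> => [`note about ${query}`];
const snoozeTask = async (id: string, minutes: number) => ({ id, minutes });

const ACTIONS: ActionMap = {
  recall: (i) => searchVector(i.query).then((hits) => hits[0] ?? 'nothing found'),
  unknown: async () => "sorry, I didn't catch that",
  snooze_task: (i) =>
    snoozeTask(i.id, i.minutes).then((t) => `snoozed ${t.id} for ${t.minutes} minutes`),
};

// The cast is needed because TypeScript cannot correlate the looked-up
// handler with the specific variant of `intent` at this call site.
function dispatch(intent: Intent): Promise<string> {
  const handler = ACTIONS[intent.kind] as (i: Intent) => Promise<string>;
  return handler(intent);
}
```

With the mapped type, forgetting the action-map row for a new variant is a compile error rather than a runtime surprise.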

Limitations

  • macOS only today. Everything except the audio capture script works on Linux; that script depends on say, which is Apple-only. PRs welcome.
  • English only with whisper-cpp small.en. Switch to a multilingual model for other languages.
  • No interruption handling: if you start recording while say is still speaking, the recording will be garbled. We may add a quick mute in the next iteration.

Cost

Per voice call:

  • STT: 0 tokens (local).
  • Intent: ~600 input + ~50 output Haiku tokens. ~$0.0002/call.
  • Action: usually 0 tokens (direct DB write). Some actions (recall) embed locally; that’s also free.

Even at a hundred voice calls a day, you’re under $0.05.
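
The arithmetic behind those figures can be checked in a few lines. The per-token prices below are an assumption ($0.25 per million input tokens, $1.25 per million output tokens), chosen because they reproduce the ~$0.0002 figure; they are not stated in this page, so check current pricing for the model you actually use.

```typescript
// Assumed prices in USD per million tokens; not taken from this page.
const INPUT_USD_PER_MTOK = 0.25;
const OUTPUT_USD_PER_MTOK = 1.25;

// Cost of one intent-classification call.
function intentCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) / 1e6;
}

const perCallUsd = intentCostUsd(600, 50); // (150 + 62.5) / 1e6, roughly 0.0002
const perDayUsd = perCallUsd * 100;        // roughly 0.02 for a hundred calls
```

Under these assumed rates a hundred calls a day comes to about two cents, comfortably below the $0.05 ceiling.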