Voice
Local STT, single-shot intents, no wake word, no cloud transcription.
pnpm voice opens a hold-to-talk overlay, transcribes locally, classifies the intent, and runs an action. No wake word. No continuous listening. No cloud STT.
How it works
[mic capture] ──▶ [local STT] ──▶ [intent classifier] ──▶ [action] ──▶ [say back]
     sox          whisper.cpp     Haiku · Zod schema       sqlite         say
- Capture: sox records mono 16 kHz WAV while you hold the button (or the global shortcut).
- Transcribe: whisper-cpp runs small.en against the WAV. Sub-1s on Apple Silicon.
- Classify: A short Haiku call with a Zod schema that pins the output shape:
    z.discriminatedUnion('kind', [
      z.object({ kind: z.literal('add_task'), title: z.string(), dueAt: z.string().optional() }),
      z.object({ kind: z.literal('dismiss_suggestion'), id: z.string() }),
      z.object({ kind: z.literal('recall'), query: z.string() }),
      z.object({ kind: z.literal('unknown') }),
    ])
- Act: Run the action against the local DB. Conflict detection, side effects, the works.
- Confirm: say reads back a one-sentence confirmation. “Added: deep work block, 4 to 4:30 pm tomorrow. Conflicts with finance review.”
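The Classify step depends on the schema rejecting malformed model output. As a rough, dependency-free illustration of what that schema enforces, the sketch below narrows untrusted JSON by hand. The real code uses Zod; `parseIntent` and the fall-back-to-`unknown` behavior are assumptions made for this example, not Mayva's implementation.

```typescript
// Hypothetical stand-in for the Zod schema: the same four variants as plain types.
type Intent =
  | { kind: 'add_task'; title: string; dueAt?: string }
  | { kind: 'dismiss_suggestion'; id: string }
  | { kind: 'recall'; query: string }
  | { kind: 'unknown' };

// Narrow untrusted JSON from the model to an Intent.
// Assumption: anything that does not match a variant degrades to 'unknown'.
function parseIntent(raw: unknown): Intent {
  if (typeof raw !== 'object' || raw === null) return { kind: 'unknown' };
  const r = raw as Record<string, unknown>;
  switch (r.kind) {
    case 'add_task': {
      const title = r.title;
      const dueAt = r.dueAt;
      if (typeof title !== 'string') break;
      if (dueAt === undefined) return { kind: 'add_task', title };
      if (typeof dueAt !== 'string') break;
      return { kind: 'add_task', title, dueAt };
    }
    case 'dismiss_suggestion': {
      const id = r.id;
      if (typeof id === 'string') return { kind: 'dismiss_suggestion', id };
      break;
    }
    case 'recall': {
      const query = r.query;
      if (typeof query === 'string') return { kind: 'recall', query };
      break;
    }
  }
  return { kind: 'unknown' };
}
```

Because every failure path ends in `{ kind: 'unknown' }`, the action layer never sees a half-formed intent.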
Total round-trip from release-to-confirm: ~2 seconds on a clean cache.
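The flow above can be sketched as a single pass through five injected stages. The `Stages` interface and `runVoiceOnce` are hypothetical names for illustration; the real stage signatures are not shown in this page.

```typescript
// Hypothetical stage signatures, assumed for this sketch.
interface Stages {
  capture: () => Promise<string>; // resolves to a WAV file path
  transcribe: (wavPath: string) => Promise<string>; // local STT, returns text
  classify: (text: string) => Promise<{ kind: string }>; // schema-validated intent
  act: (intent: { kind: string }) => Promise<string>; // returns the confirmation sentence
  speak: (sentence: string) => Promise<void>; // reads the confirmation aloud
}

// One hold-to-talk press maps to exactly one pass through the stages.
async function runVoiceOnce(s: Stages): Promise<string> {
  const wav = await s.capture();
  const text = await s.transcribe(wav);
  const intent = await s.classify(text);
  const confirmation = await s.act(intent);
  await s.speak(confirmation);
  return confirmation;
}
```

Injecting the stages keeps each one swappable, which is one way to support several transcription backends behind the same flow.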
Backends
pnpm voice checks for transcription backends in this order:
| Backend | Notes |
|---|---|
| whisper-cpp | Fastest. The default. brew install whisper-cpp. |
| openai-whisper | Python. pip install openai-whisper. Fallback if no whisper-cpp on PATH. |
| mlx-whisper | Apple Silicon optimized. Fastest on M-series. pip install mlx-whisper. |
You can pin the backend with VOICE_BACKEND=mlx pnpm voice.
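A minimal sketch of that selection logic, assuming the probe order is the table order and that a pinned value overrides probing. `pickBackend` and the alias map are hypothetical; only the `mlx` value appears in this page, so the other aliases are guesses.

```typescript
type Backend = 'whisper-cpp' | 'openai-whisper' | 'mlx-whisper';

// Probe order as listed in the backends table.
const PROBE_ORDER: Backend[] = ['whisper-cpp', 'openai-whisper', 'mlx-whisper'];

// Assumption: VOICE_BACKEND values map to backends like this.
// Only "mlx" is documented; the full names are accepted here as a convenience.
const ALIASES: Partial<Record<string, Backend>> = {
  mlx: 'mlx-whisper',
  'whisper-cpp': 'whisper-cpp',
  'openai-whisper': 'openai-whisper',
  'mlx-whisper': 'mlx-whisper',
};

// Return the pinned backend if one is given, else the first installed one in probe order.
function pickBackend(installed: ReadonlySet<Backend>, pinned?: string): Backend {
  if (pinned !== undefined && pinned !== '') {
    const wanted = ALIASES[pinned];
    if (wanted === undefined) throw new Error(`Unknown VOICE_BACKEND: ${pinned}`);
    if (!installed.has(wanted)) throw new Error(`${wanted} is pinned but not installed`);
    return wanted;
  }
  const found = PROBE_ORDER.find((b) => installed.has(b));
  if (found === undefined) throw new Error('No transcription backend found');
  return found;
}
```

In practice `pinned` would come from `process.env.VOICE_BACKEND`, and `installed` from checking which binaries or Python packages are available.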
Why single-shot
Continuous listening is great for products that want to be ambient. Mayva isn’t. Two problems with always-on:
- Privacy gradient: once you’re listening continuously, every conversation that drifts past the mic gets evaluated by a model. We don’t want the agent to know about conversations you didn’t address to it.
- Cost gradient: transcription is local, but every transcribed utterance still goes to the classifier as a tokenized call. Hold-to-talk caps cost at “what you intended to say.”
The trade-off: you have to push a button. We’re fine with that.
Customizing the intents
packages/desktop/src/voice/intents.ts is one file. The schema is at the top; the action map is at the bottom:
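// Each handler receives only the variant that matches its key,
// so i.id and i.query type-check without casts.
type Actions = { [K in Intent['kind']]: (i: Extract<Intent, { kind: K }>) => Promise<string> };

const ACTIONS: Actions = {
  add_task: async (i) => addTask(i).then(confirm),
  dismiss_suggestion: async (i) => dismissSuggestion(i.id).then(() => 'dismissed'),
  recall: async (i) => searchVector(i.query).then(toSentence),
  unknown: async () => "sorry, I didn't catch that",
};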
Add a new intent: add a variant to the union, add a row to the action map, ship.
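As an illustration of those two steps, here is a self-contained sketch that adds a hypothetical `snooze_task` intent. Plain TypeScript types stand in for the Zod schema, the handler bodies are stubs rather than the real `addTask`, `dismissSuggestion`, or `searchVector`, and `dispatch` is an assumed helper.

```typescript
// Plain-type stand-in for the Zod union. snooze_task is a hypothetical new variant.
type Intent =
  | { kind: 'add_task'; title: string; dueAt?: string }
  | { kind: 'dismiss_suggestion'; id: string }
  | { kind: 'recall'; query: string }
  | { kind: 'snooze_task'; id: string; minutes: number } // step 1: new variant
  | { kind: 'unknown' };

// Each handler receives only the variant matching its key.
type Actions = { [K in Intent['kind']]: (i: Extract<Intent, { kind: K }>) => Promise<string> };

// Stub handlers: they only build the confirmation sentence.
const ACTIONS: Actions = {
  add_task: async (i) => `added: ${i.title}`,
  dismiss_suggestion: async (i) => `dismissed ${i.id}`,
  recall: async (i) => `searching for ${i.query}`,
  snooze_task: async (i) => `snoozed ${i.id} for ${i.minutes} minutes`, // step 2: new row
  unknown: async () => "sorry, I didn't catch that",
};

// Dispatch on the discriminant. The cast is needed because TypeScript cannot
// correlate ACTIONS[intent.kind] with the narrowed variant on its own.
function dispatch(intent: Intent): Promise<string> {
  const handler = ACTIONS[intent.kind] as (i: Intent) => Promise<string>;
  return handler(intent);
}
```

With the mapped `Actions` type, forgetting the new row in the action map is a compile error rather than a runtime surprise.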
Limitations
- macOS only today. Linux works for everything, but the audio capture script depends on say, which is Apple-only. PRs welcome.
- English only with whisper-cpp small.en. Switch to a multilingual model for other languages.
- No interruption handling: if you start the recording while say is still speaking, it’ll get garbled. We may add a quick mute on the next iteration.
Cost
Per voice call:
- STT: 0 tokens (local).
- Intent: ~600 input + ~50 output Haiku tokens. ~$0.0002/call.
- Action: usually 0 tokens (direct DB write). Some actions (recall) embed locally; that’s also free.
Even at a hundred voice calls a day, that comes to about $0.02, comfortably under $0.05.
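The arithmetic behind those figures can be checked with a few lines. The per-million-token prices below are assumptions chosen for illustration and are not stated in this page; check current Haiku pricing before relying on them.

```typescript
// Assumed prices in USD per million tokens (not from this page).
const INPUT_USD_PER_MTOK = 0.25;
const OUTPUT_USD_PER_MTOK = 1.25;

// Cost of one intent-classification call, given its token counts.
function costPerCall(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) / 1e6;
}

const perCall = costPerCall(600, 50); // roughly 0.0002 USD under the assumed prices
const perDay = perCall * 100; // roughly 0.02 USD for a hundred calls
```

Under these assumed prices the input tokens account for most of the cost, so trimming the classifier prompt would lower the per-call figure more than shortening its output.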