
I gave Claude Code a voice — and why nothing else worked


claude-voice is open source: github.com/Null-Phnix/claude-voice

The half-finished loop

When Anthropic shipped voice mode for Claude Code in March 2026, I was genuinely excited. Hold spacebar, talk, Claude transcribes and runs. No more typing every command. But after using it for a few hours, I realized the loop was only half complete.

I could talk to Claude. Claude couldn't talk back. Every response was still silent text scrolling in my terminal. The voice was one-way. It felt like calling someone who can hear you perfectly but only responds by texting.

What I tried

ElevenLabs — dead on arrival

First instinct: premium cloud TTS. Signed up for ElevenLabs, created an API key scoped to Text to Speech, fired the first request. Got a 402 back. Free tier can't use library voices through the API — only through their web playground. You need a paid plan to call any voice endpoint programmatically.

If I'm building something I want other people to install for free, I can't require them to sign up for a paid TTS service. Dead end.

VoiceMode — the 100-file overkill

VoiceMode is the most popular voice solution for Claude Code right now at 893 GitHub stars. It does two-way voice. It also has: a DJ mode with music ducking, sound fonts for tool events, a team connect system with presence detection, a credential store, systemd service templates, a FastAPI Kokoro wrapper, and 100+ Python source files.

It's impressive engineering. But I didn't want a voice platform. I wanted Claude to read its responses out loud. That's one feature. I didn't need the other ninety-nine.

VoiceMode also runs as an MCP server — meaning Claude has to explicitly decide to use the voice tool for each response. I wanted it automatic. Every response, spoken, no extra steps.

OpenAI TTS — works but wrong

OpenAI's TTS API sounds great and is straightforward to call. But it costs money per character, and every response gets sent to OpenAI's servers. I run local infrastructure specifically to avoid cloud dependencies. Sending every Claude response to a third-party TTS API defeats the entire philosophy.

Piper TTS — fast but robotic

Piper was my first local attempt. Runs on CPU, generates audio fast, easy to pipe through. But the voice quality is noticeably robotic. Fine for accessibility, not great for something you want to listen to for hours while coding.

What actually worked: Kokoro

Kokoro is an 82M parameter TTS model that runs on CPU and sounds surprisingly natural. It was already half-installed on my system from a previous experiment. I fixed a missing dependency, ran a test sentence, and immediately heard the difference. It wasn't perfect, but it was good enough that I wanted to keep listening.

Kokoro ships with 50+ voices across American English, British English, Japanese, Chinese, and more. I auditioned the American female voices and picked af_heart — warm and expressive without being distracting. Twelve of the voices are good enough for daily use.

The feature nobody built: word highlighting

Here's what surprised me most during research. Every existing TTS tool for Claude Code — VoiceMode, claude-code-tts, Claude-to-Speech — just plays audio. That's it. The text scrolls, the voice speaks, and you have to track where it is yourself.

I wanted karaoke-style highlighting. The current word lit up in the terminal, the words around it slightly brighter, everything behind dimmed. A progress bar showing how far through the response we are. A sliding window so long responses don't overflow.

This turned out to be harder than the TTS itself. Claude Code's terminal renderer (the React-based Ink library) fights with ANSI cursor movement. My first attempt with a multi-line panel left ghost lines everywhere. My second attempt with box-drawing characters overwrote Claude's output. The third attempt — a single-line karaoke renderer that only uses \r and one \033[1A — finally worked cleanly.

The trick was writing directly to /dev/tty instead of stdout. Claude Code's Stop hook captures stdout, but /dev/tty bypasses it and writes straight to the terminal. That's how the highlighting appears below the chat bar without interfering with Claude's own rendering.
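The same idea in miniature (the stderr fallback is my addition for environments without a controlling terminal):

```python
import sys

def tty_writer():
    """Get a stream that reaches the terminal even when stdout is captured.

    Claude Code's Stop hook captures stdout, so UI frames go to /dev/tty
    instead; fall back to stderr when no controlling terminal exists
    (e.g. under CI or a detached pipe).
    """
    try:
        return open("/dev/tty", "w")
    except OSError:
        return sys.stderr
```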

How it works

claude-voice is a single Python file that installs as a Claude Code Stop hook. After every response:

1. The hook receives the assistant message as JSON on stdin.

2. It strips markdown, code blocks, URLs, tables — anything that shouldn't be spoken. It skips responses that are mostly code. It runs pronunciation fixes so terms like CLI, API, JSON, nginx, and kubectl are spoken correctly.

3. It generates audio for all sentences with Kokoro, then concatenates them into one seamless buffer. No gaps between sentences.

4. It plays the audio while rendering word-by-word highlighting to /dev/tty. A background thread listens for keypresses — any key interrupts playback immediately.

5. When done, it clears the display and restores the terminal. No artifacts.
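Step 2 can be sketched in a few lines. Everything here — the pronunciation map, the exact regexes — is my illustration, not the actual speak.py code:

```python
import re

# Hypothetical pronunciation table -- the real list in speak.py is
# longer and its spellings may differ.
PRONOUNCE = {
    "CLI": "C L I",
    "API": "A P I",
    "JSON": "jay son",
    "nginx": "engine x",
    "kubectl": "kube control",
}

FENCE = "`" * 3  # triple backtick, built indirectly to keep this sketch paste-safe

def sanitize(text: str) -> str:
    """Reduce a markdown response to speakable prose: drop fenced code,
    inline code, URLs, and table rows, then apply pronunciation fixes."""
    text = re.sub(FENCE + r".*?" + FENCE, "", text, flags=re.DOTALL)
    text = re.sub(r"`[^`]*`", "", text)                             # inline code
    text = re.sub(r"https?://\S+", "", text)                        # URLs
    text = re.sub(r"^\s*\|.*\|\s*$", "", text, flags=re.MULTILINE)  # table rows
    text = re.sub(r"[*_#>]+", "", text)                             # markdown markers
    for term, spoken in PRONOUNCE.items():
        text = re.sub(rf"\b{re.escape(term)}\b", spoken, text)
    return re.sub(r"\s+", " ", text).strip()
```

So a response like "Run \`ls\` in the CLI, docs at https://example.com" comes out as plain speakable text with the acronym spelled out and the code and URL gone.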

What I learned

The biggest lesson was that developer tools don't need to be platforms. VoiceMode is a platform — it handles voice, music, teams, services, credentials. claude-voice is a feature. One specific feature, done well, in one file.

The second lesson was that local TTS has gotten genuinely good. Kokoro at 82M parameters sounds better than cloud TTS services did two years ago. The model is 350MB. It runs on any CPU. There's no reason to send audio generation to the cloud anymore for this use case.

The third lesson: terminal UX is underrated. The karaoke highlighting is the feature that makes people watch the demo twice. It's not technically complex — it's a sliding window with three ANSI colors. But it transforms "audio plays in background" into something you can see and follow. The visual is the hook.

Try it

pip install kokoro sounddevice numpy
git clone https://github.com/Null-Phnix/claude-voice
cd claude-voice
python speak.py setup
python speak.py demo

Source: github.com/Null-Phnix/claude-voice.
