Open Source · Apple Silicon · Zero Cloud

Give your code a voice

Clone any voice from a 15-second YouTube clip. Run it locally on your Mac. Hear Claude Code speak every response — or use the API from anything.

20 voices included · 4 backends · ~10 GB peak memory · 0 cloud calls
Five minutes to your first voice

The setup script checks your hardware, installs dependencies, walks you through cloning a voice from YouTube, and starts the server.

git clone https://github.com/adrianwedd/afterwords.git
cd afterwords
bash setup.sh

Setup installs the afterwords command to your PATH and starts the server. If Claude Code is installed, it also wires a Stop hook so every response is spoken aloud. Without it, you get a standalone TTS API at localhost:7860.

20 demo clips from the 110+ voice gallery — see backend comparison below for flagship voices across all 6 models

Each says “You are absolutely right. Your Claude Code session could sound like me.” — generated locally on a 32 GB M1.

Add your own:

afterwords clone "https://youtube.com/watch?v=..." myvoice 30
Same voice, two model sizes

Three flagship voices, each synthesized by both Qwen3-TTS sizes. Click a tab to hear the difference between the 0.6B (default, fastest) and 1.7B (higher quality, slower) variants on the same 15-second reference.

Picard · Patrick Stewart, Star Trek
Galadriel · Cate Blanchett, LOTR
Attenborough · David Attenborough, BBC Earth

Chatterbox and VoxCPM are also loaded as backends, but their voice-cloning fidelity in this integration is still being verified — samples are tracked in #14 and will be added once we’re confident they’re cloning the reference rather than producing default voices.

Input meets output

Claude Code integration

You speak → /voice → Claude responds → Stop hook → TTS server → Speaker

Standalone API

Any client → GET /synthesize → TTS server → WAV audio

Programmatic cloning (--allow-clone)

Upload audio → POST /clone → Denoise + transcribe → Voice palette → POST /synthesize

/voice handles input. This project handles output. Together: voice conversations.

The server ships four MLX backends: Qwen3-TTS (0.6B and 1.7B, 8-bit), Chatterbox (fp16, multilingual), and VoxCPM 1.5 (44.1 kHz). Zero-shot voice cloning — no training. A 15-second reference + transcript = cloned voice on every backend.

Why local? Nothing leaves your machine. No API key, no rate limits, no bill. The voice is yours.
A plain HTTP interface

The server runs on localhost:7860. No authentication. Use it from curl, scripts, other editors, web apps — anything that speaks HTTP. Endpoints marked --allow-clone require launching the server with that flag.

GET /health

Server status, loaded voices, and readiness.

curl localhost:7860/health | jq .
{
  "status": "ok",
  "model": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit",
  "backend": "mlx",
  "model_loaded": true,
  "ready": true,
  "voices": ["attenborough", "attenborough-chatterbox", "attenborough-qwen3-17b", "attenborough-voxcpm-15", "audrey", "..."],
  "default_voice": "galadriel",
  "loaded_backends": {
    "qwen3-0.6b":  {"loaded": true, "voice_count": 45, "sample_rate": 24000, "display_name": "Qwen3-TTS 0.6B", "supported_langs": ["en","zh","ja","ko","es","fr","de","it","pt","ru"]},
    "qwen3-1.7b":  {"loaded": true, "voice_count": 61, "sample_rate": 24000, "display_name": "Qwen3-TTS 1.7B", "supported_langs": ["en","zh","ja","ko","es","fr","de","it","pt","ru"]},
    "chatterbox":  {"loaded": true, "voice_count": 3,  "sample_rate": 24000, "display_name": "Chatterbox (fp16, multilingual)", "supported_langs": ["en","es","fr","de","it","pt","zh","ja","ko"]},
    "voxcpm-1.5":  {"loaded": true, "voice_count": 3,  "sample_rate": 44100, "display_name": "VoxCPM 1.5", "supported_langs": ["en","zh"]},
    "voxtral":     {"loaded": true, "voice_count": 0,  "sample_rate": 24000, "display_name": "Voxtral 4B", "supported_langs": ["en","fr","de","es","it","pt","nl","ru","zh","ja","ko","ar","hi"]},
    "soprotts":    {"loaded": true, "voice_count": 0,  "sample_rate": 24000, "display_name": "SoproTTS", "supported_langs": ["en"]}
  }
}
200 OK
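The loaded_backends map is enough for clients to do their own routing. A minimal sketch in Python — the helper name and the trimmed sample data are illustrative, not part of the API:

```python
# Illustrative helper: list loaded backends that support a language,
# using the shape of the /health "loaded_backends" field shown above.
def backends_for_lang(loaded_backends: dict, lang: str) -> list[str]:
    return [
        name for name, info in loaded_backends.items()
        if info.get("loaded") and lang in info.get("supported_langs", [])
    ]

# Trimmed sample of a /health response (supported_langs abbreviated)
health = {
    "qwen3-0.6b": {"loaded": True, "supported_langs": ["en", "zh", "ja"]},
    "voxcpm-1.5": {"loaded": True, "supported_langs": ["en", "zh"]},
    "soprotts":   {"loaded": True, "supported_langs": ["en"]},
}
print(backends_for_lang(health, "zh"))  # ['qwen3-0.6b', 'voxcpm-1.5']
```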
GET /synthesize

Generate speech from text. Returns 16-bit PCM WAV audio.

text required — string, max 5000 chars
voice optional — defaults to galadriel. Any name from /health
lang optional — BCP-47 language code, defaults to en. Must be in the voice's backend's supported_langs (see /health). If unsupported and the voice declares a family, the server auto-routes to a same-family voice on a backend that supports it.
# Synthesize and play
curl "localhost:7860/synthesize?text=Hello+world&voice=snape" -o out.wav
afplay out.wav

# Non-English (the voice's backend must support the lang, or its family must)
curl "localhost:7860/synthesize?text=Ni+hao&voice=galadriel&lang=zh" -o hi.wav

# Pipe directly to speaker (macOS)
curl -s "localhost:7860/synthesize?text=Testing" | afplay -

Response includes timing headers: X-Synthesis-Time, X-Duration, X-Sample-Rate, X-Backend (the actual backend that synthesized — may differ from the voice's pinned backend if family-routing kicked in).
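Those headers make throughput easy to measure. A hedged sketch — only the header names come from the docs above; the values and helper name are made up:

```python
# Compute a real-time factor from the /synthesize timing headers:
# seconds of audio produced per second of synthesis (>1 is faster than real time).
def realtime_factor(headers: dict) -> float:
    return float(headers["X-Duration"]) / float(headers["X-Synthesis-Time"])

# Example header values (made up)
headers = {"X-Synthesis-Time": "20.0", "X-Duration": "8.0", "X-Backend": "qwen3-0.6b"}
print(round(realtime_factor(headers), 2))  # 0.4
```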

200 audio/wav · 400 unknown voice · 400 lang not supported · 400 text empty / too long · 500 synthesis failed · 503 warming up
CLI · afterwords clone

Clone a new voice from a YouTube clip. Run afterwords reload to load it (no restart needed).

# Interactive (prompts for URL and name)
afterwords clone

# Non-interactive (URL, name, start-second)
afterwords clone "https://youtube.com/watch?v=..." mycustomvoice 30

# Fully automated (skip transcript confirmation)
afterwords clone "https://youtube.com/watch?v=..." mycustomvoice 30 --yes

Each voice is a 700 KB WAV + JSON profile in voices/. Adding voices costs zero extra memory.

POST /synthesize

JSON body version of /synthesize. Supports emotion-based palette lookup for session-cloned voices. Requires --allow-clone.

text required — string, max 5000 chars
voice required — voice name or session ID
emotion optional — selects the matching palette entry (e.g. "cheerful", "serious")
curl -X POST localhost:7860/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "my-session", "emotion": "cheerful"}' \
  -o out.wav
200 audio/wav · 400 unknown voice · 404 --allow-clone not enabled · 503 warming up
POST /clone

Create a voice profile from uploaded audio. Denoises, optionally transcribes via Whisper, and registers the voice for immediate use. Requires --allow-clone.

audio required — WAV file upload (multipart/form-data)
session_id required — groups palette entries (e.g. "my-voice")
emotion optional — tag for this entry, defaults to "neutral"
transcript optional — if omitted, auto-transcribed with Whisper
# Clone a single voice
curl -X POST localhost:7860/clone \
  -F "audio=@sample.wav" \
  -F "session_id=my-voice" \
  -F "emotion=neutral"

# Build a palette with multiple emotions
curl -X POST localhost:7860/clone \
  -F "audio=@cheerful.wav" -F "session_id=my-voice" -F "emotion=cheerful"
curl -X POST localhost:7860/clone \
  -F "audio=@serious.wav" -F "session_id=my-voice" -F "emotion=serious"

Returns voice name, quality rating (rough/developing/good based on duration), session ID, and sequence number. Voices are available immediately — no restart needed.

200 JSON · 400 audio too short · 404 --allow-clone not enabled · 500 clone failed
POST /reload

Rescan voices/*.json and merge new or changed profiles into the live registry — no restart, no synthesis interruption. Add-only and atomic: if any profile fails to validate, no changes commit. Requires --allow-clone.

# After cloning a new voice or editing voices/*.json:
curl -X POST localhost:7860/reload | jq .

# Or via the CLI wrapper:
afterwords reload

Returns {"status":"ok","reloaded":[names...],"errors":[]} on success or {"status":"failed","errors":[{file, error}]} on atomic abort. Voices removed from disk are NOT dropped — use DELETE /session/{id} for that.

200 OK · 404 --allow-clone not enabled · 500 atomic abort (errors[] populated)
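Because both response shapes are documented above, a client can branch on them directly. A sketch — the function name is illustrative:

```python
# Summarize a POST /reload response using the documented shapes:
#   {"status": "ok", "reloaded": [...], "errors": []}
#   {"status": "failed", "errors": [{"file": ..., "error": ...}]}
def summarize_reload(resp: dict) -> str:
    if resp.get("status") == "ok":
        return f"reloaded {len(resp['reloaded'])} voice(s)"
    bad = ", ".join(e["file"] for e in resp.get("errors", []))
    return f"reload aborted, fix: {bad}"

print(summarize_reload({"status": "ok", "reloaded": ["picard"], "errors": []}))
# reloaded 1 voice(s)
```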
DELETE /session/{session_id}

Remove all voice palette entries and files for a session. Also cleans up any backend temp files (e.g. VoxCPM resampled refs). Requires --allow-clone.

curl -X DELETE localhost:7860/session/my-voice
200 OK · 404 --allow-clone not enabled
Different voice, different repo

Drop a .afterwords file in any project root. The hook reads it before each synthesis — no server restart.

echo "galadriel" > ~/work/frontend/.afterwords
echo "snape"     > ~/work/backend/.afterwords
echo "loki"      > ~/fun/side-project/.afterwords

If your project uses multiple agents, map each one to its own voice:

# .afterwords — agent-to-voice mapping
default: data
clara-oswald: clara-oswald
donna-noble: donna-noble
k9: k9

When Claude Code spawns a subagent, the hook reads its agent_type and looks up the matching voice. Falls back to default: if no match.

One voice, many moods

Clone the same speaker multiple times with different emotional deliveries. The server groups them into a palette by session ID.

# Clone three emotions from the same speaker
curl -X POST localhost:7860/clone \
  -F "audio=@neutral.wav" -F "session_id=narrator" -F "emotion=neutral"
curl -X POST localhost:7860/clone \
  -F "audio=@cheerful.wav" -F "session_id=narrator" -F "emotion=cheerful"
curl -X POST localhost:7860/clone \
  -F "audio=@serious.wav" -F "session_id=narrator" -F "emotion=serious"

# Synthesize with a specific emotion
curl -X POST localhost:7860/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text": "Great news!", "voice": "narrator", "emotion": "cheerful"}' \
  -o out.wav

Each palette entry gets a quality rating based on clip duration: rough (<5s), developing (5–15s), good (15s+). If no emotion match is found, the server falls back to the best-quality entry for that session. Clean up with DELETE /session/{id}.
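The duration thresholds above reduce to a simple rating function — a sketch of the documented thresholds, not the server's code:

```python
def quality_rating(duration_s: float) -> str:
    # Thresholds from the docs: rough (<5s), developing (5-15s), good (15s+)
    if duration_s < 5:
        return "rough"
    if duration_s < 15:
        return "developing"
    return "good"

print(quality_rating(3), quality_rating(10), quality_rating(20))
# rough developing good
```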

Requires --allow-clone. Palette voices are ephemeral — stored in memory and on disk while the server runs. Use afterwords clone for permanent voices.
One command for everything

The afterwords CLI is added to your PATH during setup. It wraps launchd, health checks, and voice management into a single tool.

afterwords start       # start the server (auto-starts on login)
afterwords stop        # stop the server
afterwords restart     # restart after adding voices
afterwords status      # show health, model, loaded voices
afterwords logs        # tail the server log
afterwords voices      # list available voices
afterwords reload      # pick up new voices without restart
afterwords clone       # clone a new voice from YouTube
afterwords uninstall   # remove service and optionally hooks

The server runs on localhost:7860 and auto-starts on login via macOS launchd. If you prefer to run it manually:

source .venv/bin/activate
python server.py                  # read-only (GET endpoints only)
python server.py --allow-clone    # enables POST /clone, POST /synthesize, DELETE /session
--allow-clone enables the clone and session endpoints and automatically binds to 127.0.0.1 (localhost only) for security.
On a 32 GB Apple Silicon Mac:
Qwen3 0.6B: ~20s / sentence
Qwen3 1.7B: ~35s / sentence
Chatterbox: ~25s / sentence
VoxCPM 1.5: ~30s / sentence
Peak memory: ~10 GB (all 4 loaded)
Adding a voice: 0 extra RAM
What you need:
Hardware: Apple Silicon M1+
Memory: 32 GB+ RAM
Python: 3.11+
Disk: ~2 GB

The setup script installs everything else. Claude Code is optional — use bash setup.sh --server-only for the API without hooks. Either way, you get the afterwords CLI for managing the server.

Qwen3-TTS (Alibaba) · Chatterbox (mlx-community) · VoxCPM (mlx-community) · mlx-audio · MLX (Apple) · Claude Code (Anthropic)

Originally built for SPARK, a robot with an inner life. Full tutorial: Voice Cloning with Qwen3-TTS.