Docs / CLI

Local mode — zero-cloud conversation

Run a fully on-device avatar brain — whisper.cpp + llama.cpp + Supertonic + Silero VAD — with one env-var flip. No API key, no outbound network, ~1.5 GB RAM.

The on-device brain

bithuman-cli[local] is an opt-in extra that swaps the cloud conversation brain (OpenAI Realtime) for an entirely in-process, on-device stack — whisper.cpp + llama.cpp + Supertonic + Silero VAD. No API key, no outbound network, no separate servers. Same bithuman run command, same browser URL, same avatar.

pip install 'bithuman-cli[local]'
export BITHUMAN_API_SECRET=your_api_secret
bithuman pull modern-court-jester
BITHUMAN_LOCAL=1 bithuman run ~/.cache/bithuman/showcase/modern-court-jester.imx
# → open the printed http://127.0.0.1:8088/<CODE> URL in a browser

The [local] extra ships on the CLI sibling wheel (bithuman-cli) for macOS arm64, Linux x86_64, and Linux aarch64; Python 3.10–3.14.

Note bithuman run still pings api.bithuman.ai for avatar credit accounting even in local mode — BITHUMAN_API_SECRET is required (or BITHUMAN_UNMETERED=1 for dev parity testing). Only the conversation brain goes offline.

When to reach for it

  • Privacy-bound deployments — kiosks, healthcare, finance, classroom. Audio never leaves the device.
  • Offline / air-gapped — conference demos on shaky WiFi, edge boxes in stores, field engineers.
  • Eliminate per-minute LLM/TTS spend — pay once for the wheel; conversations are free thereafter.
  • Latency floor below network RTT — when the speed of light to the nearest cloud region is itself the bottleneck.
  • Mobile portability ramp — the same C++ cores (whisper.cpp, llama.cpp, ONNX Runtime) have first-party iOS and Android builds, and run the exact same model files.

The stack

SlotBackend (mobile-portable C++ core)Default modelDiskRAM
STTpywhispercpp → whisper.cpptiny.en77 MB~150 MB
LLMllama-cpp-python → llama.cppQwen 2.5 0.5B-Instruct Q4_K_M400 MB~600 MB
TTSsupertonic → ONNX RuntimeSupertonic 3 (31 languages, voice M1)380 MB~600 MB
VADlivekit-plugins-sileroSilero5 MB~50 MB
Avatarlibessence.imxvaries~300 MB

Total: ~860 MB on disk for the brain plus your avatar file; ~1.5 GB RAM peak. The first run downloads the brain models from HuggingFace into ~/.cache/{huggingface,supertonic} — about 90 seconds, once. Subsequent runs warm-load in under a second.

How it fits together

The local-mode brain is the same livekit-agents worker as the cloud brain — only the stt / llm / tts plugins differ. The avatar transport (AvatarSession), the embedded livekit-server, the browser player, and the lifecycle are all unchanged.

bithuman run avatar.imx                  (Rust CLI, same as cloud)
├─ embedded livekit-server               (same)
├─ libessence avatar runtime             (same)
└─ bithuman.agent                        (Python livekit-agents worker)
   if BITHUMAN_LOCAL=1:                  (← the only branch)
     stt = WhisperSTT(...)
     llm = LlamaCppLLM(...)
     tts = SupertonicTTS(...)
     vad = silero.VAD.load()
   else:
     llm = openai.realtime.RealtimeModel(...)

The plugin classes live in the same namespace as the avatar plugin — livekit.plugins.bithuman.{WhisperSTT, LlamaCppLLM, SupertonicTTS} — so you can also use them in your own custom brain without going through bithuman run. See audio streaming and the Python SDK for the integration pattern.

Tuning

All optional. Defaults are tuned for the smallest models that produce fluent English on a laptop CPU.

Env varDefaultWhat
BITHUMAN_LOCALunset=1 flips the brain to the local stack.
BITHUMAN_LOCAL_WHISPERtiny.enwhisper.cpp model size — tiny.en / base.en (multilingual: tiny, base, small, medium, large-v3-turbo). Larger = more accurate, more RAM, slower.
BITHUMAN_LOCAL_LLMQwen/Qwen2.5-0.5B-Instruct-GGUFHuggingFace repo id of any GGUF chat LLM.
BITHUMAN_LOCAL_LLM_FILEqwen2.5-0.5b-instruct-q4_k_m.ggufGGUF file within the repo.
BITHUMAN_LOCAL_VOICEM1Supertonic preset (M1M5 male, F1F5 female).
BITHUMAN_LOCAL_LANGenSupertonic language. 31 supported: en, ko, ja, es, de, fr, it, pt, ru, zh, hi, ar, nl, pl, tr, vi, id, th, …
BITHUMAN_INSTRUCTIONSshort defaultOverride the system prompt.

Upgrading the LLM for quality

The 0.5B default keeps the entire stack under 1.5 GB RAM. For better reasoning, swap to a bigger GGUF — any HuggingFace *-GGUF repo works:

# 1.5B (~1 GB on disk, ~1.5 GB RAM):
export BITHUMAN_LOCAL_LLM="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
export BITHUMAN_LOCAL_LLM_FILE="qwen2.5-1.5b-instruct-q4_k_m.gguf"

# Llama 3.2 3B (~2 GB on disk, ~3 GB RAM):
export BITHUMAN_LOCAL_LLM="bartowski/Llama-3.2-3B-Instruct-GGUF"
export BITHUMAN_LOCAL_LLM_FILE="Llama-3.2-3B-Instruct-Q4_K_M.gguf"

BITHUMAN_LOCAL=1 bithuman run avatar.imx

Multilingual setup

# Korean: bigger whisper + Korean Supertonic voice
export BITHUMAN_LOCAL_WHISPER=small        # multilingual
export BITHUMAN_LOCAL_LANG=ko
export BITHUMAN_LOCAL_VOICE=F1

BITHUMAN_LOCAL=1 bithuman run avatar.imx

The default LLM (Qwen 2.5) is multilingual; for production multilingual deployments, size it up too (1.5B+ recommended for non-English fluency).

Cloud vs local — side by side

Cloud (default)Local (BITHUMAN_LOCAL=1)
Setupexport OPENAI_API_KEY=...First run downloads ~860 MB once
Per-conversation costOpenAI Realtime pricing$0 (one-time wheel install)
NetworkOpenAI Realtime + LiveKit signalLiveKit signal only (localhost)
PrivacyAudio + transcripts go to OpenAIAudio stays on box
LanguagesOpenAI-supported31 (Supertonic) for TTS; Whisper coverage for STT
RAM~300 MB (just the avatar)~1.5 GB
Latency floor~400–600 ms (network)~800 ms first-audio (compute-bound)
Mobile pathnot applicableiOS / Android via the same .gguf / .bin / .onnx

Both modes use the same Rust CLI, the same brain entrypoint (bithuman.agent), and the same AvatarSession plugin. One env var is the only difference.

Performance (M5 MacBook Pro)

StageLatency (warm)
Pipeline warm load (all four models from cache)717 ms
Whisper STT (4.6 s utterance, tiny.en)131 ms
Qwen 2.5 0.5B (first token / 30-token reply at 96 tok/s)114 ms / 894 ms
Supertonic TTS (10.7 s of speech, RTF 0.16)1,729 ms
End-to-end (STT + LLM + TTS, sentence-blocking)2,754 ms
First audible word (streamed token-by-token + sentence-chunked TTS)~800 ms

Cold start (every-model first download, first run only) is ~90 s. Process warm-up after that is under a second.

What’s NOT in bithuman-cli[local]

  • Avatar generation — you still need an .imx file. Generate one on bithuman.ai (free tier) via POST /v1/agent/generate, or pull one from the showcase (bithuman pull <slug>~/.cache/bithuman/showcase/).
  • Avatar meteringbithuman run still pings api.bithuman.ai for credit accounting. Set BITHUMAN_UNMETERED=1 for dev parity testing; production deployments still need BITHUMAN_API_SECRET.
  • Voice cloning — Supertonic ships voice presets. Custom voice cloning is a hosted service that produces JSON style files; once you have the JSON, inference is local but the cloning step is not.
  • Tool / function calling — the default 0.5B LLM is too small for reliable JSON tool calls. Swap to ≥3B for that.

Troubleshooting

Is local mode actually wired up?

Run bithuman doctor — it verifies the [local] extras are importable and lists the resolved backend versions.

First run feels stuck

It is downloading ~860 MB from HuggingFace. Watch network activity; subsequent runs start in under a second.

BITHUMAN_LOCAL=1 errors with “required the local extras”

You installed bithuman-cli without the [local] extra. Re-install: pip install 'bithuman-cli[local]'.

LLM is too dumb

Bump to a bigger GGUF — see the tuning table above.

TTS pronounces the wrong language

Set BITHUMAN_LOCAL_LANG to match your transcript language. Default en.

Browser shows the avatar but no audio

Check the brain logs in the terminal — there is structured tracing (stt:transcribe duration_ms=…, llm:first_token, tts:synthesize). If stt:transcribe shows empty text, your mic or VAD is the issue; if llm:first_token never fires, the LLM did not load (usually the GGUF file name in BITHUMAN_LOCAL_LLM_FILE is wrong).

See also