At a Glance
| | Essence | Expression |
|---|---|---|
| What it does | Pre-recorded animations with real-time lip-sync | AI-generated facial expressions from any photo |
| Input | .imx model file (generated from your photo/video) | Any face image |
| Runs on | CPU (any device) | GPU (cloud or NVIDIA) or Apple Silicon (M3+) |
| Lip-sync | Yes | Yes |
| Custom gestures | Yes (wave, nod, laugh, etc.) | No |
| Idle animations | Pre-recorded (natural movement loops) | AI-generated micro-movements |
| Session timeout | None (runs 24/7) | 10 min idle (GPU protection) |
| Best for | Kiosks, always-on displays, consistent branding | Dynamic conversations, custom faces, quick prototyping |
Essence
The Essence model uses pre-built avatar animations with real-time lip-sync. You create an agent on bithuman.ai from a photo or video, which generates an .imx model file containing idle loops, talking animations, and gesture videos.
Strengths:
- Runs on CPU only — no GPU required. Works on Raspberry Pi, laptops, edge devices.
- Supports custom gestures (wave, nod, laugh) triggered by keywords or API.
- No idle timeout — sessions run indefinitely. Ideal for 24/7 museum kiosks and lobby displays.
- Consistent, predictable avatar behavior with smooth transitions between states.
Deployment options:
- Cloud Plugin — zero infrastructure; bitHuman hosts the avatar worker
- Self-Hosted CPU — run on your own hardware
- Website Embed — drop-in iframe or chat widget
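Keyword-triggered gestures can be sketched as a simple lookup from words in the agent's speech to gesture names. The mapping and the helper below are hypothetical, for illustration only — the actual trigger call comes from the bitHuman SDK or API, and the gesture names are just the examples listed above (wave, nod, laugh).

```python
# Hypothetical sketch: map keywords in agent speech to Essence gesture
# triggers. The mapping and helper are illustrative; the real trigger
# mechanism is provided by the bitHuman SDK or API.
GESTURE_KEYWORDS = {
    "hello": "wave",
    "goodbye": "wave",
    "yes": "nod",
    "funny": "laugh",
}

def gestures_for(text: str) -> list[str]:
    """Return gestures to trigger for a line of agent speech."""
    seen: list[str] = []
    for word in text.lower().split():
        gesture = GESTURE_KEYWORDS.get(word.strip(".,!?"))
        if gesture and gesture not in seen:
            seen.append(gesture)  # preserve order, avoid duplicates
    return seen

print(gestures_for("Hello there, yes indeed!"))  # ['wave', 'nod']
```

In a real agent you would call this on each outgoing utterance and fire the matching trigger alongside the speech audio.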
Expression
The Expression model generates real-time facial expressions using a GPU-accelerated diffusion pipeline. It takes any face image and produces dynamic lip-sync, eye movement, and emotional expressions — no pre-built model file needed.
Strengths:
- Works with any face image — no avatar generation step required.
- AI-driven expressions respond to speech content and context.
- Higher visual fidelity for close-up conversational interactions.
Deployment options:
- Cloud Plugin — bitHuman hosts the GPU worker (add model="expression")
- Self-Hosted GPU — run the Docker container on your own NVIDIA GPU
- macOS Local — run natively on Apple Silicon M3+ via the Swift SDK
Which Should I Use?
I want a 24/7 kiosk or always-on display
Use Essence. It has no idle timeout and runs on CPU, making it reliable for unattended deployments.
I want the quickest setup with any face photo
Use Expression with the Cloud Plugin. Just pass avatar_image=Image.open("face.jpg") — no generation step.
I need custom gestures (wave, nod, laugh)
Use Essence. Expression does not support gesture triggers.
I'm building a voice agent with LiveKit
Either model works. Start with Essence (default) for reliability, switch to Expression for dynamic faces.
I'm deploying on edge hardware (Raspberry Pi, laptop)
Use Essence. It runs on CPU with 1-2 cores at 25 FPS.
I need the highest visual quality for video generation
Use Expression with quality="high". Best for offline video generation, not real-time streaming.
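The decision guide above can be condensed into a small chooser. The flag names below are hypothetical, not SDK options — they simply encode the rules of thumb from this section: gestures, always-on sessions, and CPU-only hardware all point to Essence; an arbitrary face photo with no generation step points to Expression; otherwise either works, with Essence as the default.

```python
# Illustrative chooser encoding the guidance above. The requirement
# flags are hypothetical names, not SDK options.
def choose_model(*, needs_gestures: bool = False,
                 always_on: bool = False,
                 cpu_only: bool = False,
                 any_face_photo: bool = False) -> str:
    # Gestures, 24/7 sessions, and CPU-only hardware require Essence.
    if needs_gestures or always_on or cpu_only:
        return "essence"
    # Arbitrary face photos with no generation step favor Expression.
    if any_face_photo:
        return "expression"
    # Either works for a standard LiveKit voice agent; default Essence.
    return "essence"

print(choose_model(any_face_photo=True))            # expression
print(choose_model(cpu_only=True, any_face_photo=True))  # essence
```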