Essence vs Expression

bitHuman's two avatar model families, Essence and Expression: what the first-generation models do, where each runs (on-device CPU, Raspberry Pi, Apple Silicon, or NVIDIA GPU), and which one to pick.

The engines

bitHuman’s avatar runtime is a family of rendering engines plus the conversation and voice stack that feeds them. The two render engines you choose between when packaging an avatar — and the focus of the rest of this page — are Essence and Expression.

Rendering engines — two product families, each with tiers:

Essence — the avatar family (a packaged .imx identity with real-time lip-sync):
- Essence 1 — first generation. Pre-built identity, runs on virtually any CPU. (No longer the creation default: as of 2026-07-12 an omitted model on /v1/agent/generate defaults to Expression 1 — still a v1 engine at the same 250-credit rate, never a v2 engine or a higher price. For new photoreal work the recommended model is Essence 2.)
- Essence 2 — the standard photoreal model and the default: a distilled renderer that runs everywhere (GPU, CPU, the Apple Neural Engine — including fully on-device — and in-browser WebGPU/WASM).
- Essence 2 Max — the premium model: the highest-fidelity renderer served on dedicated cloud GPUs.

Expression — the expressive family (animation driven from a portrait at runtime):
- Expression 1 — first generation. Dynamic facial animation from any portrait image (Apple Silicon or NVIDIA GPU).
- Expression 2 — the second-generation generative engine: audio-driven, fully-generated motion from a single photo rather than patching a pre-rendered base. Serves on gpu, cpu, and ane tiers.

New: the two second-generation models, essence-2 and expression-2, are available now (launched July 10, 2026). See Essence 2 & Expression 2 for the family overview, and the official per-model guides: Expression 2 · Essence 2 · Essence 2 Max.

Each family shares one .imx format, SDK methods, and the push audio → drain frames shape; the tier is selected per session and is transparent to your integration.
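
The push audio → drain frames shape can be pictured with a toy stand-in. This is a minimal sketch, not the SDK: `ToyRenderer`, `push_audio`, and `drain_frames` are hypothetical names, and the 16 kHz sample rate is an assumption. Only the 25 FPS figure comes from this page.

```python
from collections import deque
from typing import Iterator

FPS = 25                # frame rate quoted on this page for Essence 1
SAMPLE_RATE = 16_000    # assumption: 16 kHz mono PCM (not stated in the source)
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS  # 640 samples, i.e. 40 ms of audio per frame


class ToyRenderer:
    """Hypothetical stand-in for a renderer; NOT the bitHuman SDK API."""

    def __init__(self) -> None:
        self._pending = deque()   # audio samples waiting to be rendered
        self._frame_index = 0

    def push_audio(self, samples: list) -> None:
        """Buffer incoming audio; nothing is rendered yet."""
        self._pending.extend(samples)

    def drain_frames(self) -> Iterator[int]:
        """Yield one frame index per whole 40 ms of buffered audio."""
        while len(self._pending) >= SAMPLES_PER_FRAME:
            for _ in range(SAMPLES_PER_FRAME):
                self._pending.popleft()
            yield self._frame_index
            self._frame_index += 1


renderer = ToyRenderer()
renderer.push_audio([0] * 1600)        # 100 ms of silence
print(list(renderer.drain_frames()))   # [0, 1]; the remaining 320 samples stay buffered
```

The point of the shape is that the caller controls pacing: audio goes in whenever it arrives, and frames come out only when enough audio has accumulated to cover them.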

Conversation + voice stack — drives a managed agent and feeds the renderers:

Converse — the STT → LLM → TTS turn loop that drives a managed agent’s dialogue. It produces the audio that the renderers lip-sync.
Voice — the speech engine (the voice/TTS stack behind audio-only chat and the voices you select for an agent).

The rest of this page focuses on the first-generation models, Essence 1 vs Expression 1 — the numbers and hardware notes below are theirs. For the second generation, see Essence 2 & Expression 2.

At a glance

bitHuman’s two first-generation avatar models share the same .imx file format, the same SDK methods, and the same push audio → drain frames shape. Essence 1 runs on virtually any CPU, and it is the model behind the showcase avatars that bithuman pull downloads. Expression 1 is the heavier, higher-fidelity option for on-device Apple Silicon or GPU-server use cases.

	Essence 1	Expression 1
What it does	Pre-built avatar identity packaged in an `.imx` file. Real-time lip-sync.	Dynamic facial animation from any portrait image at runtime.
Avatar source	`.imx` you build once from a photo (the identity video is generated internally).	Any face image — provide at runtime, no build step.
Custom gestures	Yes (wave, nod, laugh, etc.)	No
Idle animation	Pre-recorded natural movement	AI-generated micro-movements
Compute needed	Any modern CPU	Apple Silicon M3+ (demo apps) or NVIDIA GPU
Memory footprint	Low (~200–500 MB)	Higher (~2–6 GB)
Best for	Kiosks, mobile, edge, 24/7 deployments, high concurrency	Close-up native consumer apps, custom faces per session
Pricing (first-generation rates)	1 credit/min self-hosted · 2 credits/min cloud	2 credits/min self-hosted · 4 credits/min cloud
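
To compare running costs, the pricing row can be turned into a small estimator. This is a rough sketch: the model and hosting keys are labels chosen for this example, and the linear formula is an assumption, since this page does not say how partial minutes are billed.

```python
# Credits per minute, copied from the pricing row of the table above.
# The dictionary keys are labels for this sketch, not official identifiers.
RATES = {
    ("essence-1", "self-hosted"): 1,
    ("essence-1", "cloud"): 2,
    ("expression-1", "self-hosted"): 2,
    ("expression-1", "cloud"): 4,
}


def session_credits(model: str, hosting: str, minutes: float) -> float:
    """Linear estimate: rate per minute multiplied by session length."""
    return RATES[(model, hosting)] * minutes


kiosk_minutes = 30 * 24 * 60  # one always-on stream for 30 days
print(session_credits("essence-1", "self-hosted", kiosk_minutes))  # 43200
print(session_credits("expression-1", "cloud", kiosk_minutes))     # 172800
```

For an always-on stream the gap between the cheapest and the most expensive combination is a factor of four, which is why the lighter model suits 24/7 and high-concurrency deployments.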

Both are available through bitHuman's integration surfaces (SDKs, REST API, LiveKit plugin, CLI, on-device, embed widget); hardware support differs by model, as the next table shows. The same .imx file works everywhere.

Where each model runs

Surface	Essence	Expression
iOS / iPadOS	iPhone 16 Pro+, iPad Pro M4+	iPad Pro M4+ (iPhone 16 Pro+ preview)
macOS arm64	Any Apple Silicon	M3+
macOS Intel	Pending (2.3 ships arm64 only)	—
Linux x86_64 / aarch64	Any modern CPU	via NVIDIA GPU (Docker)
Windows	Pending (use WSL2 today)	—
Raspberry Pi 4B+	Supported	—
bitHuman Cloud	Managed	Managed
Self-hosted CPU	Python SDK / LiveKit plugin	—
Self-hosted GPU	—	Docker container

Native macOS-Intel and Windows wheels are pending for the 2.3 line; the architecture page tracks per-platform shipping status. On iPhone, Essence delivers a fast, real-time on-device avatar; Expression’s heavier renderer targets iPad Pro and Mac (iPhone is in preview).
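
If deployment tooling needs to answer "which model can I run here?", the matrix above can be encoded as data. The sketch below is illustrative: `SUPPORT` and `models_for` are names invented for this example, and the strings are the table's own wording.

```python
# (Essence, Expression) per surface. None means the table shows a dash (not offered).
SUPPORT = {
    "iOS / iPadOS": ("iPhone 16 Pro+, iPad Pro M4+", "iPad Pro M4+ (iPhone 16 Pro+ preview)"),
    "macOS arm64": ("Any Apple Silicon", "M3+"),
    "macOS Intel": ("Pending (2.3 ships arm64 only)", None),
    "Linux x86_64 / aarch64": ("Any modern CPU", "via NVIDIA GPU (Docker)"),
    "Windows": ("Pending (use WSL2 today)", None),
    "Raspberry Pi 4B+": ("Supported", None),
    "bitHuman Cloud": ("Managed", "Managed"),
    "Self-hosted CPU": ("Python SDK / LiveKit plugin", None),
    "Self-hosted GPU": (None, "Docker container"),
}


def models_for(surface: str) -> list:
    """Return the models available today: neither a dash nor 'Pending'."""
    essence, expression = SUPPORT[surface]
    names = []
    if essence and not essence.startswith("Pending"):
        names.append("Essence")
    if expression and not expression.startswith("Pending"):
        names.append("Expression")
    return names


print(models_for("Raspberry Pi 4B+"))  # ['Essence']
print(models_for("Windows"))           # []
```

Treating "Pending" as unavailable is a design choice for this sketch; the per-platform shipping status on the architecture page is the authoritative source.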

Essence

Essence packages a complete avatar identity (face, body, gestures) into an .imx file. At runtime, the SDK plays back pre-rendered base motion and patches the mouth region in real time to match incoming audio.

Runtime characteristics

~200–500 MB resident, 1–2 CPU cores, real-time at 25 FPS.
Runs on macOS arm64, Linux x86_64 / aarch64, iOS, iPadOS, Raspberry Pi 4B+, and in the browser via WASM.
No idle timeout — sessions can run 24/7. Reliable for unattended kiosks and lobby displays.
Supports custom gestures (wave, nod, laugh) triggered by keywords or API.
Predictable, consistent behavior. Lower per-stream cost — the right pick for high-concurrency self-hosted deployments.
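
For capacity planning, the figures above suggest a back-of-envelope bound on concurrent streams per host. This sketch assumes the quoted ranges are per stream and scale linearly, and it ignores operating-system and application overhead; treat the result as an upper bound to validate by measurement.

```python
def max_essence_streams(cpu_cores: int, ram_mb: int,
                        cores_per_stream: int = 2,
                        mb_per_stream: int = 500) -> int:
    """Conservative bound using the upper ends of the quoted ranges
    (1-2 cores, ~200-500 MB resident)."""
    return min(cpu_cores // cores_per_stream, ram_mb // mb_per_stream)


print(max_essence_streams(cpu_cores=16, ram_mb=32_000))  # 8, limited by CPU
print(max_essence_streams(cpu_cores=4, ram_mb=1_000))    # 2, both limits agree
```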

Try it from the showcase

The CLI ships a curated set of ready-to-run Essence .imx avatars:

bithuman list                          # browse the showcase
bithuman pull modern-court-jester      # downloads to ~/.cache/bithuman/showcase/<slug>.imx
bithuman run modern-court-jester.imx   # live browser-served avatar
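
A script that wants to check whether an avatar has already been pulled can rebuild the cache path from the comment above. `showcase_path` is a helper invented for this sketch; it relies only on the documented `~/.cache/bithuman/showcase/<slug>.imx` location and does not call the CLI.

```python
from pathlib import Path


def showcase_path(slug: str) -> Path:
    """Location where `bithuman pull <slug>` is documented to store the avatar."""
    return Path.home() / ".cache" / "bithuman" / "showcase" / f"{slug}.imx"


path = showcase_path("modern-court-jester")
status = "downloaded" if path.exists() else "not pulled yet"
print(f"{path} ({status})")
```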

How to ship it

Python SDK — self-host on macOS arm64 + Linux x86_64 / aarch64.
Swift SDK — native Mac, iPad, iPhone apps.
bitHuman CLI — no code, terminal or browser.
REST API — backend integration in any language.
Cloud LiveKit plugin — managed, no infrastructure.
Embed widget — drop-in iframe for websites.

Expression

Expression generates real-time facial animation directly from a portrait image. The face can change between sessions or even mid-session — no avatar build step is required.

Runtime characteristics

~2–6 GB resident; needs Apple Silicon M3+ (Mac) / M4+ (iPad Pro) or an NVIDIA GPU (8 GB+ VRAM).
Works with any face image — drag-and-drop swap, photo, video frame, anything.
AI-driven expressions adapt to speech content and emotional context.
Higher visual fidelity for close-up conversational interactions.
On-device demo apps target macOS M3+ and iPad Pro M4+ today; iPhone Expression and macOS-Intel are on the way.
On Apple Silicon the Swift SDK auto-spawns a bithuman-expression-daemon subprocess to drive the model.
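
The hardware floor above can be expressed as a simple pre-flight check. This is a sketch under stated assumptions: the device labels and the function name are invented for this example, and the thresholds are the ones quoted in this section (M3+ on Mac, M4+ on iPad Pro, 8 GB+ VRAM on NVIDIA).

```python
def can_run_expression(device: str, chip_gen: int = 0, vram_gb: float = 0.0) -> bool:
    """device is "mac", "ipad-pro", or "nvidia" (labels chosen for this sketch)."""
    if device == "mac":
        return chip_gen >= 3      # Apple Silicon M3 or newer
    if device == "ipad-pro":
        return chip_gen >= 4      # M4 or newer
    if device == "nvidia":
        return vram_gb >= 8       # 8 GB or more of VRAM
    return False                  # anything else: fall back to Essence or the cloud


print(can_run_expression("mac", chip_gen=2))      # False
print(can_run_expression("nvidia", vram_gb=12))   # True
```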

How to ship it

Cloud LiveKit plugin — bitHuman hosts the GPU worker (set model="expression").
Self-hosted GPU — your own NVIDIA GPU via the Docker container.
On-device macOS / iPadOS — Apple Silicon M3+, via the Swift SDK.
bitHuman CLI — bithuman run with an Expression .imx.
REST API — same endpoint as Essence; the model is selected per agent.

Which should I use?

24/7 kiosk or always-on display

Essence. No idle timeout, runs on CPU, predictable for unattended deployments.

iPhone app

Essence. Expression on iPhone is still in preview, so iPad Pro and Mac are its on-device homes for now.

Native Mac or iPad app with close-up dynamic faces

Expression on-device via the Swift SDK or the Mac/iPad reference apps.

Need custom gestures (wave, nod, laugh)

Essence. Gestures such as wave, nod, and laugh are triggered by keyword or API; Expression has no custom gestures.

Quickest setup with any face photo

Expression via the cloud plugin. Pass the image at session start — no build step.

Voice agent on LiveKit with maximum concurrency

Essence. Lower per-stream cost makes it the right pick for high-concurrency deployments.

Edge hardware (Raspberry Pi, low-power laptop)

Essence. Runs on 1–2 CPU cores at 25 FPS.

Highest visual quality for offline video generation

Talking video generation — render a finished mp4 with any model, including Essence 2 Max for premium fidelity. Best for offline batch jobs rather than real-time streaming.
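
The real-time answers above reduce to a short decision rule. The sketch below is one possible encoding, not an official selector: the flag names are invented, the precedence (any Essence-only need wins) is an assumption, and so is the fallback to Essence when nothing is specified.

```python
def pick_model(*, iphone: bool = False, always_on: bool = False,
               gestures: bool = False, edge_cpu: bool = False,
               high_concurrency: bool = False,
               face_per_session: bool = False,
               closeup_native_app: bool = False) -> str:
    """Map first-generation requirements to "Essence" or "Expression"."""
    # Each of these needs is answered with Essence in the list above.
    if iphone or always_on or gestures or edge_cpu or high_concurrency:
        return "Essence"
    # Runtime faces and close-up Mac/iPad apps point to Expression.
    if face_per_session or closeup_native_app:
        return "Expression"
    return "Essence"  # assumption: default to the lighter model


print(pick_model(always_on=True))                        # Essence
print(pick_model(face_per_session=True))                 # Expression
print(pick_model(gestures=True, face_per_session=True))  # Essence
```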

Next steps

Essence 2 & Expression 2 — the second-generation models essence-2 and expression-2 (launched July 10, 2026), with per-model guides: Expression 2, Essence 2, Essence 2 Max.
Building avatars — get or generate your first avatar.
Pricing & credits — what each model costs to run.
SDK overview — run a model on your own hardware.
Architecture — engine layering and the full per-platform device matrix.
Avatars and the .imx format — how avatars are packaged.