Preview Feature — 2 credits per minute while using the GPU container.
Overview
The self-hosted GPU avatar container (docker.io/sgubithuman/expression-avatar:latest) enables production-grade avatar generation on your own GPU infrastructure.
- Full Control — Complete control over deployment, scaling, and configuration
- Cost Optimization — Pay only for the GPU resources you use
- Data Privacy — Avatar images and audio never leave your infrastructure
- Customization — Extend the worker with custom logic and integrations
How It Works
The container is a GPU worker that joins a LiveKit room and streams avatar video frames in real time. Your application calls the /launch endpoint with LiveKit room credentials and an avatar image; the container connects to the room, listens for audio, and generates lip-synced video at 25 FPS — entirely on your GPU.
Prerequisites
- NVIDIA GPU with ≥8 GB VRAM
- NVIDIA Container Toolkit installed
- Docker 24+ with Compose v2
- bitHuman API secret from Developer → API Keys
- ~5 GB of free disk for model weights (downloaded automatically on first start, cached in a Docker volume)
- A running LiveKit server (or LiveKit Cloud)
Quick Start
Model weights download automatically on first run — just provide your API secret.
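A minimal start command, reconstructed from the defaults on this page (image name, port 8089, model volume); adjust flags to your environment:

```bash
docker run -d --gpus all \
  -p 8089:8089 \
  -e BITHUMAN_API_SECRET=<your-api-secret> \
  -v bithuman-models:/data/models \
  docker.io/sgubithuman/expression-avatar:latest
```

The -v bithuman-models:/data/models named volume caches the downloaded weights, so you only pay the download cost once.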
Once healthy, the container is ready to accept avatar sessions via /launch.
Docker Compose Setup
Use the full example for a complete setup with LiveKit, an AI agent, and a web frontend, then open http://localhost:4202 to start a conversation with your GPU avatar.
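A sketch, assuming you have the examples repository checked out locally (see Next Steps):

```bash
# From the GPU example directory of the examples repo
cd bithuman-examples/expression-selfhosted
docker compose up -d
# Once everything is healthy, open http://localhost:4202
```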
Integration Guide
The container exposes a simple HTTP API. Your LiveKit agent calls /launch to start an avatar session. There are two ways to integrate:
Option 1: LiveKit Python Plugin (Recommended)
Install the bitHuman LiveKit plugin and point an AvatarSession at your container's /launch endpoint. The plugin calls /launch automatically when a participant joins; see the sketch below.
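A minimal sketch, assuming the plugin is published as livekit-plugins-bithuman and that AvatarSession accepts the container endpoint via an api_url-style argument; check the plugin reference for the exact parameter names:

```python
# pip install livekit-agents livekit-plugins-bithuman   # assumed package names
from livekit.agents import AgentSession, JobContext
from livekit.plugins import bithuman

async def entrypoint(ctx: JobContext):
    session = AgentSession(
        # ... your STT/LLM/TTS pipeline ...
    )

    # Point the avatar worker at the self-hosted container.
    # `api_url` is an assumed parameter name; see the plugin docs.
    avatar = bithuman.AvatarSession(
        api_url="http://localhost:8089/launch",
        avatar_image="https://example.com/avatar.png",  # hypothetical image URL
    )
    await avatar.start(session, room=ctx.room)
```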
Option 2: Direct HTTP API
You can call /launch directly from any HTTP client. The container joins the LiveKit room as a video publisher.
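For example, a minimal curl call (field names from the reference below; values are placeholders):

```bash
curl -X POST http://localhost:8089/launch \
  -F livekit_url=ws://livekit:7880 \
  -F livekit_token=<token-with-publish-permissions> \
  -F room_name=my-room \
  -F avatar_image=@avatar.jpg
```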
HTTP API Reference
All endpoints are served on port 8089 (default).
POST /launch
Start an avatar session for a LiveKit room. The container joins the room and begins streaming lip-synced video.
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| livekit_url | string | Yes | LiveKit server WebSocket URL (e.g. ws://livekit:7880) |
| livekit_token | string | Yes | LiveKit room token with publish permissions |
| room_name | string | Yes | LiveKit room name (must match the token) |
| avatar_image | file | No* | Avatar image file upload (JPEG/PNG) |
| avatar_image_url | string | No* | HTTPS URL of the avatar image (alternative to file upload) |
| prompt | string | No | Motion prompt (default: "A person is talking naturally.") |
| api_secret | string | No | Override billing secret (defaults to BITHUMAN_API_SECRET) |
| async_mode | bool | No | Return immediately (true, default) or wait for the session to end |
*Provide either avatar_image or avatar_image_url. If neither is given, a default image is used.
Response (async_mode=true):
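An illustrative response shape (only task_id and the status lifecycle are implied elsewhere on this page; treat other details as unspecified):

```json
{
  "task_id": "a1b2c3d4",
  "status": "pending"
}
```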
Errors:
- 503 Service Unavailable — container still initializing, or at session capacity
- 400 Bad Request — invalid image or download failed
GET /health
Lightweight health check. Always returns 200 once the container is running (even during model loading).
GET /ready
Readiness check. Returns 200 only when the model is loaded and a session slot is available. Use this to gate traffic in load balancers or health checks.
Returns 503 with "status": "not_ready" during model loading, or "status": "at_capacity" when all session slots are in use.
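For example, to gate traffic until the container can accept a session (default port assumed):

```bash
# Poll until /ready returns 200 (model loaded and a slot is free)
until curl -fsS http://localhost:8089/ready >/dev/null; do sleep 2; done
```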
GET /tasks
List all sessions (active and completed).
GET /tasks/{task_id}
Check the status of a specific session.
Status lifecycle: pending → running → completed / failed / cancelled
POST /tasks/{task_id}/stop
Stop a running session and release the session slot.
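Quick examples against the default port:

```bash
curl http://localhost:8089/tasks                         # list all sessions
curl http://localhost:8089/tasks/<task_id>               # status of one session
curl -X POST http://localhost:8089/tasks/<task_id>/stop  # force-release its slot
```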
POST /benchmark
Run an inference benchmark and return per-stage timing. Useful for verifying GPU performance.
GET /test-frame
Generate a few chunks and return the last frame as a JPEG. Useful for verifying the model is producing valid output.
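Both are handy for smoke-testing a fresh deployment (default port assumed):

```bash
curl -X POST http://localhost:8089/benchmark          # per-stage inference timing
curl -o frame.jpg http://localhost:8089/test-frame    # should be a valid JPEG
```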
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| BITHUMAN_API_SECRET | Yes | — | API secret for billing and weight download |
| MAX_SESSIONS | No | 8 | Max concurrent avatar sessions per container |
| SESSION_IDLE_TIMEOUT | No | 600 | Seconds an idle session is held before auto-release. Lower (e.g. 120) if standby/widget sessions accumulate on shared GPUs. |
| AUTO_PREWARM | No | true | Load model and compile at startup. Set false only for cold-start testing. |
| TASK_RETENTION_SECONDS | No | 3600 | How long completed/failed task records are kept in /tasks history. |
| MAX_TASK_HISTORY | No | 1000 | Hard cap on retained terminal tasks; oldest are dropped first. Active sessions are never evicted. |
| PYTORCH_CUDA_ALLOC_CONF | No | expandable_segments:True | Set in the image. Allows the CUDA allocator to grow/shrink dynamically — keeps VRAM steady across sessions. Don't override unless you know why. |
| CUDA_VISIBLE_DEVICES | No | all GPUs | Restrict the container to a specific GPU (e.g. 0) |
Concurrency & Long-Running Containers
The container is designed to run continuously. The model is loaded once at startup (~5 GB of GPU weights) and shared across all sessions; each session adds ~50 MB of per-session GPU state on top. This keeps VRAM steady: it grows with the number of sessions running concurrently, not with the total number of sessions you have served.

Tuning concurrency
Set MAX_SESSIONS to match your GPU. Each active session uses roughly 600 MB of additional VRAM beyond the shared 5 GB baseline.
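As a rough sizing rule, VRAM ≈ 5 GB + 0.6 GB × MAX_SESSIONS: the default MAX_SESSIONS=8 budgets about 9.8 GB, while MAX_SESSIONS=4 fits in roughly 7.4 GB, just inside the 8 GB minimum.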
POST /launch returns 503 at_capacity when all slots are used; clients should back off or scale horizontally (run a second container).
Reclaim idle GPU slots faster
Sessions sit idle for SESSION_IDLE_TIMEOUT seconds (default 10 minutes) before being auto-released back to the pool. If you embed the avatar in widgets where users open and abandon tabs, idle slots can accumulate and exhaust capacity; see the fragment below.
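For example, as a compose fragment (the 120-second value mirrors the suggestion in the environment variables table):

```yaml
environment:
  SESSION_IDLE_TIMEOUT: "120"  # reclaim abandoned widget sessions after 2 minutes
```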
Memory monitoring
Container RSS should stabilize within the first hour of runtime (after PyTorch's CUDA allocator settles). If RSS keeps climbing beyond that:
- Check the /tasks count. If it grows without bound, pull a recent image; current images prune terminal task records every 5 minutes.
- Lower MAX_TASK_HISTORY and/or TASK_RETENTION_SECONDS to keep less task history in memory.

Production resource limits (recommended for compose)
For long-running deployments, cap container resources so a runaway container can't take down the host. The full example in bithuman-examples/expression-selfhosted ships with all of these defaults.
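A sketch of what such limits can look like in compose (the specific values are illustrative; only init: true is called out elsewhere on this page):

```yaml
services:
  avatar:
    image: sgubithuman/expression-avatar:latest
    # (GPU access flags omitted; see the Quick Start)
    init: true               # PID 1 reaps subprocesses (see Troubleshooting)
    restart: unless-stopped
    mem_limit: 16g           # illustrative host-RAM cap
    pids_limit: 512          # illustrative
```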
Performance Characteristics
| GPU Tier | VRAM Usage | Concurrent Sessions |
|---|---|---|
| High-end (data center) | ~6 GB | up to 8 concurrent |
| High-end (consumer) | ~6 GB | up to 4 concurrent |
| Mid-range | ~6 GB | up to 2 concurrent |

| Configuration | Time to First Frame | Description |
|---|---|---|
| Long-running container | ~4–6 seconds | Model loaded at startup; new sessions encode image (~2s) then stream |
| Cold start | ~48 seconds | Full GPU model compilation on first start (cached on subsequent starts) |
Long-Running Containers (Recommended)
Keep the container running between sessions. The model loads once at startup (~48 s including GPU compilation), and subsequent sessions start in ~4–6 seconds.

Troubleshooting
| Problem | Solution |
|---|---|
| Container won’t start | Check GPU: nvidia-smi; check logs: docker logs <id> |
| First start takes >5 minutes | Normal — weights are downloading (~4.7 GB). Check logs for download progress. |
| Download fails with 401 | Verify BITHUMAN_API_SECRET is set and valid |
| Download fails with connection error | Check outbound internet access from the container |
| /health returns connection refused | Container still initializing — check docker logs <id> for startup progress |
| /launch returns 503 not_ready | Model still loading — poll /ready until model_ready: true |
| /launch returns 503 at_capacity | All session slots in use; increase MAX_SESSIONS or scale horizontally |
| Startup takes >2 minutes (after download) | GPU compilation runs once per container — subsequent starts reuse compiled cache |
| Out of memory (GPU) | Use a GPU with ≥8 GB VRAM; reduce MAX_SESSIONS if needed |
| Container RSS keeps climbing | Check /tasks count — if growing without bound, ensure you are on a recent image (docker pull sgubithuman/expression-avatar:latest) which prunes terminal task records every 5 minutes. Lower MAX_TASK_HISTORY if needed. |
| VRAM creeps up over time on idle container | Idle/standby sessions are not being reclaimed. Lower SESSION_IDLE_TIMEOUT (e.g. 120s for widget embeds). |
| Sessions stuck at “running” forever | Check the LiveKit room — the client may have disconnected without the worker noticing. POST /tasks/{id}/stop to force release. Idle timeout will also reclaim them after SESSION_IDLE_TIMEOUT. |
| Orphan GPU processes after container restart | nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv — kill leftover PIDs (kill -9 <pid>). Use init: true in compose to ensure PID 1 reaps subprocesses. |
| 503 at_capacity under load | Increase MAX_SESSIONS (if VRAM allows) or run a second container; round-robin via reverse proxy. |
| Container memory spikes during inference but doesn’t release | Normal for PyTorch’s caching allocator. With PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (default), it will return to baseline within minutes of idle. |
| Billing not working | Verify BITHUMAN_API_SECRET is set and valid |
| Avatar image not showing | Check /test-frame — if it returns a valid JPEG, image encoding is working |
Next Steps
- Docker Example — full GPU setup with LiveKit server, AI agent, and web frontend
- LiveKit Agents — LiveKit Agents Python SDK documentation
