

Preview Feature — 2 credits per minute while using the GPU container.
On Apple Silicon? The Expression model also runs natively on macOS with M3+ chips — no Docker or NVIDIA GPU needed. See Expression on macOS.

Overview

The self-hosted GPU avatar container (docker.io/sgubithuman/expression-avatar:latest) enables production-grade avatar generation on your own GPU infrastructure.
  • Full Control — Complete control over deployment, scaling, and configuration
  • Cost Optimization — Pay only for the GPU resources you use
  • Data Privacy — Avatar images and audio never leave your infrastructure
  • Customization — Extend the worker with custom logic and integrations

How It Works

The container is a GPU worker that joins a LiveKit room and streams avatar video frames in real time. Your application calls the /launch endpoint with LiveKit room credentials and an avatar image; the container connects to the room, listens for audio, and generates lip-synced video at 25 FPS — entirely on your GPU.
Your Agent (LiveKit)

      │  POST /launch
      │  { livekit_url, livekit_token, room_name, avatar_image }

bitHuman GPU container

      ├─ Joins LiveKit room as video publisher
      ├─ Receives audio from agent via data stream
      └─ Generates 25 FPS lip-synced video → streams to room

         100% local GPU — no cloud calls during inference

Prerequisites

  • An NVIDIA GPU with at least 8 GB of VRAM
  • Docker with the NVIDIA Container Toolkit installed (so --gpus all works)
  • Your bitHuman API secret (BITHUMAN_API_SECRET), used for billing and the automatic weight download
  • Outbound internet access for the one-time model weight download (~4.7 GB)

Quick Start

Model weights download automatically on first run — just provide your API secret:
# 1. Pull the image (~360 MB)
docker pull docker.io/sgubithuman/expression-avatar:latest

# 2. Run — proprietary weights (~4.7 GB) download automatically on first start
docker run --gpus all -p 8089:8089 \
    -v bithuman-models:/data/models \
    -e BITHUMAN_API_SECRET=your_api_secret \
    docker.io/sgubithuman/expression-avatar:latest
# 3. Wait for startup (first run: ~3 min download + ~48s GPU compilation)
#    Subsequent starts: ~48s (weights already cached in the named volume)
curl http://localhost:8089/health
# {"status": "healthy", "active_sessions": 0, "max_sessions": 8}
The -v bithuman-models:/data/models named volume caches the downloaded weights so you only pay the download cost once. Once healthy, the container is ready to accept avatar sessions via /launch.
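For scripted deployments, the health endpoint can be polled until the container comes up. A minimal sketch using only the Python standard library; the base URL, timeout values, and function name are illustrative, not part of the container API:

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_healthy(base_url="http://localhost:8089", timeout=300):
    """Poll GET /health until the container responds, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                info = json.load(resp)
                if info.get("status") == "healthy":
                    return info
        except (urllib.error.URLError, OSError):
            pass  # container still starting; /health refuses connections early on
        time.sleep(2)
    raise TimeoutError(f"{base_url}/health did not become healthy in {timeout}s")
```

Note that /health reports healthy even while the model is still loading; gate actual traffic on /ready instead (see the API reference).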

Docker Compose Setup

Use the full example for a complete setup with LiveKit, an AI agent, and a web frontend:
git clone https://github.com/bithuman-product/bithuman-examples.git
cd bithuman-examples/expression-selfhosted

# Configure environment
cp .env.example .env
# Edit .env with your API secret, OpenAI key, and avatar image

# Copy your avatar image into ./avatars/
mkdir -p avatars
cp /path/to/your/avatar.jpg avatars/

# Model weights download automatically on first run — nothing to pre-download!
docker compose up
Open http://localhost:4202 to start a conversation with your GPU avatar.

Integration Guide

The container exposes a simple HTTP API. Your LiveKit agent calls /launch to start an avatar session. There are two ways to integrate.

Option 1: LiveKit Plugin

Install the bitHuman LiveKit plugin:
pip install livekit-plugins-bithuman
In your LiveKit agent, point AvatarSession at your container’s /launch endpoint:
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, WorkerType, cli
from livekit.plugins import bithuman, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    await ctx.wait_for_participant()

    avatar = bithuman.AvatarSession(
        api_url="http://localhost:8089/launch",   # your container
        api_secret="your_api_secret",              # for billing
        avatar_image="/path/to/avatar.jpg",        # local file or HTTPS URL
    )

    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="coral"),
        vad=silero.VAD.load(),
    )

    await avatar.start(session, room=ctx.room)
    await session.start(
        agent=Agent(instructions="You are a helpful assistant."),
        room=ctx.room,
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, worker_type=WorkerType.ROOM))
The plugin handles room token generation and calls /launch automatically when a participant joins.

Option 2: Direct HTTP API

You can call /launch directly from any HTTP client. The container joins the LiveKit room as a video publisher.
# Generate a LiveKit room token first (using livekit-server-sdk or CLI)
TOKEN=$(livekit-token create --room my-room --identity avatar-worker \
    --api-key devkey --api-secret your-livekit-secret)

# Launch with an image URL
curl -X POST http://localhost:8089/launch \
  -F "livekit_url=ws://your-livekit-server:7880" \
  -F "livekit_token=$TOKEN" \
  -F "room_name=my-room" \
  -F "avatar_image_url=https://example.com/avatar.jpg"

# Or upload an image file directly
curl -X POST http://localhost:8089/launch \
  -F "livekit_url=ws://your-livekit-server:7880" \
  -F "livekit_token=$TOKEN" \
  -F "room_name=my-room" \
  -F "avatar_image=@./avatar.jpg"
Response (async by default):
{
  "status": "pending",
  "task_id": "a1b2c3d4",
  "room_name": "my-room"
}
The avatar is live in the room within ~4–6 seconds.
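The same flow can be scripted end to end: call /launch, then poll /tasks/{task_id} until the session leaves "pending". A sketch assuming the third-party requests library and a container at localhost:8089; the helper names are illustrative, not part of the API:

```python
import time
import requests  # third-party: pip install requests

CONTAINER = "http://localhost:8089"  # assumption: container reachable locally

def launch_form(livekit_url, token, room, image_url):
    """Build the multipart form fields for POST /launch.

    requests encodes each (None, value) tuple as a plain multipart field,
    matching the endpoint's multipart/form-data content type."""
    return {
        "livekit_url": (None, livekit_url),
        "livekit_token": (None, token),
        "room_name": (None, room),
        "avatar_image_url": (None, image_url),
    }

def launch_and_wait(livekit_url, token, room, image_url, timeout=30):
    """POST /launch, then poll GET /tasks/{task_id} until the session starts."""
    resp = requests.post(
        f"{CONTAINER}/launch",
        files=launch_form(livekit_url, token, room, image_url),
        timeout=30,
    )
    resp.raise_for_status()
    task_id = resp.json()["task_id"]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = requests.get(f"{CONTAINER}/tasks/{task_id}", timeout=5).json()
        if task["status"] != "pending":
            return task  # "running", or "failed" with task["error"] set
        time.sleep(1)
    raise TimeoutError(f"task {task_id} still pending after {timeout}s")
```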

HTTP API Reference

All endpoints are served on port 8089 (default).

POST /launch

Start an avatar session for a LiveKit room. The container joins the room and begins streaming lip-synced video.

Content-Type: multipart/form-data

Field | Type | Required | Description
livekit_url | string | Yes | LiveKit server WebSocket URL (e.g. ws://livekit:7880)
livekit_token | string | Yes | LiveKit room token with publish permissions
room_name | string | Yes | LiveKit room name (must match token)
avatar_image | file | No* | Avatar image file upload (JPEG/PNG)
avatar_image_url | string | No* | Avatar image HTTPS URL (alternative to file upload)
prompt | string | No | Motion prompt (default: "A person is talking naturally.")
api_secret | string | No | Override billing secret (defaults to BITHUMAN_API_SECRET)
async_mode | bool | No | Return immediately (true, default) or wait for session to end
*Provide either avatar_image or avatar_image_url. If neither is given, a default image is used.

Response (async_mode=true):
{ "status": "pending", "task_id": "a1b2c3d4", "room_name": "my-room" }
Error responses:
  • 503 Service Unavailable — container still initializing, or at session capacity
  • 400 Bad Request — invalid image or download failed

GET /health

Lightweight health check. Always returns 200 once the container is running (even during model loading).
{
  "status": "healthy",
  "active_sessions": 2,
  "max_sessions": 8
}

GET /ready

Readiness check. Returns 200 only when the model is loaded and a session slot is available. Use this to gate traffic in load balancers or health checks.
{
  "status": "ready",
  "model_ready": true,
  "active_sessions": 2,
  "available_sessions": 6,
  "max_sessions": 8
}
Returns 503 with "status": "not_ready" during model loading, or "status": "at_capacity" when all session slots are in use.
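The same gate also works in application code. A stdlib-only sketch that returns the first container from a pool whose /ready check passes; the pool list and helper name are assumptions for illustration:

```python
import json
import urllib.error
import urllib.request

def pick_ready(base_urls):
    """Return the first container base URL whose /ready check passes.

    A 503 from /ready (not_ready or at_capacity) raises HTTPError and
    the container is skipped, as is any unreachable container."""
    for base in base_urls:
        try:
            with urllib.request.urlopen(f"{base}/ready", timeout=3) as resp:
                info = json.load(resp)
                if info.get("model_ready") and info.get("available_sessions", 0) > 0:
                    return base
        except (urllib.error.URLError, OSError):
            continue  # 503, connection refused, or timeout
    return None  # nothing ready; caller should back off or scale out
```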

GET /tasks

List all sessions (active and completed).
curl http://localhost:8089/tasks
{
  "tasks": [
    {
      "task_id": "a1b2c3d4",
      "room_name": "my-room",
      "status": "running",
      "created_at": "2024-01-01T12:00:00",
      "completed_at": null,
      "error": null
    }
  ]
}

GET /tasks/{task_id}

Check the status of a specific session.
{
  "task_id": "a1b2c3d4",
  "room_name": "my-room",
  "status": "running",
  "created_at": "2024-01-01T12:00:00",
  "completed_at": null,
  "error": null
}
Status values: pending → running → completed / failed / cancelled

POST /tasks/{task_id}/stop

Stop a running session and release the session slot.
curl -X POST http://localhost:8089/tasks/a1b2c3d4/stop

POST /benchmark

Run an inference benchmark and return per-stage timing. Useful for verifying GPU performance.
curl -X POST "http://localhost:8089/benchmark?iterations=10"
{
  "iterations": 10,
  "frames_per_generate": 24,
  "avg_ms": 79.3,
  "fps": 302.6,
  "vram_gb": 6.2,
  "gpu": "NVIDIA GPU"
}
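Since the avatar streams at 25 FPS, benchmark throughput divided by 25 gives a rough upper bound on how many real-time streams the GPU's raw compute could sustain. A quick sketch using the sample response above; the helper name is illustrative, and practical capacity is further limited by VRAM and MAX_SESSIONS:

```python
def realtime_headroom(bench, target_fps=25):
    """Estimate how many 25 FPS streams the measured throughput covers.

    This is a raw-compute upper bound; expect the practical number of
    concurrent sessions to be lower."""
    return bench["fps"] / target_fps

# Sample /benchmark response from the docs:
sample = {"iterations": 10, "frames_per_generate": 24,
          "avg_ms": 79.3, "fps": 302.6, "vram_gb": 6.2}
print(f"~{realtime_headroom(sample):.1f}x real-time")  # -> ~12.1x real-time
```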

GET /test-frame

Generate a few chunks and return the last frame as a JPEG. Useful for verifying the model is producing valid output.
curl http://localhost:8089/test-frame --output frame.jpg
open frame.jpg

Environment Variables

Variable | Required | Default | Description
BITHUMAN_API_SECRET | Yes | (none) | API secret for billing and weight download
MAX_SESSIONS | No | 8 | Max concurrent avatar sessions per container
SESSION_IDLE_TIMEOUT | No | 600 | Seconds an idle session is held before auto-release. Lower (e.g. 120) if standby/widget sessions accumulate on shared GPUs.
AUTO_PREWARM | No | true | Load model and compile at startup. Set false only for cold-start testing.
TASK_RETENTION_SECONDS | No | 3600 | How long completed/failed task records are kept in /tasks history.
MAX_TASK_HISTORY | No | 1000 | Hard cap on retained terminal tasks; oldest are dropped first. Active sessions are never evicted.
PYTORCH_CUDA_ALLOC_CONF | No | expandable_segments:True | Set in image. Allows the CUDA allocator to grow/shrink dynamically — keeps VRAM steady across sessions. Don’t override unless you know why.
CUDA_VISIBLE_DEVICES | No | all GPUs | Restrict to a specific GPU (e.g. 0)
Without BITHUMAN_API_SECRET, avatar sessions will run but usage will not be tracked or billed. This is not permitted for production use.

Concurrency & Long-Running Containers

The container is designed to run continuously. The model is loaded once at startup (~5 GB GPU weights) and shared across all sessions; each session adds ~50 MB of per-session GPU state on top. This keeps VRAM steady — it does not grow with the number of sessions you’ve served, only with the number of sessions running concurrently.
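Using the figures on this page (a ~5 GB shared baseline, plus roughly 600 MB of VRAM per active session per the tuning note below), steady-state VRAM can be estimated for a planned concurrency level. A small sketch; the helper name is illustrative:

```python
def expected_vram_gb(active_sessions, baseline_gb=5.0, per_session_gb=0.6):
    """Steady-state VRAM estimate: shared baseline + per-active-session state.

    Figures are taken from this page (~5 GB shared weights, ~600 MB per
    active session). VRAM tracks concurrent sessions, not total served."""
    return baseline_gb + per_session_gb * active_sessions

# e.g. 8 concurrent sessions on a data-center GPU:
print(f"{expected_vram_gb(8):.1f} GB")  # -> 9.8 GB
```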

Tuning concurrency

Set MAX_SESSIONS to match your GPU. Each active session uses roughly 600 MB of additional VRAM beyond the shared 5 GB baseline:
# 8 GB consumer GPU — 2 concurrent sessions
docker run --gpus all -e MAX_SESSIONS=2 -e BITHUMAN_API_SECRET=... ...

# 24 GB consumer GPU — 4 concurrent sessions
docker run --gpus all -e MAX_SESSIONS=4 -e BITHUMAN_API_SECRET=... ...

# 80 GB H100 — 8-9 concurrent sessions (production tier)
docker run --gpus all -e MAX_SESSIONS=9 -e BITHUMAN_API_SECRET=... ...
POST /launch returns 503 at_capacity when all slots are used; clients should back off or scale horizontally (run a second container).
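Client-side handling of 503 at_capacity can be sketched as a retry loop with exponential backoff across a pool of containers. This assumes the third-party requests library; all names here are illustrative:

```python
import random
import time
import requests  # third-party: pip install requests

def launch_with_backoff(containers, form, max_attempts=5):
    """Try POST /launch across a pool of container base URLs.

    Skips containers returning 503 (at capacity or not ready) and
    unreachable hosts; backs off exponentially when the whole pool
    is busy. `form` is the multipart payload for /launch."""
    delay = 1.0
    for _ in range(max_attempts):
        for base in containers:
            try:
                resp = requests.post(f"{base}/launch", files=form, timeout=30)
                if resp.status_code == 503:
                    continue  # try the next container in the pool
                resp.raise_for_status()
                return resp.json()
            except requests.ConnectionError:
                continue
        # whole pool busy: exponential backoff with jitter
        time.sleep(delay + random.uniform(0, delay / 2))
        delay = min(delay * 2, 30)
    raise RuntimeError("no container accepted the session; scale out the pool")
```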

Reclaim idle GPU slots faster

Idle sessions are held for SESSION_IDLE_TIMEOUT seconds (default 10 minutes) before being auto-released back to the pool. If you embed the avatar in widgets where users open and abandon tabs, idle slots can accumulate and exhaust capacity:
# Reclaim idle slots after 2 minutes for high-churn embeds
-e SESSION_IDLE_TIMEOUT=120

Memory monitoring

Container RSS should stabilize within the first hour of runtime (after PyTorch’s CUDA allocator settles). If RSS keeps climbing beyond that:
# Watch container RSS and active sessions
docker stats expression-avatar --no-stream
curl -s http://localhost:8089/health | jq .

# Watch GPU VRAM and orphan processes
nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
VRAM > 7 GB on an idle container, or RSS climbing past 20 GB, indicates either too many concurrent sessions or zombie sessions — see Troubleshooting.

For long-running deployments, cap container resources so a runaway container can’t take down the host:
services:
  expression-avatar:
    image: docker.io/sgubithuman/expression-avatar:latest
    init: true              # reap zombie subprocess on PID 1
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 20G        # hard cap RAM
        reservations:
          memory: 12G
          devices:
            - { driver: nvidia, count: 1, capabilities: [gpu] }
    shm_size: 2g
    logging:
      driver: json-file
      options:
        max-size: "50m"      # cap log volume
        max-file: "5"
The full reference compose at bithuman-examples/expression-selfhosted ships with all of these defaults.

Performance Characteristics

GPU Tier | VRAM Usage | Concurrent Sessions
High-end (data center) | ~6 GB | up to 8
High-end (consumer) | ~6 GB | up to 4
Mid-range | ~6 GB | up to 2

Configuration | Time to First Frame | Description
Long-running container | ~4–6 seconds | Model loaded at startup; new sessions encode the image (~2s) then stream
Cold start | ~48 seconds | Full GPU model compilation on first start (cached on subsequent starts)
Keep the container running between sessions. The model loads once at startup (~48s including GPU compilation), and subsequent sessions start in ~4–6 seconds.
docker run --gpus all -p 8089:8089 --restart always \
    -v bithuman-models:/data/models \
    -e BITHUMAN_API_SECRET=your_api_secret \
    docker.io/sgubithuman/expression-avatar:latest

Troubleshooting

Problem | Solution
Container won’t start | Check GPU: nvidia-smi; check logs: docker logs <id>
First start takes >5 minutes | Normal — weights are downloading (~4.7 GB). Check logs for download progress.
Download fails with 401 | Verify BITHUMAN_API_SECRET is set and valid
Download fails with connection error | Check outbound internet access from the container
/health returns connection refused | Container still initializing — check docker logs <id> for startup progress
/launch returns 503 not_ready | Model still loading — poll /ready until model_ready: true
/launch returns 503 at_capacity | All session slots in use; increase MAX_SESSIONS or scale horizontally
Startup takes >2 minutes (after download) | GPU compilation runs once per container — subsequent starts reuse the compiled cache
Out of memory (GPU) | Use a GPU with ≥8 GB VRAM; reduce MAX_SESSIONS if needed
Container RSS keeps climbing | Check the /tasks count — if it grows without bound, make sure you are on a recent image (docker pull sgubithuman/expression-avatar:latest), which prunes terminal task records every 5 minutes. Lower MAX_TASK_HISTORY if needed.
VRAM creeps up over time on an idle container | Idle/standby sessions are not being reclaimed. Lower SESSION_IDLE_TIMEOUT (e.g. 120s for widget embeds).
Sessions stuck at “running” forever | Check the LiveKit room — the client may have disconnected without the worker noticing. POST /tasks/{id}/stop to force release. The idle timeout will also reclaim them after SESSION_IDLE_TIMEOUT.
Orphan GPU processes after container restart | nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv — kill leftover PIDs (kill -9 <pid>). Use init: true in compose so PID 1 reaps subprocesses.
503 at_capacity under load | Increase MAX_SESSIONS (if VRAM allows) or run a second container; round-robin via a reverse proxy.
Container memory spikes during inference but doesn’t release | Normal for PyTorch’s caching allocator. With PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (the default), it returns to baseline within minutes of idle.
Billing not working | Verify BITHUMAN_API_SECRET is set and valid
Avatar image not showing | Check /test-frame — if it returns a valid JPEG, image encoding is working

Next Steps

Docker Example

Full GPU setup with LiveKit server, AI agent, and web frontend

LiveKit Agents

LiveKit Agents Python SDK documentation