Preview Feature — 2 credits per minute while using the GPU container.
Overview
The self-hosted GPU avatar container (docker.io/sgubithuman/expression-avatar:latest) enables production-grade avatar generation on your own GPU infrastructure.
- Full Control — Complete control over deployment, scaling, and configuration
- Cost Optimization — Pay only for the GPU resources you use
- Data Privacy — Avatar images and audio never leave your infrastructure
- Customization — Extend the worker with custom logic and integrations
How It Works
The container is a GPU worker that joins a LiveKit room and streams avatar video frames in real time. Your application calls the /launch endpoint with LiveKit room credentials and an avatar image; the container connects to the room, listens for audio, and generates lip-synced video at 25 FPS — entirely on your GPU.
Prerequisites
- NVIDIA GPU with ≥8 GB VRAM
- NVIDIA Container Toolkit installed
- Docker 24+ with Compose v2
- bitHuman API secret from Developer → API Keys
- ~5 GB of free disk for model weights (downloaded automatically on first start, cached in a Docker volume)
- A running LiveKit server (or LiveKit Cloud)
Quick Start
Model weights download automatically on first run — just provide your API secret.
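A minimal start command, reconstructed from the defaults on this page (image name, port 8089, model volume); adjust flags to your environment:

```bash
docker run -d --gpus all \
  -p 8089:8089 \
  -e BITHUMAN_API_SECRET=<your-api-secret> \
  -v bithuman-models:/data/models \
  docker.io/sgubithuman/expression-avatar:latest
```

The -v bithuman-models:/data/models named volume caches the downloaded weights, so you only pay the download cost once.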
Once healthy, the container is ready to accept avatar sessions via /launch.
Docker Compose Setup
Use the full example for a complete setup with LiveKit, an AI agent, and a web frontend, then open http://localhost:4202 to start a conversation with your GPU avatar.
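A sketch, assuming you have the examples repository checked out locally (see Next Steps):

```bash
# From the GPU example directory of the examples repo
cd bithuman-examples/expression-selfhosted
docker compose up -d
# Once everything is healthy, open http://localhost:4202
```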
Integration Guide
The container exposes a simple HTTP API. Your LiveKit agent calls /launch to start an avatar session. There are two ways to integrate:
Option 1: LiveKit Python Plugin (Recommended)
Install the bitHuman LiveKit plugin and point an AvatarSession at your container's /launch endpoint. The plugin calls /launch automatically when a participant joins; see the sketch below.
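A minimal sketch, assuming the plugin is published as livekit-plugins-bithuman and that AvatarSession accepts the container endpoint via an api_url-style argument; check the plugin reference for the exact parameter names:

```python
# pip install livekit-agents livekit-plugins-bithuman   # assumed package names
from livekit.agents import AgentSession, JobContext
from livekit.plugins import bithuman

async def entrypoint(ctx: JobContext):
    session = AgentSession(
        # ... your STT/LLM/TTS pipeline ...
    )

    # Point the avatar worker at the self-hosted container.
    # `api_url` is an assumed parameter name; see the plugin docs.
    avatar = bithuman.AvatarSession(
        api_url="http://localhost:8089/launch",
        avatar_image="https://example.com/avatar.png",  # hypothetical image URL
    )
    await avatar.start(session, room=ctx.room)
```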
Option 2: Direct HTTP API
You can call /launch directly from any HTTP client. The container joins the LiveKit room as a video publisher.
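For example, a minimal curl call (field names from the reference below; values are placeholders):

```bash
curl -X POST http://localhost:8089/launch \
  -F livekit_url=ws://livekit:7880 \
  -F livekit_token=<token-with-publish-permissions> \
  -F room_name=my-room \
  -F avatar_image=@avatar.jpg
```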
HTTP API Reference
All endpoints are served on port 8089 (default).
POST /launch
Start an avatar session for a LiveKit room. The container joins the room and begins streaming lip-synced video.
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| livekit_url | string | Yes | LiveKit server WebSocket URL (e.g. ws://livekit:7880) |
| livekit_token | string | Yes | LiveKit room token with publish permissions |
| room_name | string | Yes | LiveKit room name (must match the token) |
| avatar_image | file | No* | Avatar image file upload (JPEG/PNG) |
| avatar_image_url | string | No* | HTTPS URL of the avatar image (alternative to file upload) |
| prompt | string | No | Motion prompt (default: "A person is talking naturally.") |
| api_secret | string | No | Override billing secret (defaults to BITHUMAN_API_SECRET) |
| async_mode | bool | No | Return immediately (true, default) or wait for the session to end |
*Provide either avatar_image or avatar_image_url. If neither is given, a default image is used.
Response (async_mode=true):
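An illustrative response shape (only task_id and the status lifecycle are implied elsewhere on this page; treat other details as unspecified):

```json
{
  "task_id": "a1b2c3d4",
  "status": "pending"
}
```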
Errors:
- 503 Service Unavailable — container still initializing, or at session capacity
- 400 Bad Request — invalid image or download failed
GET /health
Lightweight health check. Always returns 200 once the container is running (even during model loading).
GET /ready
Readiness check. Returns 200 only when the model is loaded and a session slot is available. Use this to gate traffic in load balancers or health checks.
Returns 503 with "status": "not_ready" during model loading, or "status": "at_capacity" when all session slots are in use.
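For example, to gate traffic until the container can accept a session (default port assumed):

```bash
# Poll until /ready returns 200 (model loaded and a slot is free)
until curl -fsS http://localhost:8089/ready >/dev/null; do sleep 2; done
```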
GET /tasks
List all sessions (active and completed).
GET /tasks/{task_id}
Check the status of a specific session.
Status lifecycle: pending → running → completed / failed / cancelled
POST /tasks/{task_id}/stop
Stop a running session and release the session slot.
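Quick examples against the default port:

```bash
curl http://localhost:8089/tasks                         # list all sessions
curl http://localhost:8089/tasks/<task_id>               # status of one session
curl -X POST http://localhost:8089/tasks/<task_id>/stop  # force-release its slot
```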
POST /benchmark
Run an inference benchmark and return per-stage timing. Useful for verifying GPU performance.
GET /test-frame
Generate a few chunks and return the last frame as a JPEG. Useful for verifying the model is producing valid output.
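Both are handy for smoke-testing a fresh deployment (default port assumed):

```bash
curl -X POST http://localhost:8089/benchmark          # per-stage inference timing
curl -o frame.jpg http://localhost:8089/test-frame    # should be a valid JPEG
```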
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| BITHUMAN_API_SECRET | Yes | — | API secret for billing and weight download |
| MAX_SESSIONS | No | 8 | Max concurrent avatar sessions per container |
| SESSION_IDLE_TIMEOUT | No | 600 | Seconds an idle session is held before auto-release. Lower (e.g. 120) if standby/widget sessions accumulate on shared GPUs. |
| AUTO_PREWARM | No | true | Load model and compile at startup. Set false only for cold-start testing. |
| TASK_RETENTION_SECONDS | No | 3600 | How long completed/failed task records are kept in /tasks history. |
| MAX_TASK_HISTORY | No | 1000 | Hard cap on retained terminal tasks; oldest are dropped first. Active sessions are never evicted. |
| PYTORCH_CUDA_ALLOC_CONF | No | expandable_segments:True | Set in the image. Allows the CUDA allocator to grow/shrink dynamically — keeps VRAM steady across sessions. Don't override unless you know why. |
| CUDA_VISIBLE_DEVICES | No | all GPUs | Restrict the container to a specific GPU (e.g. 0) |
Concurrency & Long-Running Containers
The container is designed to run continuously. The model is loaded once at startup (~5 GB of GPU weights) and shared across all sessions; each session adds ~50 MB of per-session GPU state on top. This keeps VRAM steady: it grows with the number of sessions running concurrently, not with the total number of sessions you have served.

Tuning concurrency
Set MAX_SESSIONS to match your GPU. Each active session uses roughly 600 MB of additional VRAM beyond the shared 5 GB baseline.
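As a rough sizing rule, VRAM ≈ 5 GB + 0.6 GB × MAX_SESSIONS: the default MAX_SESSIONS=8 budgets about 9.8 GB, while MAX_SESSIONS=4 fits in roughly 7.4 GB, just inside the 8 GB minimum.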
POST /launch returns 503 at_capacity when all slots are used; clients should back off or scale horizontally (run a second container).
Reclaim idle GPU slots faster
Sessions sit idle for SESSION_IDLE_TIMEOUT seconds (default 10 minutes) before being auto-released back to the pool. If you embed the avatar in widgets where users open and abandon tabs, idle slots can accumulate and exhaust capacity; see the fragment below.
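For example, as a compose fragment (the 120-second value mirrors the suggestion in the environment variables table):

```yaml
environment:
  SESSION_IDLE_TIMEOUT: "120"  # reclaim abandoned widget sessions after 2 minutes
```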
Memory monitoring
Container RSS should stabilize within the first hour of runtime (after PyTorch's CUDA allocator settles). If RSS keeps climbing beyond that:
- Check the /tasks count. If it grows without bound, pull a recent image; current images prune terminal task records every 5 minutes.
- Lower MAX_TASK_HISTORY and/or TASK_RETENTION_SECONDS to keep less task history in memory.

Production resource limits (recommended for compose)
For long-running deployments, cap container resources so a runaway container can't take down the host. The full example in bithuman-examples/expression-selfhosted ships with all of these defaults.
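A sketch of what such limits can look like in compose (the specific values are illustrative; only init: true is called out elsewhere on this page):

```yaml
services:
  avatar:
    image: sgubithuman/expression-avatar:latest
    # (GPU access flags omitted; see the Quick Start)
    init: true               # PID 1 reaps subprocesses (see Troubleshooting)
    restart: unless-stopped
    mem_limit: 16g           # illustrative host-RAM cap
    pids_limit: 512          # illustrative
```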
Performance Characteristics
| GPU Tier | VRAM Usage | Concurrent Sessions |
|---|---|---|
| High-end (data center) | ~6 GB | up to 8 concurrent |
| High-end (consumer) | ~6 GB | up to 4 concurrent |
| Mid-range | ~6 GB | up to 2 concurrent |

| Configuration | Time to First Frame | Description |
|---|---|---|
| Long-running container | ~4–6 seconds | Model loaded at startup; new sessions encode image (~2s) then stream |
| Cold start | ~48 seconds | Full GPU model compilation on first start (cached on subsequent starts) |
Long-Running Containers (Recommended)
Keep the container running between sessions. The model loads once at startup (~48 s including GPU compilation), and subsequent sessions start in ~4–6 seconds.

Troubleshooting
| Problem | Solution |
|---|---|
| Container won’t start | Check GPU: nvidia-smi; check logs: docker logs <id> |
| First start takes >5 minutes | Normal — weights are downloading (~4.7 GB). Check logs for download progress. |
| Download fails with 401 | Verify BITHUMAN_API_SECRET is set and valid |
| Download fails with connection error | Check outbound internet access from the container |
| /health returns connection refused | Container still initializing — check docker logs <id> for startup progress |
| /launch returns 503 not_ready | Model still loading — poll /ready until model_ready: true |
| /launch returns 503 at_capacity | All session slots in use; increase MAX_SESSIONS or scale horizontally |
| Startup takes >2 minutes (after download) | GPU compilation runs once per container — subsequent starts reuse compiled cache |
| Out of memory (GPU) | Use a GPU with ≥8 GB VRAM; reduce MAX_SESSIONS if needed |
| Container RSS keeps climbing | Check /tasks count — if growing without bound, ensure you are on a recent image (docker pull sgubithuman/expression-avatar:latest) which prunes terminal task records every 5 minutes. Lower MAX_TASK_HISTORY if needed. |
| VRAM creeps up over time on idle container | Idle/standby sessions are not being reclaimed. Lower SESSION_IDLE_TIMEOUT (e.g. 120s for widget embeds). |
| Sessions stuck at “running” forever | Check the LiveKit room — the client may have disconnected without the worker noticing. POST /tasks/{id}/stop to force release. Idle timeout will also reclaim them after SESSION_IDLE_TIMEOUT. |
| Orphan GPU processes after container restart | nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv — kill leftover PIDs (kill -9 <pid>). Use init: true in compose to ensure PID 1 reaps subprocesses. |
| 503 at_capacity under load | Increase MAX_SESSIONS (if VRAM allows) or run a second container; round-robin via reverse proxy. |
| Container memory spikes during inference but doesn’t release | Normal for PyTorch’s caching allocator. With PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (default), it will return to baseline within minutes of idle. |
| Billing not working | Verify BITHUMAN_API_SECRET is set and valid |
| Avatar image not showing | Check /test-frame — if it returns a valid JPEG, image encoding is working |
Next Steps
- Docker Example — full GPU setup with LiveKit server, AI agent, and web frontend
- LiveKit Agents — LiveKit Agents Python SDK documentation
