Preview Feature — 2 credits per minute while using the GPU container.
Overview
The self-hosted GPU avatar container (docker.io/sgubithuman/expression-avatar:latest) enables production-grade avatar generation on your own GPU infrastructure.
- Full Control — Complete control over deployment, scaling, and configuration
- Cost Optimization — Pay only for the GPU resources you use
- Data Privacy — Avatar images and audio never leave your infrastructure
- Customization — Extend the worker with custom logic and integrations
How It Works
The container is a GPU worker that joins a LiveKit room and streams avatar video frames in real time. Your application calls the /launch endpoint with LiveKit room credentials and an avatar image; the container connects to the room, listens for audio, and generates lip-synced video at 25 FPS — entirely on your GPU.
Your Agent (LiveKit)
│
│ POST /launch
│ { livekit_url, livekit_token, room_name, avatar_image }
▼
expression-avatar container
│
├─ Joins LiveKit room as video publisher
├─ Receives audio from agent via data stream
└─ Generates 25 FPS lip-synced video → streams to room
↑
100% local GPU — no cloud calls during inference
Prerequisites
- An NVIDIA GPU with at least 8 GB of VRAM (inference uses ~6 GB)
- Docker with the NVIDIA Container Toolkit (required for --gpus all)
- A bitHuman API secret (used for billing and the automatic weight download)
Quick Start
Model weights download automatically on first run — just provide your API secret:
# 1. Pull the image (includes wav2vec2 audio encoder, ~360 MB)
docker pull docker.io/sgubithuman/expression-avatar:latest
# 2. Run — proprietary weights (~4.7 GB) download automatically on first start
docker run --gpus all -p 8089:8089 \
-v bithuman-models:/data/models \
-e BITHUMAN_API_SECRET=your_api_secret \
docker.io/sgubithuman/expression-avatar:latest
# 3. Wait for startup (first run: ~3 min download + ~48s GPU compilation)
# Subsequent starts: ~48s (weights already cached in the named volume)
curl http://localhost:8089/health
# {"status": "healthy", "service": "expression-avatar", "active_sessions": 0, "max_sessions": 8}
The -v bithuman-models:/data/models named volume caches the downloaded weights so you only pay the download cost once.
Once healthy, the container is ready to accept avatar sessions via /launch.
Docker Compose Setup
Clone the examples repository for a complete setup with LiveKit, an AI agent, and a web frontend:
git clone https://github.com/bithuman-product/examples.git
cd examples/expression-selfhosted
# Configure environment
cp .env.example .env
# Edit .env with your API secret, OpenAI key, and avatar image
# Copy your avatar image into ./avatars/
mkdir -p avatars
cp /path/to/your/avatar.jpg avatars/
# Model weights download automatically on first run — nothing to pre-download!
docker compose up
Open http://localhost:4202 to start a conversation with your GPU avatar.
Integration Guide
The container exposes a simple HTTP API. Your LiveKit agent calls /launch to start an avatar session. There are two ways to integrate:
Option 1: LiveKit Python Plugin (Recommended)
Install the bitHuman LiveKit plugin:
pip install livekit-plugins-bithuman
In your LiveKit agent, point AvatarSession at your container’s /launch endpoint:
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, WorkerType, cli
from livekit.plugins import bithuman, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    await ctx.wait_for_participant()

    avatar = bithuman.AvatarSession(
        api_url="http://localhost:8089/launch",  # your container
        api_secret="your_api_secret",            # for billing
        avatar_image="/path/to/avatar.jpg",      # local file or HTTPS URL
    )

    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="coral"),
        vad=silero.VAD.load(),
    )

    await avatar.start(session, room=ctx.room)
    await session.start(
        agent=Agent(instructions="You are a helpful assistant."),
        room=ctx.room,
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, worker_type=WorkerType.ROOM))
The plugin handles room token generation and calls /launch automatically when a participant joins.
Option 2: Direct HTTP API
You can call /launch directly from any HTTP client. The container joins the LiveKit room as a video publisher.
# Generate a LiveKit room token first (using livekit-server-sdk or CLI)
TOKEN=$(livekit-token create --room my-room --identity avatar-worker \
--api-key devkey --api-secret your-livekit-secret)
# Launch with an image URL
curl -X POST http://localhost:8089/launch \
-F "livekit_url=ws://your-livekit-server:7880" \
-F "livekit_token=$TOKEN" \
-F "room_name=my-room" \
-F "avatar_image_url=https://example.com/avatar.jpg"
# Or upload an image file directly
curl -X POST http://localhost:8089/launch \
-F "livekit_url=ws://your-livekit-server:7880" \
-F "livekit_token=$TOKEN" \
-F "room_name=my-room" \
-F "avatar_image=@./avatar.jpg"
Response (async by default):
{
"status": "pending",
"task_id": "a1b2c3d4",
"room_name": "my-room"
}
The avatar is live in the room within ~4–6 seconds.
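The same flow can be sketched in Python using only the standard library. This is a minimal sketch, not the official client: build_multipart and launch_avatar are illustrative names, and it covers only the image-URL variant (a file upload would add a file part to the body):

```python
import json
import urllib.request
import uuid

def build_multipart(fields: dict) -> tuple[bytes, str]:
    """Encode text-only form fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = [
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        f"{value}\r\n"
        for name, value in fields.items()
    ]
    parts.append(f"--{boundary}--\r\n")
    return "".join(parts).encode(), f"multipart/form-data; boundary={boundary}"

def launch_avatar(base_url: str, livekit_url: str, token: str,
                  room: str, image_url: str) -> dict:
    """POST /launch with an image URL; returns the async JSON response."""
    body, content_type = build_multipart({
        "livekit_url": livekit_url,
        "livekit_token": token,
        "room_name": room,
        "avatar_image_url": image_url,
    })
    req = urllib.request.Request(
        f"{base_url}/launch",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling launch_avatar("http://localhost:8089", ...) returns the async response above; keep the task_id to check or stop the session later.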
HTTP API Reference
All endpoints are served on port 8089 (default).
POST /launch
Start an avatar session for a LiveKit room. The container joins the room and begins streaming lip-synced video.
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| livekit_url | string | Yes | LiveKit server WebSocket URL (e.g. ws://livekit:7880) |
| livekit_token | string | Yes | LiveKit room token with publish permissions |
| room_name | string | Yes | LiveKit room name (must match token) |
| avatar_image | file | No* | Avatar image file upload (JPEG/PNG) |
| avatar_image_url | string | No* | Avatar image HTTPS URL (alternative to file upload) |
| prompt | string | No | Motion prompt (default: "A person is talking naturally.") |
| api_secret | string | No | Override billing secret (defaults to BITHUMAN_API_SECRET) |
| async_mode | bool | No | Return immediately (true, default) or wait for the session to end |

*Provide either avatar_image or avatar_image_url. If neither is given, a default image is used.
Response (async_mode=true):
{ "status": "pending", "task_id": "a1b2c3d4", "room_name": "my-room" }
Error responses:
- 503 Service Unavailable — container still initializing, or at session capacity
- 400 Bad Request — invalid image or download failed
GET /health
Lightweight health check. Always returns 200 once the container is running (even during model loading).
{
"status": "healthy",
"service": "expression-avatar",
"active_sessions": 2,
"max_sessions": 8
}
GET /ready
Readiness check. Returns 200 only when the model is loaded and a session slot is available. Use this to gate traffic in load balancers or health checks.
{
"status": "ready",
"model_ready": true,
"active_sessions": 2,
"available_sessions": 6,
"max_sessions": 8
}
Returns 503 with "status": "not_ready" during model loading, or "status": "at_capacity" when all session slots are in use.
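From application code, the same gating can be done with a small poll loop. A sketch using the standard library, assuming the container is reachable at base_url (parse_ready and wait_until_ready are illustrative names):

```python
import json
import time
import urllib.error
import urllib.request

def parse_ready(payload: dict) -> bool:
    """True only when the model is loaded and a session slot is free."""
    return payload.get("status") == "ready" and payload.get("model_ready") is True

def wait_until_ready(base_url: str, timeout_s: float = 120.0,
                     interval_s: float = 2.0) -> bool:
    """Poll GET /ready until it reports ready, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/ready") as resp:
                if parse_ready(json.loads(resp.read())):
                    return True
        except urllib.error.URLError:
            # A 503 (not_ready / at_capacity) raises HTTPError, a URLError
            # subclass; connection refused during startup lands here too.
            pass
        time.sleep(interval_s)
    return False
```

Call wait_until_ready("http://localhost:8089") before sending /launch to avoid 503 responses during startup.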
GET /tasks
List all sessions (active and completed).
curl http://localhost:8089/tasks
{
"tasks": [
{
"task_id": "a1b2c3d4",
"room_name": "my-room",
"status": "running",
"created_at": "2024-01-01T12:00:00",
"completed_at": null,
"error": null
}
]
}
GET /tasks/{task_id}
Check the status of a specific session.
{
"task_id": "a1b2c3d4",
"room_name": "my-room",
"status": "running",
"created_at": "2024-01-01T12:00:00",
"completed_at": null,
"error": null
}
Status values: pending → running → completed / failed / cancelled
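The lifecycle above can be turned into a simple wait loop, e.g. to release resources in your application once a session ends. A sketch with illustrative names (is_terminal, wait_for_task):

```python
import json
import time
import urllib.request

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """True once a session has left the pending/running part of the lifecycle."""
    return status in TERMINAL_STATES

def wait_for_task(base_url: str, task_id: str,
                  interval_s: float = 2.0, timeout_s: float = 600.0) -> dict:
    """Poll GET /tasks/{task_id} until the session reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        with urllib.request.urlopen(f"{base_url}/tasks/{task_id}") as resp:
            task = json.loads(resp.read())
        if is_terminal(task["status"]) or time.monotonic() >= deadline:
            return task
        time.sleep(interval_s)
```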
POST /tasks/{task_id}/stop
Stop a running session and release the session slot.
curl -X POST http://localhost:8089/tasks/a1b2c3d4/stop
POST /benchmark
Run an inference benchmark and return per-stage timing. Useful for verifying GPU performance.
curl -X POST "http://localhost:8089/benchmark?iterations=10"
{
"iterations": 10,
"frames_per_generate": 24,
"avg_ms": 79.3,
"fps": 302.6,
"stages": {
"dit_ms": 41.2,
"vae_decode_ms": 13.1,
"vae_encode_ms": 8.5,
"color_correct_ms": 6.1,
"postprocess_ms": 2.8,
"audio_ms": 7.1
},
"vram_gb": 6.2,
"gpu": "NVIDIA GPU"
}
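The fps figure in the sample output is generation throughput: frames_per_generate divided by the average generation time. A quick way to sanity-check a benchmark result (helper names are illustrative):

```python
def effective_fps(frames_per_generate: int, avg_ms: float) -> float:
    """Frames produced per second of wall-clock generation time."""
    return frames_per_generate / (avg_ms / 1000.0)

def realtime_headroom(fps: float, playback_fps: float = 25.0) -> float:
    """How many times faster than real-time playback the GPU generates frames."""
    return fps / playback_fps
```

With the sample numbers, effective_fps(24, 79.3) is about 302.6, roughly 12x the 25 FPS playback rate; that headroom is what allows multiple concurrent sessions on one GPU.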
GET /test-frame
Generate a few chunks and return the last frame as a JPEG. Useful for verifying the model is producing valid output.
curl http://localhost:8089/test-frame --output frame.jpg
open frame.jpg
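To validate the downloaded frame programmatically rather than by opening it, checking for the JPEG SOI marker (bytes FF D8 FF) at the start of the file is usually enough; looks_like_jpeg and check_frame are illustrative names:

```python
def looks_like_jpeg(data: bytes) -> bool:
    """JPEG files begin with the SOI marker FF D8 followed by FF."""
    return data[:3] == b"\xff\xd8\xff"

def check_frame(path: str) -> bool:
    """Read just the first bytes of the file and test the magic number."""
    with open(path, "rb") as f:
        return looks_like_jpeg(f.read(3))
```

If check_frame("frame.jpg") returns False, the endpoint likely returned an error body instead of an image.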
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| BITHUMAN_API_SECRET | Yes | — | API secret for billing and weight download |
| MAX_SESSIONS | No | 8 | Max concurrent avatar sessions |
| CUDA_VISIBLE_DEVICES | No | all GPUs | Restrict to a specific GPU (e.g. 0) |
| BITHUMAN_API_URL | No | https://api.bithuman.ai | Override API endpoint (for testing) |
| FAST_DECODER_CONFIG | No | — | Path to fast decoder config JSON (optional speedup) |
| FAST_DECODER_CHECKPOINT | No | — | Path to fast decoder weights (optional speedup) |
Without BITHUMAN_API_SECRET, avatar sessions will run but usage will not be tracked or billed. This is not permitted for production use.
| GPU Tier | VRAM Usage | Concurrent Sessions |
|---|---|---|
| High-end (data center) | ~6 GB | up to 8 concurrent |
| High-end (consumer) | ~6 GB | up to 4 concurrent |
| Mid-range | ~6 GB | up to 2 concurrent |
| Configuration | Time to First Frame | Description |
|---|---|---|
| Long-running container | ~4–6 seconds | Model loaded at startup; new sessions encode image (~2s) then stream |
| Cold start | ~48 seconds | Full GPU model compilation on first start (cached on subsequent starts) |
Long-Running Containers (Recommended)
Keep the container running between sessions. The model loads once at startup (~48s including GPU compilation), and subsequent sessions start in ~4–6 seconds.
docker run --gpus all -p 8089:8089 --restart always \
-v bithuman-models:/data/models \
-e BITHUMAN_API_SECRET=your_api_secret \
docker.io/sgubithuman/expression-avatar:latest
Troubleshooting
| Problem | Solution |
|---|---|
| Container won’t start | Check GPU: nvidia-smi; check logs: docker logs <id> |
| First start takes >5 minutes | Normal — weights are downloading (~4.7 GB). Check logs for download progress. |
| Download fails with 401 | Verify BITHUMAN_API_SECRET is set and valid |
| Download fails with connection error | Check outbound internet access from the container |
| /health returns connection refused | Container still initializing — wait for PREWARM: Pipeline loaded in logs |
| /launch returns 503 not_ready | Model still loading — poll /ready until model_ready: true |
| /launch returns 503 at_capacity | All session slots in use; increase MAX_SESSIONS or scale horizontally |
| Startup takes >2 minutes (after download) | GPU compilation runs once per container — subsequent starts reuse compiled cache |
| Out of memory | Use a GPU with ≥8 GB VRAM; reduce MAX_SESSIONS if needed |
| Billing not working | Verify BITHUMAN_API_SECRET is set; check logs for [HEARTBEAT] messages |
| Avatar image not showing | Check /test-frame — if it returns a valid JPEG, image encoding is working |
Next Steps