Preview Feature — 2 credits per minute while using the GPU container.

Overview

The self-hosted GPU avatar container (docker.io/sgubithuman/expression-avatar:latest) enables production-grade avatar generation on your own GPU infrastructure.
  • Full Control — Complete control over deployment, scaling, and configuration
  • Cost Optimization — Pay only for the GPU resources you use
  • Data Privacy — Avatar images and audio never leave your infrastructure
  • Customization — Extend the worker with custom logic and integrations

How It Works

The container is a GPU worker that joins a LiveKit room and streams avatar video frames in real time. Your application calls the /launch endpoint with LiveKit room credentials and an avatar image; the container connects to the room, listens for audio, and generates lip-synced video at 25 FPS — entirely on your GPU.
Your Agent (LiveKit)

      │  POST /launch
      │  { livekit_url, livekit_token, room_name, avatar_image }

expression-avatar container

      ├─ Joins LiveKit room as video publisher
      ├─ Receives audio from agent via data stream
      └─ Generates 25 FPS lip-synced video → streams to room

         100% local GPU — no cloud calls during inference

Prerequisites

  • An NVIDIA GPU with at least 8 GB of VRAM
  • Docker with the NVIDIA Container Toolkit (required for --gpus all)
  • A bitHuman API secret (BITHUMAN_API_SECRET), used for billing and weight download
  • A LiveKit server reachable from the container

Quick Start

Model weights download automatically on first run — just provide your API secret:
# 1. Pull the image (includes wav2vec2 audio encoder, ~360 MB)
docker pull docker.io/sgubithuman/expression-avatar:latest

# 2. Run — proprietary weights (~4.7 GB) download automatically on first start
docker run --gpus all -p 8089:8089 \
    -v bithuman-models:/data/models \
    -e BITHUMAN_API_SECRET=your_api_secret \
    docker.io/sgubithuman/expression-avatar:latest
# 3. Wait for startup (first run: ~3 min download + ~48s GPU compilation)
#    Subsequent starts: ~48s (weights already cached in the named volume)
curl http://localhost:8089/health
# {"status": "healthy", "service": "expression-avatar", "active_sessions": 0, "max_sessions": 8}
The -v bithuman-models:/data/models named volume caches the downloaded weights so you only pay the download cost once. Once healthy, the container is ready to accept avatar sessions via /launch.
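Because the first start spends minutes downloading and compiling before /health answers, a deployment script typically polls the endpoint before launching sessions. A minimal sketch using only the standard library (the fetch hook exists purely to make the function testable):

```python
import json
import time
import urllib.request

def wait_until_healthy(url="http://localhost:8089/health",
                       timeout=300, interval=5, fetch=None):
    """Poll GET /health until the container reports status == "healthy".

    `fetch` is injectable for testing; by default it performs the HTTP GET.
    """
    if fetch is None:
        fetch = lambda: json.load(urllib.request.urlopen(url, timeout=10))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            body = fetch()
            if body.get("status") == "healthy":
                return body
        except OSError:
            pass  # connection refused: container still initializing
        time.sleep(interval)
    raise TimeoutError("container did not become healthy in time")
```

Remember that /health only means the HTTP server is up; gate actual traffic on /ready (described below).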

Docker Compose Setup

Use the full example for a complete setup with LiveKit, an AI agent, and a web frontend:
git clone https://github.com/bithuman-product/examples.git
cd examples/expression-selfhosted

# Configure environment
cp .env.example .env
# Edit .env with your API secret, OpenAI key, and avatar image

# Copy your avatar image into ./avatars/
mkdir -p avatars
cp /path/to/your/avatar.jpg avatars/

# Model weights download automatically on first run — nothing to pre-download!
docker compose up
Open http://localhost:4202 to start a conversation with your GPU avatar.

Integration Guide

The container exposes a simple HTTP API. Your LiveKit agent calls /launch to start an avatar session. There are two ways to integrate.

Option 1: LiveKit Plugin

Install the bitHuman LiveKit plugin:
pip install livekit-plugins-bithuman
In your LiveKit agent, point AvatarSession at your container’s /launch endpoint:
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, WorkerType, cli
from livekit.plugins import bithuman, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()
    await ctx.wait_for_participant()

    avatar = bithuman.AvatarSession(
        api_url="http://localhost:8089/launch",   # your container
        api_secret="your_api_secret",              # for billing
        avatar_image="/path/to/avatar.jpg",        # local file or HTTPS URL
    )

    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="coral"),
        vad=silero.VAD.load(),
    )

    await avatar.start(session, room=ctx.room)
    await session.start(
        agent=Agent(instructions="You are a helpful assistant."),
        room=ctx.room,
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, worker_type=WorkerType.ROOM))
The plugin handles room token generation and calls /launch automatically when a participant joins.

Option 2: Direct HTTP API

You can call /launch directly from any HTTP client. The container joins the LiveKit room as a video publisher.
# Generate a LiveKit room token first (using livekit-server-sdk or CLI)
TOKEN=$(livekit-token create --room my-room --identity avatar-worker \
    --api-key devkey --api-secret your-livekit-secret)

# Launch with an image URL
curl -X POST http://localhost:8089/launch \
  -F "livekit_url=ws://your-livekit-server:7880" \
  -F "livekit_token=$TOKEN" \
  -F "room_name=my-room" \
  -F "avatar_image_url=https://example.com/avatar.jpg"

# Or upload an image file directly
curl -X POST http://localhost:8089/launch \
  -F "livekit_url=ws://your-livekit-server:7880" \
  -F "livekit_token=$TOKEN" \
  -F "room_name=my-room" \
  -F "avatar_image=@./avatar.jpg"
Response (async by default):
{
  "status": "pending",
  "task_id": "a1b2c3d4",
  "room_name": "my-room"
}
The avatar is live in the room within ~4–6 seconds.
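The same launch call can be made from Python without any third-party HTTP client. This sketch hand-rolls the multipart/form-data encoding for text fields only (a file upload would additionally need a filename and binary part), so treat it as a starting point rather than a full client:

```python
import json
import urllib.request
import uuid

def multipart_body(fields):
    """Encode simple text fields as a multipart/form-data body (no file parts)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", str(value)]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode()
    return body, f"multipart/form-data; boundary={boundary}"

def launch(base, livekit_url, livekit_token, room_name, avatar_image_url):
    """POST /launch with an image URL; returns the decoded JSON response."""
    body, ctype = multipart_body({
        "livekit_url": livekit_url,
        "livekit_token": livekit_token,
        "room_name": room_name,
        "avatar_image_url": avatar_image_url,
    })
    req = urllib.request.Request(f"{base}/launch", data=body,
                                 headers={"Content-Type": ctype}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In async mode the returned task_id is what you poll via /tasks/{task_id} to track the session.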

HTTP API Reference

All endpoints are served on port 8089 (default).

POST /launch

Start an avatar session for a LiveKit room. The container joins the room and begins streaming lip-synced video.

Content-Type: multipart/form-data
  • livekit_url (string, required) — LiveKit server WebSocket URL (e.g. ws://livekit:7880)
  • livekit_token (string, required) — LiveKit room token with publish permissions
  • room_name (string, required) — LiveKit room name (must match the token)
  • avatar_image (file, optional*) — Avatar image file upload (JPEG/PNG)
  • avatar_image_url (string, optional*) — Avatar image HTTPS URL (alternative to file upload)
  • prompt (string, optional) — Motion prompt (default: "A person is talking naturally.")
  • api_secret (string, optional) — Override billing secret (defaults to BITHUMAN_API_SECRET)
  • async_mode (bool, optional) — Return immediately (true, default) or wait for the session to end

*Provide either avatar_image or avatar_image_url. If neither is given, a default image is used.

Response (async_mode=true):
{ "status": "pending", "task_id": "a1b2c3d4", "room_name": "my-room" }
Error responses:
  • 503 Service Unavailable — container still initializing, or at session capacity
  • 400 Bad Request — invalid image or download failed
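Since 503 covers both transient conditions (still initializing, temporarily at capacity), a caller may want to retry it with backoff while treating 400 as fatal. A small sketch, with the HTTP call injected so the policy itself stays testable:

```python
import time

def launch_with_retry(do_launch, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Retry a /launch call while the container answers 503; give up on
    any other status.

    `do_launch` performs the actual POST and returns (status_code, body).
    """
    status, body = do_launch()
    for attempt in range(attempts - 1):
        if status != 503:
            break
        sleep(base_delay * (2 ** attempt))  # exponential backoff: 2 s, 4 s, 8 s, ...
        status, body = do_launch()
    return status, body
```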

GET /health

Lightweight health check. Always returns 200 once the container is running (even during model loading).
{
  "status": "healthy",
  "service": "expression-avatar",
  "active_sessions": 2,
  "max_sessions": 8
}

GET /ready

Readiness check. Returns 200 only when the model is loaded and a session slot is available. Use this to gate traffic in load balancers or health checks.
{
  "status": "ready",
  "model_ready": true,
  "active_sessions": 2,
  "available_sessions": 6,
  "max_sessions": 8
}
Returns 503 with "status": "not_ready" during model loading, or "status": "at_capacity" when all session slots are in use.
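When running several container replicas, the /ready body can also drive simple load-aware routing. A sketch (the replica URLs are placeholders) that picks the replica with the most free session slots, skipping any that report not_ready or at_capacity:

```python
def pick_replica(replicas):
    """Choose the replica with the most free session slots.

    `replicas` is a list of (base_url, ready_json) pairs, where ready_json
    is the decoded GET /ready body, or None if the request failed entirely.
    """
    best_url, best_free = None, 0
    for url, ready in replicas:
        if not ready or ready.get("status") != "ready":
            continue  # not_ready or at_capacity: skip this replica
        free = ready.get("available_sessions", 0)
        if free > best_free:
            best_url, best_free = url, free
    return best_url  # None if no replica can take a session
```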

GET /tasks

List all sessions (active and completed).
curl http://localhost:8089/tasks
{
  "tasks": [
    {
      "task_id": "a1b2c3d4",
      "room_name": "my-room",
      "status": "running",
      "created_at": "2024-01-01T12:00:00",
      "completed_at": null,
      "error": null
    }
  ]
}

GET /tasks/{task_id}

Check the status of a specific session.
{
  "task_id": "a1b2c3d4",
  "room_name": "my-room",
  "status": "running",
  "created_at": "2024-01-01T12:00:00",
  "completed_at": null,
  "error": null
}
Status values: pending → running → completed / failed / cancelled
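In async mode, a caller can poll this endpoint until the session reaches a terminal status. A minimal sketch, with the fetch, clock, and sleep injected so the loop is testable:

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_task(get_task, timeout=600, interval=2,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll GET /tasks/{task_id} until the session reaches a terminal status.

    `get_task` fetches and JSON-decodes the task record.
    """
    deadline = clock() + timeout
    while True:
        task = get_task()
        if task["status"] in TERMINAL_STATUSES:
            return task
        if clock() >= deadline:
            raise TimeoutError(f"task still {task['status']} after {timeout}s")
        sleep(interval)
```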

POST /tasks/{task_id}/stop

Stop a running session and release the session slot.
curl -X POST http://localhost:8089/tasks/a1b2c3d4/stop
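Combining /tasks and /stop gives a simple drain routine for taking a container offline gracefully. A sketch (the filtering helper is split out so it can be tested without a live container):

```python
import json
import urllib.request

def active_task_ids(tasks_body):
    """IDs of sessions that still hold a slot, from the GET /tasks body."""
    return [t["task_id"] for t in tasks_body.get("tasks", [])
            if t["status"] in ("pending", "running")]

def drain(base="http://localhost:8089"):
    """Stop every active session, e.g. before redeploying the container."""
    with urllib.request.urlopen(f"{base}/tasks", timeout=10) as resp:
        body = json.load(resp)
    for task_id in active_task_ids(body):
        req = urllib.request.Request(f"{base}/tasks/{task_id}/stop", method="POST")
        urllib.request.urlopen(req, timeout=10)
```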

POST /benchmark

Run an inference benchmark and return per-stage timing. Useful for verifying GPU performance.
curl -X POST "http://localhost:8089/benchmark?iterations=10"
{
  "iterations": 10,
  "frames_per_generate": 24,
  "avg_ms": 79.3,
  "fps": 302.6,
  "stages": {
    "dit_ms": 41.2,
    "vae_decode_ms": 13.1,
    "vae_encode_ms": 8.5,
    "color_correct_ms": 6.1,
    "postprocess_ms": 2.8,
    "audio_ms": 7.1
  },
  "vram_gb": 6.2,
  "gpu": "NVIDIA GPU"
}
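The benchmark numbers translate directly into a real-time headroom figure: one generate call produces frames_per_generate frames (i.e. frames_per_generate / 25 seconds of video at the 25 FPS output rate) in avg_ms. A small helper for that arithmetic:

```python
def realtime_headroom(bench, output_fps=25):
    """How many times faster than real time the GPU generates video,
    computed from the POST /benchmark response."""
    video_seconds_per_chunk = bench["frames_per_generate"] / output_fps
    compute_seconds_per_chunk = bench["avg_ms"] / 1000.0
    return video_seconds_per_chunk / compute_seconds_per_chunk
```

For the sample response above, 24 frames are 0.96 s of video generated in 79.3 ms, roughly 12x real time (consistent with 302.6 FPS against a 25 FPS output). Note that headroom is an upper bound, not a session count: per-session overhead and memory also limit concurrency.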

GET /test-frame

Generate a few chunks and return the last frame as a JPEG. Useful for verifying the model is producing valid output.
curl http://localhost:8089/test-frame --output frame.jpg
open frame.jpg

Environment Variables

  • BITHUMAN_API_SECRET (required) — API secret for billing and weight download
  • MAX_SESSIONS (default: 8) — Maximum concurrent avatar sessions
  • CUDA_VISIBLE_DEVICES (default: all GPUs) — Restrict to a specific GPU (e.g. 0)
  • BITHUMAN_API_URL (default: https://api.bithuman.ai) — Override the API endpoint (for testing)
  • FAST_DECODER_CONFIG (optional) — Path to fast decoder config JSON (optional speedup)
  • FAST_DECODER_CHECKPOINT (optional) — Path to fast decoder weights (optional speedup)
Without BITHUMAN_API_SECRET, avatar sessions will run but usage will not be tracked or billed. This is not permitted for production use.

Performance Characteristics

GPU tier sizing (all figures approximate):
  • High-end data center GPU — ~6 GB VRAM, up to 8 concurrent sessions
  • High-end consumer GPU — ~6 GB VRAM, up to 4 concurrent sessions
  • Mid-range GPU — ~6 GB VRAM, up to 2 concurrent sessions

Time to first frame:
  • Long-running container — ~4–6 seconds (model loaded at startup; new sessions encode the image in ~2 s, then stream)
  • Cold start — ~48 seconds (full GPU model compilation on first start; cached for subsequent starts)
Keep the container running between sessions. The model loads once at startup (~48s including GPU compilation), and subsequent sessions start in ~4–6 seconds.
docker run --gpus all -p 8089:8089 --restart always \
    -v bithuman-models:/data/models \
    -e BITHUMAN_API_SECRET=your_api_secret \
    docker.io/sgubithuman/expression-avatar:latest

Troubleshooting

  • Container won't start — Check the GPU with nvidia-smi; check logs with docker logs <id>
  • First start takes >5 minutes — Normal: weights are downloading (~4.7 GB). Check logs for download progress.
  • Download fails with 401 — Verify BITHUMAN_API_SECRET is set and valid
  • Download fails with a connection error — Check outbound internet access from the container
  • /health returns connection refused — Container still initializing; wait for PREWARM: Pipeline loaded in the logs
  • /launch returns 503 not_ready — Model still loading; poll /ready until model_ready: true
  • /launch returns 503 at_capacity — All session slots in use; increase MAX_SESSIONS or scale horizontally
  • Startup takes >2 minutes (after download) — GPU compilation runs once per container; subsequent starts reuse the compiled cache
  • Out of memory — Use a GPU with ≥8 GB VRAM; reduce MAX_SESSIONS if needed
  • Billing not working — Verify BITHUMAN_API_SECRET is set; check logs for [HEARTBEAT] messages
  • Avatar image not showing — Check /test-frame: if it returns a valid JPEG, image encoding is working
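The first few rows of the table reduce to a decision on two probes: whether /health answers at all, and what /ready reports. A pure triage helper capturing that logic (callers supply the probe results, so no network access is needed here):

```python
def classify_state(health_reachable, ready_status_code, ready_body):
    """Map /health reachability plus the /ready response to a likely state.

    `health_reachable` is whether GET /health answered at all;
    `ready_body` is the decoded GET /ready JSON (may be None).
    """
    if not health_reachable:
        return "not reachable: container starting, downloading weights, or crashed (check docker logs)"
    if ready_status_code == 200:
        return "ready"
    if ready_body and ready_body.get("status") == "at_capacity":
        return "at capacity: raise MAX_SESSIONS or add a replica"
    return "model loading: poll /ready until model_ready is true"
```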

Next Steps