Android SDK (Kotlin)
On-device avatar runtime for Android — a self-contained AAR via Maven Central. arm64-v8a, Android 10+. Beta.
Overview
ai.bithuman:sdk is a self-contained Android AAR. Audio in (16 kHz mono PCM),
25 FPS Bitmap / packed-BGR frames out. All inference runs on-device; a
once-per-minute billing heartbeat meters avatar mode. The AAR bundles every
native library — your app adds no ONNX Runtime, OpenSSL, or other system
dependency. This SDK is Beta.
Install
// app/build.gradle.kts
android {
defaultConfig {
ndk { abiFilters += setOf("arm64-v8a") }
minSdk = 29
}
}
dependencies { implementation("ai.bithuman:sdk:1.17.1") }
| Field | Value |
|---|---|
| Maven coordinate | ai.bithuman:sdk:1.17.1 |
| ABI | arm64-v8a only |
| Engine ABI | v6 (v7 refresh queued — track via release notes) |
| Min / Compile SDK | 29 (Android 10) / 35 |
| NDK | 28.0.13004108 |
| AAR size | ~40 MB |
mavenCentral() is in settings.gradle.kts by default in new projects.
armeabi-v7a / x86_64 are not supported — file an issue if you need them. The
Kotlin SDK has its own release cadence and is not locked to the Python
bithuman 2.x line.
Auth
Set your secret before the first Avatar.load / Fixture call:
class App : Application() {
override fun onCreate() {
super.onCreate()
System.setProperty("BITHUMAN_API_SECRET", BuildConfig.BITHUMAN_API_SECRET)
}
}
Get a secret at Developer → API Keys. The library exchanges it for a short-lived runtime token at startup and renews on the heartbeat (5-minute offline grace).
High-level: one avatar, one conversation
Push the model + a driving clip to the device, then:
import ai.bithuman.sdk.Avatar
Avatar.load("${filesDir}/avatar.imx").use { avatar ->
// From a file (WAV / MP3 / M4A, decoded via MediaExtractor):
avatar.composeFromFile("${filesDir}/speech.wav").forEach { frame ->
// frame.bgr is width*height*3 packed BGR uint8
}
// Or from raw 16 kHz mono FloatArray PCM (e.g. live mic):
avatar.composeAsBitmaps(pcm).forEach { bmp -> imageView.setImageBitmap(bmp) }
}
Each tick consumes 640 samples (40 ms @ 16 kHz). composeAsBitmaps needs
pcm.size >= avatar.samplesPerTick or it returns empty.
Streaming: long live conversations
The right shape when audio arrives incrementally (mic, WebRTC sink, TTS). Each push is constant cost; each pull returns one frame — session length never degrades per-frame performance. This is the Android expression of the audio-streaming push/drain loop.
import ai.bithuman.sdk.Fixture
import ai.bithuman.sdk.Runtime
Fixture(modelPath).use { fx ->
val rt = Runtime(fx)
val info = fx.info()
val frameOut = ByteArray(info.frameWidth * info.frameHeight * 3)
fun onAudio(pcm: ShortArray) { // from AudioRecord / WebRTC
rt.pushAudio(pcm)
while (rt.ticksAvailable > 0) {
rt.pullFrame(frameOut, frameIdxHint = -1)
renderer.present(frameOut, info.frameWidth, info.frameHeight)
}
}
fun onEndOfTurn() = rt.resetStream() // barge-in / agent switch
}
pushAudio and pullFrame are independent: push as audio arrives, pull on a
40 ms Choreographer tick. A single Runtime is not internally
synchronized — pin push/pull to one thread or wrap in your own mutex.
Multi-conversation hosts share one Fixture across many Runtimes to
amortize the model load:
Fixture(modelPath).use { fx ->
val rtA = Runtime(fx)
val rtB = Runtime(fx)
// ... interleave ticks per conversation
}
Default execution provider is CPU (predictable, identical across platforms).
NNAPI / QNN are accepted but currently no-op to CPU.
API surface
Public types in ai.bithuman.sdk:
| Tier | Types |
|---|---|
| High-level | Avatar, ComposedFrame |
| Low-level | Fixture, FixtureInfo, Runtime, ComposeResult |
| Config | ExecutionProvider |
| Errors | BithumanException |
| Auth | BithumanAuth, AuthState |
Fixture.libraryVersion() / Fixture.abiVersion() expose the linked
libessence version and ABI.
Hardware
arm64-v8a only. Runs on modern Android silicon (Snapdragon 8 Gen 1+, Tensor
G2+). Older arm64 chips are uncharacterized — treat as unsupported for production
until measured.
Measured on a Snapdragon 8 Gen 2 (Z Fold 5), Essence, CPU EP:
| Metric | Value |
|---|---|
| Tight-loop mean | 3.96 ms |
| Sustained FPS | 252 |
| RSS peak (PSS) | 139 MB |
Comfortable headroom over the 25 FPS / 40 ms tick budget (~10× faster than realtime in a tight loop).
Troubleshooting
UnsatisfiedLinkError on launch
Build variant didn’t include arm64-v8a — check abiFilters in defaultConfig.
BithumanException: AUTH_FAILED
Secret missing or invalid. Confirm BITHUMAN_API_SECRET is set before the first
Avatar.load.
First compose tick is slow (~400 ms)
First-run init. Pre-warm with one silent tick at startup for consistent latency from frame one.
composeAsBitmaps returns empty
pcm.size must be ≥ avatar.samplesPerTick (640 samples / tick).
See also
- SDK overview — which SDK to pick
- Audio streaming — the canonical push/drain loop
- Models — Essence vs Expression
- Swift SDK — the Apple counterpart