Arcain Voice Engine

A dual-mode FastAPI service exposing streaming ASR / TTS WebSocket endpoints, with barge-in detection and end-to-end telemetry, deployed on AWS EC2 GPU instances.

Period: Dec 2025 — Present
Role: AI / Voice Engineer
Stack: FastAPI · WebSockets · Silero VAD v5 · F5-TTS · OpenTelemetry · AWS EC2 GPU · Docker

Outcome

p50 ASR latency: 345 ms
VAD chunk size: 32 ms
Barge-in trigger: 60 ms

Context

Arcain builds Arabic conversational AI. A real-time voice agent has a brutal latency budget: every link in the chain — VAD, ASR, LLM, TTS, audio framing — adds tens to hundreds of milliseconds, and the user feels every one of them. The engine had to support two deployment modes from day one: a full pipeline for phone-call traffic, and a platform-only mode that exposes streaming ASR and TTS for external integrations.

Architecture

Dual-mode FastAPI service — one binary, one config flag switches between full-pipeline (telephony → inference → audio) and platform-only (WebSocket ASR / TTS endpoints).
Silero VAD v5 with 32 ms chunks and per-session state for accurate speech onset / offset detection.
Energy-threshold barge-in on top of VAD: RMS gating with a 60 ms trigger suppresses false interrupts while still cutting the agent off fast enough to feel natural.
OpenTelemetry OTLP pipeline from edge to GPU — every audio frame and inference span is observable in production.
Environment-variable model overrides — zero-downtime rollback between checkpoints by flipping a single env var.
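
A minimal sketch of the RMS gate described above, not the production code. It assumes 16-bit little-endian mono PCM at 16 kHz and takes the VAD decision as a plain boolean, so Silero itself is not called here. The names BargeInGate and RMS_THRESHOLD, and the threshold value, are hypothetical.

```python
import math
import struct
from dataclasses import dataclass

SAMPLE_RATE = 16_000    # Hz; assumed to match the 16 kHz client PCM
TRIGGER_MS = 60         # barge-in trigger from the text
RMS_THRESHOLD = 0.02    # hypothetical level on a 0..1 scale; tune per channel


def rms(pcm16: bytes) -> float:
    """RMS level of little-endian 16-bit mono PCM, scaled to 0..1."""
    n = len(pcm16) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n) / 32768.0


@dataclass
class BargeInGate:
    """Per-session gate: fires once loud speech has lasted TRIGGER_MS."""

    loud_ms: float = 0.0

    def update(self, pcm16: bytes, vad_is_speech: bool) -> bool:
        chunk_ms = (len(pcm16) // 2) * 1000 / SAMPLE_RATE
        if vad_is_speech and rms(pcm16) >= RMS_THRESHOLD:
            self.loud_ms += chunk_ms
        else:
            # Any quiet or non-speech chunk resets the run.
            self.loud_ms = 0.0
        return self.loud_ms >= TRIGGER_MS
```

In this sketch, 32 ms chunks (512 samples at 16 kHz) cross the 60 ms trigger on the second consecutive loud speech chunk, and a single noisy chunk never fires on its own.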

What broke (and what I learned)

Latency under sustained load was the hardest problem. Three root causes, none of them obvious:

WebSocket session lifecycle bugs leaking sockets on abnormal disconnect.
Audio framing mismatches between the client's 16 kHz PCM and the decoder's expected window, which silently degraded WER and added a tail-latency spike.
GPU memory leaks under sustained streaming, fixed by tightening tensor lifetimes around the inference span and explicitly evicting the cache.
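
One common way to guard against the framing mismatch is to re-buffer incoming WebSocket payloads into fixed-size frames before they reach the model. The sketch below illustrates that idea and is not the fix that shipped. It assumes 16-bit mono PCM and borrows the 32 ms window from the VAD as a stand-in for the decoder's window, which the text does not specify. Reframer is a hypothetical name.

```python
from typing import List

SAMPLE_RATE = 16_000     # Hz, from the client's 16 kHz PCM
FRAME_MS = 32            # assumed window; the real decoder window may differ
BYTES_PER_SAMPLE = 2     # 16-bit PCM (assumption)
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 1024


class Reframer:
    """Turns arbitrarily sized byte payloads into exact fixed-size frames."""

    def __init__(self, frame_bytes: int = FRAME_BYTES) -> None:
        if frame_bytes <= 0 or frame_bytes % BYTES_PER_SAMPLE:
            raise ValueError("frame_bytes must be a positive whole number of samples")
        self._frame_bytes = frame_bytes
        self._buf = bytearray()

    def push(self, data: bytes) -> List[bytes]:
        """Buffer a payload and return every complete frame now available."""
        self._buf.extend(data)
        frames: List[bytes] = []
        while len(self._buf) >= self._frame_bytes:
            frames.append(bytes(self._buf[: self._frame_bytes]))
            del self._buf[: self._frame_bytes]
        return frames

    def pending(self) -> int:
        """Bytes held back until the next payload completes a frame."""
        return len(self._buf)
```

Keeping one Reframer per session means a client that sends odd-sized packets cannot shift the frame boundary for anyone else, and the leftover bytes are carried into the next payload instead of being dropped or zero-padded.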

Separately, tuning VAD thresholds eliminated barge-in false triggers that were cutting off the agent mid-syllable on noisy phone audio.

Shipping discipline

Every release ran a smoke + integration test suite before AWS staging. The first F5-TTS dialect fine-tune passed on first deploy, and the rollback pattern meant we were never one bad checkpoint away from an outage.
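
The env-var rollback pattern can be sketched roughly as follows. The variable names (VOICE_ASR_CHECKPOINT, VOICE_TTS_CHECKPOINT) and the default paths are hypothetical, chosen only for illustration. The sketch covers only checkpoint resolution at process start and leaves the model loading itself out.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical defaults baked into the image; not the real paths.
DEFAULT_CHECKPOINTS = {
    "asr": "models/asr/current",
    "tts": "models/tts/current",
}


def resolve_checkpoint(kind: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the checkpoint path for `kind`, preferring an env-var override.

    A blank or missing variable falls back to the built-in default, so
    unsetting the variable is itself the rollback.
    """
    env = os.environ if env is None else env
    override = env.get(f"VOICE_{kind.upper()}_CHECKPOINT", "").strip()
    return Path(override or DEFAULT_CHECKPOINTS[kind])
```

Because the path is resolved from the environment, rolling back means changing one variable and restarting instances one at a time, with no image rebuild.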