Arcain Voice Engine
A dual-mode FastAPI service exposing streaming ASR / TTS WebSocket endpoints, with barge-in detection and end-to-end telemetry, deployed on AWS EC2 GPU instances.
- Period: Dec 2025 to Present
- Role: AI / Voice Engineer
- Stack: FastAPI · WebSockets · Silero VAD v5 · F5-TTS · OpenTelemetry · AWS EC2 GPU · Docker
Context
Arcain builds Arabic conversational AI. A real-time voice agent has a brutal latency budget: every link in the chain — VAD, ASR, LLM, TTS, audio framing — adds tens to hundreds of milliseconds, and the user feels every one of them. The engine had to support two deployment modes from day one: a full pipeline for phone-call traffic, and a platform-only mode that exposes streaming ASR and TTS for external integrations.
Architecture
- Dual-mode FastAPI service — one binary, one config flag switches between full-pipeline (telephony → inference → audio) and platform-only (WebSocket ASR / TTS endpoints).
- Silero VAD v5 with 32 ms chunks (512 samples at 16 kHz) and per-session state for accurate speech onset / offset detection.
- Energy-threshold barge-in on top of VAD: RMS gating with a 60 ms trigger eliminates false interrupts while still cutting the agent off fast enough to feel natural.
- OpenTelemetry OTLP pipeline from edge to GPU — every audio frame and inference span is observable in production.
- Environment-variable model overrides — zero-downtime rollback between checkpoints by flipping a single env var.
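The energy-gated barge-in described in the list above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the class name `BargeInDetector`, the `RMS_THRESHOLD` value, and the choice to accumulate chunk durations until the 60 ms trigger is reached are all assumptions. The VAD decision is passed in as a plain boolean, so the sketch does not call Silero itself.

```python
import math
import struct

SAMPLE_RATE = 16_000    # Hz; 16 kHz PCM as mentioned in the text
CHUNK_SAMPLES = 512     # 32 ms at 16 kHz
TRIGGER_MS = 60         # energy must persist this long before interrupting
RMS_THRESHOLD = 500.0   # hypothetical value on the int16 amplitude scale


def rms_int16(chunk: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class BargeInDetector:
    """Fires once speech is flagged AND energy stays above the threshold
    for at least trigger_ms of consecutive audio (assumed semantics)."""

    def __init__(self, threshold: float = RMS_THRESHOLD,
                 trigger_ms: float = TRIGGER_MS,
                 sample_rate: int = SAMPLE_RATE) -> None:
        self.threshold = threshold
        self.trigger_ms = trigger_ms
        self.sample_rate = sample_rate
        self._loud_ms = 0.0  # duration of the current loud-speech run

    def update(self, chunk: bytes, vad_is_speech: bool) -> bool:
        """Feed one chunk; returns True when the agent should be interrupted."""
        chunk_ms = 1000.0 * (len(chunk) // 2) / self.sample_rate
        if vad_is_speech and rms_int16(chunk) >= self.threshold:
            self._loud_ms += chunk_ms
        else:
            # Any quiet or non-speech chunk resets the run.
            self._loud_ms = 0.0
        return self._loud_ms >= self.trigger_ms

    def reset(self) -> None:
        self._loud_ms = 0.0
```

With 32 ms chunks, a 60 ms trigger under these assumptions means two consecutive loud speech chunks are needed, so a single noise burst does not interrupt the agent.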
What broke (and what I learned)
Latency under sustained load was the hardest problem. Three root causes, none of them obvious:
- WebSocket session lifecycle bugs that leaked sockets on abnormal disconnect.
- An audio framing mismatch between the client's 16 kHz PCM packets and the decoder's expected window, which silently degraded WER and added a tail-latency spike.
- GPU memory leaks under sustained streaming. Fixed by tightening tensor lifetimes around the inference span and explicit cache eviction.
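One common way to guard against the kind of framing mismatch listed above is to re-chunk incoming packets into the exact window the decoder expects. The sketch below shows that idea only; the text does not say how the mismatch was actually fixed. The `FrameBuffer` name, the 512-sample window, and the zero-padding on flush are assumptions.

```python
from typing import List, Optional


class FrameBuffer:
    """Re-chunks arbitrary-sized PCM packets into fixed-size windows."""

    def __init__(self, window_samples: int = 512, bytes_per_sample: int = 2) -> None:
        # 512 samples of 16-bit audio = 1024 bytes (assumed decoder window).
        self.window_bytes = window_samples * bytes_per_sample
        self._buf = bytearray()

    def push(self, packet: bytes) -> List[bytes]:
        """Append a packet and return every complete window now available."""
        self._buf.extend(packet)
        frames = []
        while len(self._buf) >= self.window_bytes:
            frames.append(bytes(self._buf[: self.window_bytes]))
            del self._buf[: self.window_bytes]
        return frames

    def flush(self) -> Optional[bytes]:
        """At end of stream, zero-pad any leftover audio to a full window."""
        if not self._buf:
            return None
        pad = self.window_bytes - len(self._buf)
        frame = bytes(self._buf) + b"\x00" * pad
        self._buf.clear()
        return frame
```

For example, if a client sends 20 ms packets (320 samples, 640 bytes), the first `push` returns nothing, the second returns one 1024-byte window, and the remaining 256 bytes wait for the next packet rather than being passed to the decoder as a short frame.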
Separately, tuning VAD thresholds eliminated barge-in false triggers that were cutting off the agent mid-syllable on noisy phone audio.
Shipping discipline
Every release passed a smoke and integration test suite before reaching AWS staging. The first F5-TTS dialect fine-tune passed on first deploy, and even if it had not, the rollback pattern meant we were never one bad checkpoint away from an outage.
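The rollback pattern rests on the environment-variable model overrides listed under Architecture. A minimal sketch of how such an override might be resolved is shown below. The variable naming scheme (`ARCAIN_<KIND>_CHECKPOINT`), the default paths, and the `resolve_checkpoint` helper are all hypothetical; how the running service picks up a changed value is not specified in the text.

```python
import os
from typing import Mapping, Optional

# Hypothetical defaults; the real checkpoint names are not given in the text.
DEFAULT_CHECKPOINTS = {
    "tts": "checkpoints/tts-default.pt",
    "asr": "checkpoints/asr-default.pt",
}


def resolve_checkpoint(kind: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the checkpoint path for a model kind.

    An env var such as ARCAIN_TTS_CHECKPOINT (assumed naming) wins over
    the built-in default; an unset or blank value falls back to the default.
    """
    env = os.environ if env is None else env
    override = env.get(f"ARCAIN_{kind.upper()}_CHECKPOINT", "").strip()
    return override or DEFAULT_CHECKPOINTS[kind]
```

Under this scheme, rolling back a bad fine-tune means pointing the variable at the previous checkpoint (or unsetting it) rather than shipping a new build.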