Arabic ASR Fine-Tune
Fine-tuned Whisper Large-v3 with LoRA across 14 Arabic dialects, then optimized for sub-400 ms inference.
- Period
- 2025 · Arcain
- Role
- AI / Voice Engineer
- Stack
- PyTorch · Hugging Face Transformers · PEFT / LoRA · CTranslate2 · Faster-Whisper · AWS EC2 GPU
Outcome
Context
Arabic is one of the most underserved languages in production ASR. Off-the-shelf Whisper Large-v3 ran at 30.6% WER on dialectal Arabic — usable for transcription, unusable for a real-time voice agent that needs to act on what it hears. The goal was to bring quality to production grade across 14 dialects without giving up inference latency.
Approach
Fine-tuned Whisper Large-v3 with LoRA adapters (rank 32, alpha 64) on a curated corpus of 53,267 training samples and 5,913 evaluation samples spanning Gulf, Levantine, Egyptian, and Maghrebi dialects.
After training, converted the merged model to CTranslate2 and served it through Faster-Whisper. That conversion was the difference between a research artifact and a production speech engine — it cut p50 ASR latency to 345 ms on AWS GPU and kept memory headroom for concurrent sessions.
The bug that almost killed it
During inference the model would loop — repeating a token forever until the decoder hit its max length. After tracing it back through the data collator I found a BOS-token double-prepend bug: the collator was inserting the beginning-of-sequence token a second time on top of what the tokenizer had already added. The model learned to output it during training, then at inference time the decoder saw it and never stopped.
One-line fix in the collator. The lesson — that bug only surfaces when you actually run end-to-end inference under load — went into the team's internal postmortem.
What I'd do differently
Add inference-time generation tests to the training loop, not just loss + WER on held-out audio. Loss can look healthy while the decoder is silently broken.