Arabic ASR Fine-Tune

Fine-tuned Whisper Large-v3 with LoRA across 14 Arabic dialects, then optimized for sub-400 ms inference.

Period: 2025 · Arcain
Role: AI / Voice Engineer
Stack: PyTorch · Hugging Face Transformers · PEFT / LoRA · CTranslate2 · Faster-Whisper · AWS EC2 GPU

Outcome

17.8%

WER (from 30.6%)

41.8%

Relative reduction

345 ms

p50 ASR latency

Context

Arabic is one of the most underserved languages in production ASR. Off-the-shelf Whisper Large-v3 ran at 30.6% WER on dialectal Arabic — usable for transcription, unusable for a real-time voice agent that needs to act on what it hears. The goal was to bring quality to production grade across 14 dialects without giving up inference latency.

Approach

Fine-tuned Whisper Large-v3 with LoRA adapters (rank 32, alpha 64) on a curated corpus of 53,267 training samples and 5,913 evaluation samples spanning Gulf, Levantine, Egyptian, and Maghrebi dialects.

After training, converted the merged model to CTranslate2 and served it through Faster-Whisper. That conversion was the difference between a research artifact and a production speech engine — it cut p50 ASR latency to 345 ms on AWS GPU and kept memory headroom for concurrent sessions.

The bug that almost killed it

During inference the model would loop — repeating a token forever until the decoder hit its max length. After tracing it back through the data collator I found a BOS-token double-prepend bug: the collator was inserting the beginning-of-sequence token a second time on top of what the tokenizer had already added. The model learned to output it during training, then at inference time the decoder saw it and never stopped.

One-line fix in the collator. The lesson — that bug only surfaces when you actually run end-to-end inference under load — went into the team's internal postmortem.

What I'd do differently

Add inference-time generation tests to the training loop, not just loss + WER on held-out audio. Loss can look healthy while the decoder is silently broken.