Ahmed Hany / Work / whisper-arabic-dialects
whisper-arabic-dialects
Solo research project · A reproducible Arabic ASR benchmark and a fine-tuned production model
- Dialects
- 4 fine-tuned
- Backends
- 4 compared
- Benchmark cells
- 100+
- HF artifacts
- 3 public repos
[ Context ]
Arabic is the world's fifth most-spoken language but among the worst-served by open speech-to-text. OpenAI's Whisper, the de-facto open ASR baseline, looks fine on Modern Standard Arabic — about 10% WER zero-shot — but collapses on the dialects people actually speak: 65% WER on Egyptian, 60% on Gulf, and a brutal 85% WER on Maghrebi (Moroccan/Algerian). That gap means production Arabic ASR built on stock Whisper effectively serves only broadcasters and news readers; it doesn't serve a phone-call from Cairo or a voice message from Casablanca.
I started whisper-arabic-dialects as a research project to do two things: build a reproducible cross-backend Arabic ASR benchmark, and ship a fine-tuned production model that closes the dialect gap. The end-state is an arXiv submission plus three public HuggingFace artifacts.
[ The brief ]
Most public Arabic Whisper fine-tunes hit a known failure mode: train on one narrow source (usually Common Voice), report a low WER on that source's test split, and quietly destroy the model's multi-dialect zero-shot capability in the process. I checked the most popular such release on the same test sets I'd be publishing — it was 15-25 pp worse than the unmodified base model on every dialect, including the source dialect itself. The reported "31% WER" was an artifact of an English-style normalizer that over-collapses Arabic forms on both sides of the ratio.
So the brief became: don't let the project look good on a single number that doesn't survive scrutiny. That meant a deterministic Arabic-aware normalizer, dialect-balanced training data from four independent sources, and reporting WER on held-out test sets that the training mix never touched.
[ Architecture ]
- Training: QLoRA on a single L4 24GB GPU (rented on Google Cloud). NF4 4-bit base, double-quant, bf16 compute, LoRA r=32 / α=64 on q/k/v/out_proj + fc1/fc2. 17,246 train rows balanced across MSA / Egyptian / Levantine / Gulf, 3 epochs, ~7 hours wall time, ~$7 in GPU rental.
- Data sources: Common Voice 18 (MSA), MGB-3 (Egyptian broadcast TV), MASC (Levantine broadcast), and Casablanca splits. Maghrebi was deliberately scoped out — Whisper has insufficient Moroccan/Algerian Arabic in pretraining for QLoRA to recover within a reasonable budget — and the paper documents why.
- Cross-backend benchmark: a single evaluation harness drives four production-grade Whisper inference engines — faster-whisper / CTranslate2, HuggingFace transformers, openai-whisper, and whisper.cpp — across six Whisper model sizes (tiny → large-v3). Every cell logs WER, RTF, TTFT, peak RAM, and a 95% bootstrap confidence interval to a single JSONL file, with hardware fingerprint and git commit baked in.
- Multi-platform CPU: benchmarks run on GCP c3-standard-8 (Intel Sapphire Rapids) and Hetzner cx23 (AMD x86, $0.008/hr) and cax11 (Ampere ARM, $0.009/hr) — three distinct microarchitectures. The paper publishes cost-per-audio-minute tables so a reader can pick the right CPU for their accuracy budget.
- Three public artifacts: on every release we publish (a) the LoRA adapter (~111 MB) for downstream fine-tuning, (b) the merged HF model for direct inference, and (c) a CTranslate2 production model for sub-real-time CPU deployment.
[ What I owned ]
- The full research pipeline end-to-end — dataset preparation, training recipe, QLoRA config, generation patches for Whisper + PEFT compatibility, evaluation harness, statistical methodology (bootstrap CI + Wilcoxon).
- The deterministic Arabic normalizer (alef forms, ya/alef-maqsura, ta-marbuta, tashkeel, tatweel) baked into every WER number with a version field, so reviewers can reproduce.
- Cross-backend integration — running the same audio through four different inference engines and reconciling their outputs through one normalizer.
- Cloud orchestration: provisioning, sharding the benchmark matrix across three GCP and two Hetzner instances, syncing results, and tearing everything down when done.
- The arXiv-bound paper draft, including the negative findings — the single-source fine-tuning failure mode and the int8-quantization sensitivity of LoRA-merged Whisper weights.
[ Results ]
- A 4-dialect fine-tuned Whisper-large-v3-turbo, validated on held-out test sets the training mix never saw. Best checkpoint at validation WER 33% on the dialect-balanced mix, down from ~44% zero-shot.
- Three public HuggingFace repositories — LoRA adapter, merged HF model, and CTranslate2 production model — each with a real model card, training recipe, normalizer disclosure, and per-dialect WER table.
- 100+ benchmark cells across 4 inference backends × 6 model sizes × 5 dialects × multiple quantizations × multiple beams × 3 hardware platforms, all in a single version-controlled JSONL — anyone can rerun and verify.
- Two paper-worthy negative findings: (1) naive single-source fine-tuning is net-harmful — it regresses every dialect including the source by 15-25 pp — and (2) Whisper's default English-style normalizer systematically understates Arabic WER, which breaks cross-paper comparison unless normalizer version is reported explicitly.
- A cost-per-audio-minute recommendation table: on Hetzner's $0.008/hr cx23, the production fine-tuned turbo runs at roughly $0.0001 per minute of audio transcribed — about 17× cheaper than the equivalent GCP setup at comparable WER.
[ Stack ]
PyTorch, HuggingFace transformers, PEFT, bitsandbytes, faster-whisper, CTranslate2, whisper.cpp / pywhispercpp, openai-whisper, jiwer, librosa. Infrastructure on Google Cloud (Sapphire Rapids c3 + L4 GPU) and Hetzner Cloud (cx23 AMD x86 + cax11 ARM Ampere). Reproducibility tooling: a single deterministic JSONL log, version-pinned environments, and runbooks for every cloud command in deploy/.