Ahmed Hany / Work / whisper-arabic-dialects

whisper-arabic-dialects

Solo research project · A reproducible Arabic ASR benchmark and a fine-tuned production model

Dialects: 4 fine-tuned
Backends: 4 compared
Benchmark cells: 100+
HF artifacts: 3 public repos

[ Context ]

Arabic is the world's fifth most-spoken language but among the worst served by open speech-to-text. OpenAI's Whisper, the de facto open ASR baseline, looks fine on Modern Standard Arabic (about 10% WER zero-shot) but collapses on the dialects people actually speak: 65% WER on Egyptian, 60% on Gulf, and a brutal 85% on Maghrebi (Moroccan/Algerian). That gap means production Arabic ASR built on stock Whisper effectively serves only broadcasters and news readers; it doesn't serve a phone call from Cairo or a voice message from Casablanca.

I started whisper-arabic-dialects as a research project to do two things: build a reproducible cross-backend Arabic ASR benchmark, and ship a fine-tuned production model that closes the dialect gap. The end state is an arXiv submission plus three public HuggingFace artifacts.

[ The brief ]

Most public Arabic Whisper fine-tunes hit a known failure mode: train on one narrow source (usually Common Voice), report a low WER on that source's test split, and quietly destroy the model's multi-dialect zero-shot capability in the process. I checked the most popular such release on the same test sets I'd be publishing: it was 15-25 percentage points worse than the unmodified base model on every dialect, including the source dialect itself. The reported "31% WER" was an artifact of an English-style normalizer that over-collapses Arabic forms in both the reference and the hypothesis before they are compared.

So the brief became: don't let the project look good on a single number that doesn't survive scrutiny. That meant a deterministic Arabic-aware normalizer, dialect-balanced training data from four independent sources, and reporting WER on held-out test sets that the training mix never touched.
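To make the normalizer point concrete, here is a minimal sketch of what a deterministic Arabic-aware normalizer and a word-level WER could look like. It is not the project's actual normalizer: the function names and every rule in it (dropping diacritics and tatweel, folding hamza-carrying alef variants to bare alef, replacing punctuation with spaces) are illustrative assumptions, and the WER is a plain edit-distance implementation rather than a call into jiwer.

```python
import re
import unicodedata

# Assumption: these rules are illustrative, not the project's real rule set.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
_NON_WORD = re.compile(r"[^\w\s]")                   # punctuation, including Arabic comma and question mark


def normalize_arabic(text: str) -> str:
    """Deterministic normalization applied identically to reference and hypothesis."""
    text = unicodedata.normalize("NFKC", text)       # fold presentation forms, compose sequences
    # Drop non-spacing marks (harakat, tanween, shadda, sukun) before the
    # punctuation pass, so that a mark never splits a word in two.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = text.replace("\u0640", "")                # tatweel (elongation character)
    text = _ALEF_VARIANTS.sub("\u0627", text)        # fold alef variants to bare alef
    text = _NON_WORD.sub(" ", text)                  # punctuation becomes whitespace
    return " ".join(text.split())                    # collapse runs of whitespace


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = normalize_arabic(reference).split()
    hyp = normalize_arabic(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    prev = list(range(len(hyp) + 1))                 # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + cost))     # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Every rule added to the normalizer makes more reference/hypothesis pairs compare as equal, so the reported WER can only be read alongside the exact normalizer that produced it. That is why the rule set needs to be fixed, deterministic, and published with the numbers.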

[ Architecture ]

[ What I owned ]

[ Results ]

[ Stack ]

PyTorch, HuggingFace transformers, PEFT, bitsandbytes, faster-whisper, CTranslate2, whisper.cpp / pywhispercpp, openai-whisper, jiwer, librosa. Infrastructure on Google Cloud (Sapphire Rapids c3 + L4 GPU) and Hetzner Cloud (cx23 AMD x86 + cax11 ARM Ampere). Reproducibility tooling: a single deterministic JSONL log, version-pinned environments, and runbooks for every cloud command in deploy/.
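As a rough illustration of what "a single deterministic JSONL log" can mean in practice, the sketch below appends one record per benchmark cell with a fixed key order and fixed separators, so the same record always serializes to the same bytes. The helper names and the record fields (`backend`, `dialect`, `wer`) are hypothetical and not taken from the project.

```python
import json
from pathlib import Path


def append_record(path, record: dict) -> str:
    """Append one benchmark result as a single canonical JSON line."""
    line = json.dumps(
        record,
        sort_keys=True,          # key order does not depend on dict insertion order
        ensure_ascii=False,      # keep Arabic text readable in the log
        separators=(",", ":"),   # fixed separators, no incidental whitespace
    )
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line


def read_records(path) -> list:
    """Load every record back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With canonical serialization, two runs that produce the same results produce byte-identical log lines, which makes diffs between runs meaningful.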

