Pan-African initiatives
Hubs, campaigns, and vendors that span several languages and, in most cases, several countries, rather than single-language corpora. These cards give a short overview; confirm licence terms, coverage, and delivery timelines with each organisation before relying on them in production.
-
Swivuriso (ANV · DSFSI)
ASR
- ZA-African Next Voices track with isiZulu, isiXhosa, Sesotho, Setswana, Xitsonga, Tshivenda, and isiNdebele
- Scripted and unscripted first-language speech with transcriptions; CC BY 4.0 on Hugging Face
- Availability
- Free
- Hours (indicative)
- ~3k+ combined (7 South African languages)
- Status
- Ongoing
-
African Voices (ANV · DSN)
ASR / TTS
- Data Science Nigeria (DSN) ANV track focused on Nigerian speech resources with transcriptions
- African Voices platform currently publishes Hausa, Igbo, Nigerian Pidgin, and Yoruba datasets
- Public portal headline reports 1.9k+ audio hours, 1.9M+ sentences, and 500+ unique voices
- Availability
- Free
- Hours (indicative)
- ~1.9k+ audio hours (platform headline)
- Status
- Ongoing
-
African Next Voices (ANV · KenCorpus)
ASR
- Kenya collection by the KenCorpus Consortium, backed by the Gates Foundation; CC BY 4.0
- Per-language Hugging Face repos: Dholuo, Kikuyu, Somali, Kalenjin, Maasai
- Scripted and unscripted speech across agriculture, healthcare, finance, media, and other domains
- Availability
- Free
- Hours (indicative)
- ~2.5k+ combined (see Hugging Face cards for per-language figures)
- Status
- Ongoing (WIP per org README)
-
Masakhane
MT / Text / Some speech (via collaborations)
- 40+ African languages, pan-African
- Works with sources such as JW300, FLORES-200, Bible corpora, and custom translations
- Strong for low-resource MT/NLP (e.g. Fulfulde, Lingala, Wolof); not primarily a speech resource
- Availability
- Free
- Status
- Ongoing
-
Mozilla Common Voice (African campaigns)
ASR
- Largest open speech source for many African languages
- Quality and balance vary by locale
- Availability
- Free
- Hours (indicative)
- Varies (10–1000+ per language)
- Status
- Ongoing
-
Meta MMS
ASR / TTS
- Massively Multilingual Speech covers many African languages for ASR and speech synthesis research
- Strong bootstrap option for under-resourced languages; verify language-tag alignment before production use
- Availability
- Free
- Hours (indicative)
- Large multilingual corpus (per-language hours vary)
- Status
- Ongoing
-
Google FLEURS
ASR benchmark / speech evaluation
- Consistent multilingual benchmark useful for cross-language ASR comparison
- Better suited to evaluation and baselines than large-scale production training
- Availability
- Free
- Hours (indicative)
- ~10 per language (1,000+ training hours in total across 102 languages)
- Status
- Ongoing
-
ALFFA
ASR
- Early African ASR benchmarks (e.g. Amharic, Hausa, Swahili, Wolof)
- Still used as research baselines
- Availability
- Free
- Hours (indicative)
- ~20 per language (where released)
- Status
- Inactive (archive)
-
SADiLaR / NCHLT
ASR / Text
- Government-backed South African corpora
- Speech, text, and lexicons
- Availability
- Free
- Hours (indicative)
- ~50 per language (official SA languages)
- Status
- Ongoing