Comparing data availability by region

Languages across Africa

Explore our catalogue languages by broad region—an orientation map for planning and discovery, not a political boundary.

Legend — catalogue languages

North Africa: Arabic (North Africa) · Berber (Tamazight)
Horn of Africa: Amharic · Oromo · Somali · Tigrinya
West Africa: Fulfulde · Hausa · Igbo · Yoruba
Central Africa: Fang · Kituba · Lingala · Sango · Tshiluba
East Africa: Kinyarwanda · Luganda · Swahili
Southern Africa: Afrikaans · Chichewa · English (South African) · isiNdebele · isiXhosa · isiZulu · seSotho · seTswana · Shona · Tshivenḓa · Xitsonga

Dataset reference

This page brings together three references we use often: pan-African and cross-language programmes, a normalised scorecard for comparing public resources by language, and a table that maps external datasets to our catalogue.

Pan-African initiatives

Hubs, campaigns, and vendors that span languages and, in most cases, countries—not single-language corpora. These cards are a short overview; confirm licence terms, coverage, and delivery timelines with each organisation before you rely on them in production.

Swivuriso (ANV · DSFSI)

ASR
- ZA-African Next Voices track with isiZulu, isiXhosa, seSotho, seTswana, Xitsonga, Tshivenḓa, and isiNdebele
- Scripted and unscripted first-language speech with transcriptions; CC BY 4.0 on Hugging Face
Availability: Free
Hours (indicative): ~3k+ combined (7 South African languages)
Status: Ongoing

African Voices (ANV DSN)

ASR / TTS
- Data Science Nigeria ANV track focused on Nigerian speech resources with transcriptions
- African Voices platform currently publishes Hausa, Igbo, Nigerian Pidgin, and Yoruba datasets
- Public portal headline reports 1.9k+ audio hours, 1.9m+ sentences, and 500+ unique voices
Availability: Free
Hours (indicative): ~1.9k+ audio hours (platform headline)
Status: Ongoing

African Next Voices (ANV KenCorpus)

ASR
- KenCorpus Consortium / Gates-backed Kenya collection; CC BY 4.0
- Per-language Hugging Face repos: Dholuo, Kikuyu, Somali, Kalenjin, Maasai
- Scripted and unscripted speech across agriculture, healthcare, finance, media, and other domains
Availability: Free
Hours (indicative): ~2.5k+ combined (see the Hugging Face dataset cards)
Status: Ongoing (work in progress per the organisation's README)

Masakhane

MT / Text / Some speech (via collaborations)
- 40+ African languages, pan-African
- JW300, FLORES-200, Bible corpora, custom translations
- Strong for low-resource MT/NLP (e.g. Fulfulde, Lingala, Wolof)—not primarily speech
Availability: Free
Status: Ongoing

Mozilla Common Voice (African campaigns)

ASR
- Largest open speech source for many African languages
- Quality and balance vary by locale
Availability: Free
Hours (indicative): Varies (10–1000+ per language)
Status: Ongoing

Meta MMS

ASR / TTS
- Massively Multilingual Speech covers many African languages for ASR and speech synthesis research
- Strong bootstrap option for under-resourced languages; verify language-tag alignment before production use
Availability: Free
Hours (indicative): Large multilingual corpus (per-language hours vary)
Status: Ongoing

Google FLEURS

ASR benchmark / speech evaluation
- Consistent multilingual benchmark useful for cross-language ASR comparison
- Better suited to evaluation and baselines than large-scale production training
Availability: Free
Hours (indicative): ~10 per language (1,000+ training hours in total across 102 languages)
Status: Ongoing

ALFFA

ASR
- Early African ASR benchmarks (e.g. Amharic, Hausa, Swahili, Wolof)
- Still used as research baselines
Availability: Free
Hours (indicative): ~20 per language (where released)
Status: Inactive (archive)

SADiLaR / NCHLT

ASR / Text
- Government-backed South African corpora
- Speech, text, and lexicons
Availability: Free
Hours (indicative): ~50 per language (official SA languages)
Status: Ongoing

Comparing open resources across languages

Each language is scored from 0 (weak signal) to 5 (strong signal) on three dimensions, Speech (S), Text (T), and Ecosystem (E), based on the open and community resources we can see publicly. Values are normalised to the same 0–5 scale so languages can be compared like for like. The headline score weights Speech at 50%, Text at 30%, and Ecosystem at 20%: Score = (S × 0.5) + (T × 0.3) + (E × 0.2). For example, Hausa (5, 4, 5) scores 2.5 + 1.2 + 1.0 = 4.7. Use it as a breadth snapshot, not a judgement on linguistic quality or model performance.

Speech — ASR- and TTS-style data: indicative hours and how practically available collections are.
Text — Parallel corpora, web-scale text, and Masakhane-style NLP footprints.
Ecosystem — Living programmes, commercial support, and research momentum around the language.
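
The weighting described above can be reproduced in a few lines. This is a minimal sketch rather than the scoring code behind this page: the function name `headline_score` and the input validation are assumptions added for illustration, and only the 50/30/20 weights and the 0–5 range come from the text.

```python
def headline_score(speech: int, text: int, ecosystem: int) -> float:
    """Return the weighted headline score for one language.

    Hypothetical helper: the name and the range check are illustrative.
    The weights (Speech 50%, Text 30%, Ecosystem 20%) are the ones stated
    in the scorecard description.
    """
    for name, value in (("speech", speech), ("text", text), ("ecosystem", ecosystem)):
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    # Integer tenths (5, 3, 2) stand in for the 0.5 / 0.3 / 0.2 weights so the
    # sum is computed exactly before the single division by 10.
    return (5 * speech + 3 * text + 2 * ecosystem) / 10


if __name__ == "__main__":
    # Hausa row from the scorecard: Speech 5, Text 4, Ecosystem 5.
    print(headline_score(5, 4, 5))  # prints 4.7
```

Because every weight is a multiple of 0.1 and the inputs are whole numbers, the result always has at most one decimal place, which matches how the Score column is displayed.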

Region	Language	Speech	Text	Ecosystem	Score	Tier
North Africa	Arabic (North Africa)	5	5	5	5.0	High
North Africa	Berber (Tamazight)	1	2	2	1.5	Low
Horn of Africa	Amharic	3	4	4	3.5	Good
Horn of Africa	Oromo	1	2	2	1.5	Low
Horn of Africa	Somali	3	3	3	3.0	Good
Horn of Africa	Tigrinya	1	2	2	1.5	Low
West Africa	Fulfulde	1	2	2	1.5	Low
West Africa	Hausa	5	4	5	4.7	High
West Africa	Igbo	3	3	3	3.0	Good
West Africa	Yoruba	4	4	4	4.0	High
Central Africa	Fang	0	1	1	0.5	Very low
Central Africa	Kituba	0	1	1	0.5	Very low
Central Africa	Lingala	1	2	2	1.5	Low
Central Africa	Sango	1	1	1	1.0	Low
Central Africa	Tshiluba	0	1	1	0.5	Very low
East Africa	Kinyarwanda	2	3	3	2.5	Limited
East Africa	Luganda	2	2	2	2.0	Limited
East Africa	Swahili	5	5	5	5.0	High
Southern Africa	Afrikaans	4	4	3	3.8	Good
Southern Africa	Chichewa	2	2	2	2.0	Limited
Southern Africa	English (South African)	5	5	5	5.0	High
Southern Africa	isiNdebele	1	2	2	1.5	Low
Southern Africa	isiXhosa	3	3	4	3.2	Good
Southern Africa	isiZulu	3	3	4	3.2	Good
Southern Africa	seSotho	2	2	3	2.2	Limited
Southern Africa	seTswana	2	2	3	2.2	Limited
Southern Africa	Shona	2	2	2	2.0	Limited
Southern Africa	Tshivenḓa	1	2	2	1.5	Low
Southern Africa	Xitsonga	1	2	2	1.5	Low
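
The page does not state how scores map to tiers. The sketch below uses cut-offs inferred from the rows above (for example, 4.0 is High, 3.8 is Good, 2.5 is Limited, 1.0 is Low, 0.5 is Very low); treat `TIER_FLOORS` and `tier_for` as hypothetical names and the thresholds as an assumption, not a published rule.

```python
# Assumed lower bounds per tier, inferred from the scorecard rows.
# The page itself does not publish these thresholds.
TIER_FLOORS = [
    (4.0, "High"),
    (3.0, "Good"),
    (2.0, "Limited"),
    (1.0, "Low"),
]


def tier_for(score: float) -> str:
    """Return the tier label whose assumed floor the score reaches."""
    for floor, label in TIER_FLOORS:
        if score >= floor:
            return label
    # Anything below the lowest floor falls into the bottom tier.
    return "Very low"


if __name__ == "__main__":
    # A few (score, tier) pairs copied from the table as a consistency check.
    for score, expected in [(4.7, "High"), (3.8, "Good"), (2.2, "Limited"),
                            (1.5, "Low"), (0.5, "Very low")]:
        assert tier_for(score) == expected
```

If the real boundaries differ (for instance, if High starts somewhere between 3.8 and 4.0), only the floor values need to change.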

Dataset distribution across Africa

Catalogue languages with Way With Words entries alongside linked external datasets.

Language	Dataset	Type	Hours	Availability	Link
Afrikaans	Way With Words · waywithwords/www-za-afr-cx	ASR · conversational	50	Paid	View
	Mozilla Common Voice (Afrikaans)	ASR	500+	Free	Open
	NCHLT Afrikaans	ASR	50	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Amharic	Mozilla Common Voice (Amharic)	ASR	100	Free	Open
	ALFFA Dataset	ASR	20	Free	Open
	Appen Amharic	ASR	100–500	Commercial	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Arabic (North Africa)	Mozilla Common Voice (Arabic)	ASR	2000+	Free	Open
	MGB Arabic Corpus	ASR	1200	Restricted	Open
	Appen Arabic Speech Datasets	ASR/TTS	100–1000+	Commercial	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Berber (Tamazight)	Mozilla Common Voice (Kabyle/Tamazight)	ASR	20–80	Free	Open
Berber (Tamazight)	IRCAM Berber Corpora	Speech/Text	<100	Restricted	Open
Chichewa	Mozilla Common Voice	ASR	50–150	Free	Open
Chichewa	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Dholuo	ANV (Kenya)	ASR	723	Free	Hugging Face
English (South African)	Way With Words · waywithwords/www-za-eng-cx	ASR · conversational	50	Paid	View
English (South African)	NCHLT / Lwazi Corpus	ASR	200+	Free	Open
Fang	No major public datasets (reference notes)	—	—	Planned / roadmap	—
Fulfulde	Masakhane Corpora	MT/Text	—	Free · Ongoing	Open
Hausa	ANV (AfricanVoices)	ASR	733	Free	African Voices download
	Mozilla Common Voice (Hausa)	ASR	300–500	Free	Open
	ALFFA Hausa	ASR	20	Free	Open
	Appen Hausa	ASR/TTS	500+	Commercial	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Igbo	ANV (AfricanVoices)	ASR	383	Free	African Voices download
	Mozilla Common Voice (Igbo)	ASR	80–150	Free	Open
	Igbo Speech Dataset	ASR	100	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
isiNdebele	ANV (Swivuriso)	ASR	251.9	Free	Hugging Face
	NCHLT Corpora	ASR	~50	Free	Open
	Mozilla Common Voice (various)	ASR	10–150	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
isiXhosa	ANV (Swivuriso)	ASR	504.3	Free	Hugging Face
	NCHLT isiXhosa	ASR	50	Free	Open
	Mozilla Common Voice (isiXhosa)	ASR	150–300	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
isiZulu	Way With Words · waywithwords/www-za-zul-cx	ASR · conversational	50	Paid	View
	ANV (Swivuriso)	ASR	502.9	Free	Hugging Face
	NCHLT isiZulu	ASR	50	Free	Open
	Mozilla Common Voice (isiZulu)	ASR	200–400	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Kalenjin	ANV (Kenya)	ASR	521	Free	Hugging Face
Kikuyu	ANV (Kenya)	ASR	754	Free	Hugging Face
Kinyarwanda	Mozilla Common Voice (Kinyarwanda)	ASR	50–120	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Kituba	No major public datasets (reference notes)	—	—	Planned / roadmap	—
Lingala	Mozilla Common Voice (Lingala)	ASR	10–30	Free	Open
Lingala	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Luganda	Mozilla Common Voice (Luganda)	ASR	40–100	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Maasai	ANV (Kenya)	ASR	505	Free	Hugging Face
Oromo	Mozilla Common Voice (Oromo)	ASR	10–30	Free	Open
Oromo	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Sango	Bible.is Audio Corpus	Speech/Text	<20	Free	Open
seSotho	Way With Words · waywithwords/www-za-sot-cx	ASR · conversational	50	Paid	View
	ANV (Swivuriso)	ASR	503.6	Free	Hugging Face
	NCHLT Corpora	ASR	~50	Free	Open
	Mozilla Common Voice (various)	ASR	10–150	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
seTswana	ANV (Swivuriso)	ASR	502.2	Free	Hugging Face
	NCHLT Corpora	ASR	~50	Free	Open
	Mozilla Common Voice (various)	ASR	10–150	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Shona	Mozilla Common Voice	ASR	50–150	Free	Open
Shona	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Somali	ANV (Kenya)	ASR	502	Free	Hugging Face
	Mozilla Common Voice (Somali)	ASR	50–150	Free	Open
	Appen Somali	ASR	100+	Commercial	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Swahili	Mozilla Common Voice (Swahili)	ASR	1000+	Free	Open
	ALFFA Swahili	ASR	20	Free	Open
	OpenSLR Swahili datasets	ASR	100–300	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
Tigrinya	Mozilla Common Voice (Tigrinya)	ASR	10–20	Free	Open
Tigrinya	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Tshiluba	No major public datasets (reference notes)	—	—	Planned / roadmap	—
Tshivenḓa	ANV (Swivuriso)	ASR	250.9	Free	Hugging Face
	NCHLT Corpora	ASR	~50	Free	Open
	Mozilla Common Voice (various)	ASR	10–150	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Xitsonga	ANV (Swivuriso)	ASR	500.1	Free	Hugging Face
	NCHLT Corpora	ASR	~50	Free	Open
	Mozilla Common Voice (various)	ASR	10–150	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
Yoruba	ANV (AfricanVoices)	ASR	361	Free	African Voices download
	Mozilla Common Voice (Yoruba)	ASR	500+	Free	Open
	YorubaSpeech	ASR/TTS	100	Free	Open
	ALFFA Yoruba (via Hausa baseline extension)	ASR	~20	Free	Open
	Meta MMS	ASR/TTS	—	Free · Ongoing	Open
	Google FLEURS	ASR benchmark	~10	Free · Ongoing	Open
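
In the table above, many continuation rows leave the Language cell blank and inherit the language from the row before them. If you copy the table out as tab-separated text, a forward fill restores the language on every row. The sketch below assumes that export format; `fill_language` and `SAMPLE` are hypothetical names, and the sample rows are copied from the table.

```python
import csv
import io

# A few rows copied from the table, tab-separated, with blank Language cells
# on continuation rows.
SAMPLE = (
    "Language\tDataset\tType\tHours\tAvailability\tLink\n"
    "Afrikaans\tWay With Words · waywithwords/www-za-afr-cx\tASR · conversational\t50\tPaid\tView\n"
    "\tNCHLT Afrikaans\tASR\t50\tFree\tOpen\n"
    "Amharic\tALFFA Dataset\tASR\t20\tFree\tOpen\n"
    "\tMeta MMS\tASR/TTS\t—\tFree · Ongoing\tOpen\n"
)


def fill_language(tsv_text: str) -> list[dict[str, str]]:
    """Parse tab-separated rows and carry the last seen Language forward."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    rows = []
    current = ""
    for row in reader:
        language = row["Language"].strip()
        if language:
            current = language  # a new language group starts here
        row["Language"] = current  # blank cells inherit the previous language
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in fill_language(SAMPLE):
        print(row["Language"], "|", row["Dataset"], "|", row["Hours"])
```

The Hours column is left as text on purpose: values such as "500+", "100–500", "~10", and "—" are indicative and do not convert cleanly to a single number.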

Community vote

Voters and partners shape what we build next. Vote for the proposal that fits your languages and use case—or suggest your own; we review every submission.
