Comparing data availability by region

Languages across Africa

Explore our catalogue languages by broad region—an orientation map for planning and discovery, not a political boundary.

Legend — catalogue languages

  • North Africa

    Arabic (North Africa) · Berber (Tamazight)

  • Horn of Africa

    Amharic · Oromo · Somali · Tigrinya

  • West Africa

    Fulfulde · Hausa · Igbo · Yoruba

  • Central Africa

    Fang · Kituba · Lingala · Sango · Tshiluba

  • East Africa

    Kinyarwanda · Luganda · Swahili

  • Southern Africa

    Afrikaans · Chichewa · English (South African) · isiNdebele · isiXhosa · isiZulu · seSotho · seTswana · Shona · Tshivenḓa · Xitsonga

Dataset reference

Pan-African initiatives

Hubs, campaigns, and vendors that span languages and, in most cases, countries—not single-language corpora. These cards are a short overview; confirm licence terms, coverage, and delivery timelines with each organisation before you rely on them in production.

  • Swivuriso (ANV · DSFSI)

    ASR

    • ZA-African Next Voices track with isiZulu, isiXhosa, seSotho, seTswana, Xitsonga, Tshivenḓa, and isiNdebele
    • Scripted and unscripted first-language speech with transcriptions; CC BY 4.0 on Hugging Face
    Availability: Free
    Hours (indicative): ~3k+ combined (7 South African languages)
    Status: Ongoing
  • African Voices (ANV DSN)

    ASR / TTS

    • Data Science Nigeria ANV track focused on Nigerian speech resources with transcriptions
    • African Voices platform currently publishes Hausa, Igbo, Nigerian Pidgin, and Yoruba datasets
    • Public portal headline reports 1.9k+ audio hours, 1.9m+ sentences, and 500+ unique voices
    Availability: Free
    Hours (indicative): ~1.9k+ audio hours (platform headline)
    Status: Ongoing
  • African Next Voices (ANV KenCorpus)

    ASR

    • KenCorpus Consortium / Gates-backed Kenya collection; CC BY 4.0
    • Per-language Hugging Face repos: Dholuo, Kikuyu, Somali, Kalenjin, Maasai
    • Scripted and unscripted speech across agriculture, healthcare, finance, media, and other domains
    Availability: Free
    Hours (indicative): ~2.5k+ combined (see HF cards)
    Status: Ongoing (WIP per org README)
  • Masakhane

    MT / Text / Some speech (via collaborations)

    • 40+ African languages, pan-African
    • JW300, FLORES-200, Bible corpora, custom translations
    • Strong for low-resource MT/NLP (e.g. Fulfulde, Lingala, Wolof)—not primarily speech
    Availability: Free
    Status: Ongoing
  • Mozilla Common Voice (African campaigns)

    ASR

    • Largest open speech source for many African languages
    • Quality and balance vary by locale
    Availability: Free
    Hours (indicative): Varies (10–1000+ per language)
    Status: Ongoing
  • Meta MMS

    ASR / TTS

    • Massively Multilingual Speech covers many African languages for ASR and speech synthesis research
    • Strong bootstrap option for under-resourced languages; verify language-tag alignment before production use
    Availability: Free
    Hours (indicative): Large multilingual corpus (per-language hours vary)
    Status: Ongoing
  • Google FLEURS

    ASR benchmark / speech evaluation

    • Consistent multilingual benchmark useful for cross-language ASR comparison
    • Better suited to evaluation and baselines than large-scale production training
    Availability: Free
    Hours (indicative): ~10 per language (~1000+ total in the train split across 102 languages)
    Status: Ongoing
  • ALFFA

    ASR

    • Early African ASR benchmarks (e.g. Amharic, Hausa, Swahili, Wolof)
    • Still used as research baselines
    Availability: Free
    Hours (indicative): ~20 per language (where released)
    Status: Inactive (archive)
  • SADiLaR / NCHLT

    ASR / Text

    • Government-backed South African corpora
    • Speech, text, and lexicons
    Availability: Free
    Hours (indicative): ~50 per language (official SA languages)
    Status: Ongoing

Comparing open resources across languages

Each language is scored from 0 (weak signal) to 5 (strong signal) on three dimensions—Speech, Text, and Ecosystem—reflecting open and community resources we can see publicly. Values are normalised so you can compare languages fairly. The headline number weights Speech at 50%, Text at 30%, and Ecosystem at 20%: (S × 0.5) + (T × 0.3) + (E × 0.2). Use it as a breadth snapshot, not a judgement on linguistic quality or model performance.

  • Speech — ASR- and TTS-style data: indicative hours and how practically available collections are.
  • Text — Parallel corpora, web-scale text, and Masakhane-style NLP footprints.
  • Ecosystem — Living programmes, commercial support, and research momentum around the language.
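
The headline formula can be sketched in a few lines of Python. This is an illustrative sketch, not the catalogue's actual scoring code: the helper name `headline_score` and the choice to hold the weights as integer tenths are assumptions introduced here so that the arithmetic stays exact.

```python
# Illustrative sketch of the headline score. `headline_score` is a
# hypothetical helper, not part of any published tooling.

# Weights from the text, expressed in tenths: Speech 50%, Text 30%, Ecosystem 20%.
WEIGHTS_TENTHS = {"speech": 5, "text": 3, "ecosystem": 2}


def headline_score(speech: int, text: int, ecosystem: int) -> float:
    """Return (S x 0.5) + (T x 0.3) + (E x 0.2) for scores in the range 0 to 5."""
    scores = {"speech": speech, "text": text, "ecosystem": ecosystem}
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name} score must be between 0 and 5, got {value}")
    # Sum in integer tenths, then divide once, so no float error accumulates.
    total_tenths = sum(WEIGHTS_TENTHS[name] * value for name, value in scores.items())
    return total_tenths / 10


if __name__ == "__main__":
    # Dimension scores taken from the comparison table.
    for language, s, t, e in [("Hausa", 5, 4, 5), ("Afrikaans", 4, 4, 3), ("Fang", 0, 1, 1)]:
        print(f"{language}: {headline_score(s, t, e):.1f}")
```

Run as a script, this prints 4.7 for Hausa, 3.8 for Afrikaans, and 0.5 for Fang, matching the Score column for those rows.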

Region | Language | Speech | Text | Ecosystem | Score | Tier
North Africa | Arabic (North Africa) | 5 | 5 | 5 | 5.0 | High
North Africa | Berber (Tamazight) | 1 | 2 | 2 | 1.5 | Low
Horn of Africa | Amharic | 3 | 4 | 4 | 3.5 | Good
Horn of Africa | Oromo | 1 | 2 | 2 | 1.5 | Low
Horn of Africa | Somali | 3 | 3 | 3 | 3.0 | Good
Horn of Africa | Tigrinya | 1 | 2 | 2 | 1.5 | Low
West Africa | Fulfulde | 1 | 2 | 2 | 1.5 | Low
West Africa | Hausa | 5 | 4 | 5 | 4.7 | High
West Africa | Igbo | 3 | 3 | 3 | 3.0 | Good
West Africa | Yoruba | 4 | 4 | 4 | 4.0 | High
Central Africa | Fang | 0 | 1 | 1 | 0.5 | Very low
Central Africa | Kituba | 0 | 1 | 1 | 0.5 | Very low
Central Africa | Lingala | 1 | 2 | 2 | 1.5 | Low
Central Africa | Sango | 1 | 1 | 1 | 1.0 | Low
Central Africa | Tshiluba | 0 | 1 | 1 | 0.5 | Very low
East Africa | Kinyarwanda | 2 | 3 | 3 | 2.5 | Limited
East Africa | Luganda | 2 | 2 | 2 | 2.0 | Limited
East Africa | Swahili | 5 | 5 | 5 | 5.0 | High
Southern Africa | Afrikaans | 4 | 4 | 3 | 3.8 | Good
Southern Africa | Chichewa | 2 | 2 | 2 | 2.0 | Limited
Southern Africa | English (South African) | 5 | 5 | 5 | 5.0 | High
Southern Africa | isiNdebele | 1 | 2 | 2 | 1.5 | Low
Southern Africa | isiXhosa | 3 | 3 | 4 | 3.2 | Good
Southern Africa | isiZulu | 3 | 3 | 4 | 3.2 | Good
Southern Africa | seSotho | 2 | 2 | 3 | 2.2 | Limited
Southern Africa | seTswana | 2 | 2 | 3 | 2.2 | Limited
Southern Africa | Shona | 2 | 2 | 2 | 2.0 | Limited
Southern Africa | Tshivenḓa | 1 | 2 | 2 | 1.5 | Low
Southern Africa | Xitsonga | 1 | 2 | 2 | 1.5 | Low
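
The Tier column appears to follow whole-number bands on the headline score. The sketch below encodes one set of cut-offs that reproduces every row of the table. These cut-offs are an assumption inferred from the rows, not thresholds published with the catalogue, and `tier_for` is a hypothetical helper name.

```python
# ASSUMPTION: tier floors inferred from the comparison table rows; the
# source does not state the actual thresholds.
TIER_FLOORS = [
    (4.0, "High"),
    (3.0, "Good"),
    (2.0, "Limited"),
    (1.0, "Low"),
]


def tier_for(score: float) -> str:
    """Map a headline score (0.0 to 5.0) to a tier label under the assumed bands."""
    for floor, label in TIER_FLOORS:
        if score >= floor:
            return label
    # Anything below the lowest floor falls into the bottom tier.
    return "Very low"


if __name__ == "__main__":
    # (language, score, tier) triples copied from the comparison table.
    sample_rows = [
        ("Yoruba", 4.0, "High"),
        ("Afrikaans", 3.8, "Good"),
        ("Kinyarwanda", 2.5, "Limited"),
        ("Sango", 1.0, "Low"),
        ("Fang", 0.5, "Very low"),
    ]
    for language, score, listed_tier in sample_rows:
        status = "matches" if tier_for(score) == listed_tier else "differs from"
        print(f"{language}: {tier_for(score)} ({status} the table)")
```

Other cut-offs would fit the same rows (for example, any High floor above 3.8 and up to 4.0), so treat the bands as a reading aid rather than a definition.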

Dataset distribution across Africa

Catalogue languages with Way With Words entries alongside linked external datasets.

Language | Dataset | Type | Hours | Availability | Link
Afrikaans | Way With Words · waywithwords/www-za-afr-cx | ASR · conversational | 50 | Paid | View
Afrikaans | Mozilla Common Voice (Afrikaans) | ASR | 500+ | Free | Open
Afrikaans | NCHLT Afrikaans | ASR | 50 | Free | Open
Afrikaans | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Afrikaans | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Amharic | Mozilla Common Voice (Amharic) | ASR | 100 | Free | Open
Amharic | ALFFA Dataset | ASR | 20 | Free | Open
Amharic | Appen Amharic | ASR | 100–500 | Commercial | Open
Amharic | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Amharic | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Arabic (North Africa) | Mozilla Common Voice (Arabic) | ASR | 2000+ | Free | Open
Arabic (North Africa) | MGB Arabic Corpus | ASR | 1200 | Restricted | Open
Arabic (North Africa) | Appen Arabic Speech Datasets | ASR/TTS | 100–1000+ | Commercial | Open
Arabic (North Africa) | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Arabic (North Africa) | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Berber (Tamazight) | Mozilla Common Voice (Kabyle/Tamazight) | ASR | 20–80 | Free | Open
Berber (Tamazight) | IRCAM Berber Corpora | Speech/Text | <100 | Restricted | Open
Chichewa | Mozilla Common Voice | ASR | 50–150 | Free | Open
Chichewa | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Dholuo | ANV (Kenya) | ASR | 723 | Free | Hugging Face
English (South African) | Way With Words · waywithwords/www-za-eng-cx | ASR · conversational | 50 | Paid | View
English (South African) | NCHLT / Lwazi Corpus | ASR | 200+ | Free | Open
Fang | No major public datasets (reference notes) | - | - | Planned / roadmap | -
Fulfulde | Masakhane Corpora | MT/Text | - | Free · Ongoing | Open
Hausa | ANV (AfricanVoices) | ASR | 733 | Free | African Voices download
Hausa | Mozilla Common Voice (Hausa) | ASR | 300–500 | Free | Open
Hausa | ALFFA Hausa | ASR | 20 | Free | Open
Hausa | Appen Hausa | ASR/TTS | 500+ | Commercial | Open
Hausa | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Hausa | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Igbo | ANV (AfricanVoices) | ASR | 383 | Free | African Voices download
Igbo | Mozilla Common Voice (Igbo) | ASR | 80–150 | Free | Open
Igbo | Igbo Speech Dataset | ASR | 100 | Free | Open
Igbo | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Igbo | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
isiNdebele | ANV (Swivuriso) | ASR | 251.9 | Free | Hugging Face
isiNdebele | NCHLT Corpora | ASR | ~50 | Free | Open
isiNdebele | Mozilla Common Voice (various) | ASR | 10–150 | Free | Open
isiNdebele | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
isiXhosa | ANV (Swivuriso) | ASR | 504.3 | Free | Hugging Face
isiXhosa | NCHLT isiXhosa | ASR | 50 | Free | Open
isiXhosa | Mozilla Common Voice (isiXhosa) | ASR | 150–300 | Free | Open
isiXhosa | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
isiXhosa | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
isiZulu | Way With Words · waywithwords/www-za-zul-cx | ASR · conversational | 50 | Paid | View
isiZulu | ANV (Swivuriso) | ASR | 502.9 | Free | Hugging Face
isiZulu | NCHLT isiZulu | ASR | 50 | Free | Open
isiZulu | Mozilla Common Voice (isiZulu) | ASR | 200–400 | Free | Open
isiZulu | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
isiZulu | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Kalenjin | ANV (Kenya) | ASR | 521 | Free | Hugging Face
Kikuyu | ANV (Kenya) | ASR | 754 | Free | Hugging Face
Kinyarwanda | Mozilla Common Voice (Kinyarwanda) | ASR | 50–120 | Free | Open
Kinyarwanda | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Kinyarwanda | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Kituba | No major public datasets (reference notes) | - | - | Planned / roadmap | -
Lingala | Mozilla Common Voice (Lingala) | ASR | 10–30 | Free | Open
Lingala | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Luganda | Mozilla Common Voice (Luganda) | ASR | 40–100 | Free | Open
Luganda | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Luganda | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Maasai | ANV (Kenya) | ASR | 505 | Free | Hugging Face
Oromo | Mozilla Common Voice (Oromo) | ASR | 10–30 | Free | Open
Oromo | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Sango | Bible.is Audio Corpus | Speech/Text | <20 | Free | Open
seSotho | Way With Words · waywithwords/www-za-sot-cx | ASR · conversational | 50 | Paid | View
seSotho | ANV (Swivuriso) | ASR | 503.6 | Free | Hugging Face
seSotho | NCHLT Corpora | ASR | ~50 | Free | Open
seSotho | Mozilla Common Voice (various) | ASR | 10–150 | Free | Open
seSotho | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
seTswana | ANV (Swivuriso) | ASR | 502.2 | Free | Hugging Face
seTswana | NCHLT Corpora | ASR | ~50 | Free | Open
seTswana | Mozilla Common Voice (various) | ASR | 10–150 | Free | Open
seTswana | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Shona | Mozilla Common Voice | ASR | 50–150 | Free | Open
Shona | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Somali | ANV (Kenya) | ASR | 502 | Free | Hugging Face
Somali | Mozilla Common Voice (Somali) | ASR | 50–150 | Free | Open
Somali | Appen Somali | ASR | 100+ | Commercial | Open
Somali | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Somali | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Swahili | Mozilla Common Voice (Swahili) | ASR | 1000+ | Free | Open
Swahili | ALFFA Swahili | ASR | 20 | Free | Open
Swahili | OpenSLR Swahili datasets | ASR | 100–300 | Free | Open
Swahili | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Swahili | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
Tigrinya | Mozilla Common Voice (Tigrinya) | ASR | 10–20 | Free | Open
Tigrinya | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Tshiluba | No major public datasets (reference notes) | - | - | Planned / roadmap | -
Tshivenḓa | ANV (Swivuriso) | ASR | 250.9 | Free | Hugging Face
Tshivenḓa | NCHLT Corpora | ASR | ~50 | Free | Open
Tshivenḓa | Mozilla Common Voice (various) | ASR | 10–150 | Free | Open
Tshivenḓa | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Xitsonga | ANV (Swivuriso) | ASR | 500.1 | Free | Hugging Face
Xitsonga | NCHLT Corpora | ASR | ~50 | Free | Open
Xitsonga | Mozilla Common Voice (various) | ASR | 10–150 | Free | Open
Xitsonga | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Yoruba | ANV (AfricanVoices) | ASR | 361 | Free | African Voices download
Yoruba | Mozilla Common Voice (Yoruba) | ASR | 500+ | Free | Open
Yoruba | YorubaSpeech | ASR/TTS | 100 | Free | Open
Yoruba | ALFFA Yoruba (via Hausa baseline extension) | ASR | ~20 | Free | Open
Yoruba | Meta MMS | ASR/TTS | - | Free · Ongoing | Open
Yoruba | Google FLEURS | ASR benchmark | ~10 | Free · Ongoing | Open
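
As a quick cross-check, the per-language ANV hours in the table can be summed per programme and compared with the indicative totals on the initiative cards. The sketch below copies the hours from the table rows. The grouping dictionary `ANV_HOURS` and the helper `programme_totals` are illustrative names introduced here, not part of any published tooling.

```python
# Per-language hours copied from the ANV rows of the dataset table.
# The dictionary layout and helper name are illustrative assumptions.
ANV_HOURS = {
    "ANV (Swivuriso)": {
        "isiNdebele": 251.9,
        "isiXhosa": 504.3,
        "isiZulu": 502.9,
        "seSotho": 503.6,
        "seTswana": 502.2,
        "Tshivenḓa": 250.9,
        "Xitsonga": 500.1,
    },
    "ANV (Kenya)": {
        "Dholuo": 723,
        "Kalenjin": 521,
        "Kikuyu": 754,
        "Maasai": 505,
        "Somali": 502,
    },
    "ANV (AfricanVoices)": {
        "Hausa": 733,
        "Igbo": 383,
        "Yoruba": 361,
    },
}


def programme_totals(hours_by_programme: dict) -> dict:
    """Sum the listed hours per programme, rounded to one decimal place."""
    return {
        programme: round(sum(hours.values()), 1)
        for programme, hours in hours_by_programme.items()
    }


if __name__ == "__main__":
    for programme, total in programme_totals(ANV_HOURS).items():
        print(f"{programme}: {total} hours across {len(ANV_HOURS[programme])} languages")
```

With the table figures, the sums come to about 3,015.9 hours for Swivuriso, 3,005 for the Kenya track, and 1,477 for the three African Voices rows. The first two sit at or above the "~3k+" and "~2.5k+" headlines on the cards. The African Voices sum is below the portal's 1.9k+ headline; the table has no Nigerian Pidgin row, which may account for part of the gap.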

Community vote

Voters and partners shape what we build next. Vote for the proposal that fits your languages and use case—or suggest your own; we review every submission.

Cast your vote