Community dataset vote

Cast your vote, shape what we build next

Help us prioritise the datasets that researchers, institutions, and product teams are most likely to need next. We are building high-quality, off-the-shelf datasets for commercial use, shaped by real demand. Where there is strong demand and clear wider value, we may also release selected datasets openly on Hugging Face for broader research use.

We are opening up our roadmap.

Instead of guessing what the community needs, we are inviting you to vote on the datasets you would like to see next, whether that is a specific language, accent, domain, or use case.

We also encourage researchers, universities, and institutions to contribute ideas, collaborate, or help support datasets they would like to see produced.

Live vote snapshot

20 votes total

Top-voted datasets move first into feasibility and production planning.

#1 current priority

Camfranglais Conversational Speech and Annotation Dataset (CCSAD)

Camfranglais

5 votes

#2 current priority

Multilingual Multi-Speaker Conversational Corpus

ENG, AFR, ZUL, SOT, XHO

4 votes

#3 current priority

Code-Switching Conversational Corpus

SWA, YOR, HAU

4 votes

Get your vote in by: Friday, 15 May 2026

How it works

STEP 1

Browse proposed datasets

STEP 2

Vote for what matters most

STEP 3

We prioritise by demand and feasibility

STEP 4

Access datasets as they become available

Why take part?

COMMUNITY-DRIVEN

Driven by real demand and aligned with what can be responsibly delivered.

OPEN ACCESS

Where possible, datasets are made accessible for research and commercial use.

AFRICAN-FOCUSED

Supporting underrepresented languages and accents.

HIGH QUALITY

Produced with the same standards as our commercial datasets.

Leaderboard

Total: 20 votes

Camfranglais Conversational Speech and Annotation Dataset (CCSAD)

Conversational Speech (Code-switching + Annotation) | Newly added

Camfranglais | Urban Conversations, Sociolinguistics, Digital Humanities | Recordings + Transcripts

5 votes (25%)

Multilingual Multi-Speaker Conversational Corpus

Conversational Speech (Multi-speaker)

ENG, AFR, ZUL, SOT, XHO | Customer Support, FinTech, Telecom | Recordings + Transcripts

4 votes (20%)

Code-Switching Conversational Corpus

Conversational Speech (Multi-speaker)

SWA, YOR, HAU | Customer Support, Healthcare, Agriculture | Recordings + Transcripts

4 votes (20%)

Parallel Multilingual Speech Translation Corpus (Healthcare)

Parallel Speech (Scripted + Translated)

ENG, AFR, ZUL, SOT, XHO | Healthcare | Recordings + Transcripts + Translations

3 votes (15%)

Multilingual Multimodal South African Culture Corpus

Multimodal (Image + Speech Prompting)

ENG, AFR, ZUL, SOT, XHO | South African culture | Recordings + Transcripts + Images

2 votes (10%)

Underrepresented SA Languages Conversational Corpus

Conversational Speech (Multi-speaker)

TSN, NSO, SSW, VEN, NBL, TSO | Customer Support, FinTech, Education | Recordings + Transcripts

2 votes (10%)

Everyday Language Instruction Benchmark (Q&A)

LLM Q&A (Instructional / How-to)

Shona | Daily Life, Home and Family, Education, Health Access, Micro-business | Validated Q&A Pairs

0 votes (0%)


Help us reach more voters

More votes help us prioritise datasets faster.

Language coverage

The interactive Languages across Africa map shows countries and dataset coverage for the languages in each zone.


Proposed collections

Compare each proposal by objective, modality, scale, and quality focus.


Option 1

Conversational Speech (Code-switching + Annotation) | Newly added

Camfranglais Conversational Speech and Annotation Dataset (CCSAD)

Structured Camfranglais conversational speech dataset capturing natural code-switching, lexical innovation, and context-dependent meaning in real usage contexts.

Type: Conversational Speech (Code-switching + Annotation)
Domain: Urban Conversations, Sociolinguistics, Digital Humanities
Languages: Camfranglais
Deliverables: Recordings + Transcripts

Objective: Build a research-oriented, ethically collected Camfranglais corpus that documents naturally occurring hybrid speech with layered annotations for linguistic research, preservation, and future low-resource language technology work.

Option 2

Conversational Speech (Multi-speaker)

Multilingual Multi-Speaker Conversational Corpus

Multi-speaker conversational recordings designed for realistic dialogue behaviour, sentiment analysis, and robust multilingual speech modelling with code-switching.

Type: Conversational Speech (Multi-speaker)
Domain: Customer Support, FinTech, Telecom
Languages: ENG, AFR, ZUL, SOT, XHO
Deliverables: Recordings + Transcripts

Objective: Create a high-coverage multilingual conversational dataset with natural turn-taking and domain-relevant interactions for speech and language model training.

Option 3

Conversational Speech (Multi-speaker)

Code-Switching Conversational Corpus

Multi-speaker conversational dataset focused on natural code-switching across Swahili, Yoruba, and Hausa for robust multilingual dialogue modelling.

Type: Conversational Speech (Multi-speaker)
Domain: Customer Support, Healthcare, Agriculture
Languages: SWA, YOR, HAU
Deliverables: Recordings + Transcripts

Objective: Build a high-quality conversational corpus that captures realistic turn-taking and intra-utterance code-switching patterns across three widely used African languages.

Option 4

Conversational Speech (Multi-speaker)

Underrepresented SA Languages Conversational Corpus

Multi-speaker conversational dataset focused on underrepresented South African languages with natural code-switching patterns for robust multilingual speech modelling.

Type: Conversational Speech (Multi-speaker)
Domain: Customer Support, FinTech, Education
Languages: TSN, NSO, SSW, VEN, NBL, TSO
Deliverables: Recordings + Transcripts

Objective: Build a high-quality conversational corpus that captures natural turn-taking and code-switching behaviour across six underrepresented South African languages.

Option 5

Multimodal (Image + Speech Prompting)

Multilingual Multimodal South African Culture Corpus

Rights-cleared cultural image corpus paired with multilingual spoken prompts to support vision-language grounding in underrepresented local contexts.

Type: Multimodal (Image + Speech Prompting)
Domain: South African culture
Languages: ENG, AFR, ZUL, SOT, XHO
Deliverables: Recordings + Transcripts + Images

Objective: Build a legally compliant multimodal benchmark where each image is paired with language-specific spoken prompts for model training and evaluation.

Option 6

LLM Q&A (Instructional / How-to)

Everyday Language Instruction Benchmark (Q&A)

A culturally grounded Shona instruction and Q&A dataset for practical everyday guidance, designed to support localised LLM training and evaluation.

Type: LLM Q&A (Instructional / How-to)
Domain: Daily Life, Home and Family, Education, Health Access, Micro-business
Languages: Shona
Deliverables: Validated Q&A Pairs

Objective: Build a domain-balanced Shona instruction and Q&A resource for practical everyday guidance, with approximately 1,500 validated pairs written in locally natural phrasing to support localised LLM evaluation and fine-tuning.

Option 7

Parallel Speech (Scripted + Translated)

Parallel Multilingual Speech Translation Corpus (Healthcare)

Parallel healthcare speech corpus with aligned multilingual scripts to improve ASR, speech translation, and cross-lingual instruction fidelity.

Type: Parallel Speech (Scripted + Translated)
Domain: Healthcare
Languages: ENG, AFR, ZUL, SOT, XHO
Deliverables: Recordings + Transcripts + Translations

Objective: Create a high-quality parallel speech dataset with semantic alignment across five languages for multilingual, voice-to-voice healthcare communication tasks.

Cast your vote

One vote per email address.

Optional and no pressure - your vote matters equally either way.

By submitting, you agree that Way With Words may use the information you provide to administer this initiative and understand language-priority demand. See our Privacy Notice.

Who voted and for what

Displayed as Name + Surname initial + organisation + selected dataset.

Voter | Company / Institution | Dataset
Anonymous voter | Withheld | Multilingual Multi-Speaker Conversational Corpus
Anonymous voter | Withheld | Code-Switching Conversational Corpus
Anonymous voter | Withheld | Multilingual Multimodal South African Culture Corpus
Anonymous voter | Withheld | Parallel Multilingual Speech Translation Corpus (Healthcare)
Paul F. | Open Cities Lab | Multilingual Multimodal South African Culture Corpus
Anonymous voter | Withheld | Parallel Multilingual Speech Translation Corpus (Healthcare)
Anonymous voter | Withheld | Code-Switching Conversational Corpus
Anonymous voter | Withheld | Code-Switching Conversational Corpus
Howard L. | Gates foundation | Camfranglais Conversational Speech and Annotation Dataset (CCSAD)
Anonymous voter | Withheld | Multilingual Multi-Speaker Conversational Corpus

Showing 1-10 of 20


Have another dataset idea?

Suggest a new dataset for internal review. You can choose multiple languages and dataset types.