Community dataset vote

Cast your vote, shape what we build next

Help us prioritise the datasets that matter most. We are building high-quality, off-the-shelf datasets for commercial use, shaped by real demand. Where there is strong interest and clear community benefit, we will also explore opportunities to release datasets openly on Hugging Face.

We are opening up our roadmap.

Instead of guessing what the community needs, we are inviting you to vote on the datasets you would like to see next, whether that is a specific language, accent, domain, or use case.

We also encourage researchers, universities, and institutions to contribute ideas, collaborate, or help support datasets they would like to see produced.

Live vote snapshot

0 total votes

Top-voted datasets move first into feasibility and production planning.

#1 current priority

Multilingual Multimodal South African Culture Corpus

ENG, AFR, ZUL, SOT, XHO

0 votes

#2 current priority

Parallel Multilingual Speech Translation Corpus (Healthcare)

ENG, AFR, ZUL, SOT, XHO

0 votes

#3 current priority

Multilingual Multi-Speaker Conversational Corpus

ENG, AFR, ZUL, SOT, XHO

0 votes

Get your vote in by: Friday, 01 May 2026

How it works

STEP 1

Browse proposed datasets

STEP 2

Vote for what matters most

STEP 3

We prioritise by demand and feasibility

STEP 4

Access datasets as they become available

Why take part?

COMMUNITY-DRIVEN

Driven by real demand and aligned with what can be responsibly delivered.

OPEN ACCESS

Where possible, datasets are made accessible for research and commercial use.

AFRICAN-FOCUSED

Supporting underrepresented languages and accents.

HIGH QUALITY

Produced with the same standards as our commercial datasets.

Leaderboard

Total votes: 0

Multilingual Multimodal South African Culture Corpus

Multimodal (Image + Speech Prompting)

ENG, AFR, ZUL, SOT, XHO | South African culture | Recordings + Transcripts + Images

0 votes (0%)

Parallel Multilingual Speech Translation Corpus (Healthcare)

Parallel Speech (Scripted + Translated)

ENG, AFR, ZUL, SOT, XHO | Healthcare | Recordings + Transcripts + Translations

0 votes (0%)

Multilingual Multi-Speaker Conversational Corpus

Conversational Speech (Multi-speaker)

ENG, AFR, ZUL, SOT, XHO | Customer Support, FinTech, Telecom | Recordings + Transcripts

0 votes (0%)

Code-Switching Conversational Corpus

Conversational Speech (Multi-speaker)

SWA, YOR, HAU | Customer Support, Healthcare, Agriculture | Recordings + Transcripts

0 votes (0%)

Underrepresented SA Languages Conversational Corpus

Conversational Speech (Multi-speaker)

TSN, NSO, SSW, VEN, NBL, TSO | Customer Support, FinTech, Education | Recordings + Transcripts

0 votes (0%)


Help us reach more voters

More votes help us prioritise datasets faster.

Language coverage

See the interactive Languages across Africa map for the countries in each zone and the dataset coverage of their languages.

Proposed collections

Compare each proposal by objective, modality, scale, quality focus, and risk profile.

Option 1

Conversational Speech (Multi-speaker)

Multilingual Multi-Speaker Conversational Corpus

Multi-speaker conversational recordings designed for realistic dialogue behaviour, sentiment analysis, and robust multilingual speech modelling with code-switching.

Type: Conversational Speech (Multi-speaker)
Domain: Customer Support, FinTech, Telecom
Languages: ENG, AFR, ZUL, SOT, XHO
Deliverables: Recordings + Transcripts

Objective: Create a high-coverage multilingual conversational dataset with natural turn-taking and domain-relevant interactions for speech and language model training.

Option 2

Multimodal (Image + Speech Prompting)

Multilingual Multimodal South African Culture Corpus

Rights-cleared cultural image corpus paired with multilingual spoken prompts to support vision-language grounding in underrepresented local contexts.

Type: Multimodal (Image + Speech Prompting)
Domain: South African culture
Languages: ENG, AFR, ZUL, SOT, XHO
Deliverables: Recordings + Transcripts + Images

Objective: Build a legally compliant multimodal benchmark where each image is paired with language-specific spoken prompts for model training and evaluation.

Option 3

Parallel Speech (Scripted + Translated)

Parallel Multilingual Speech Translation Corpus (Healthcare)

Parallel healthcare speech corpus with aligned multilingual scripts to improve ASR, speech translation, and cross-lingual instruction fidelity.

Type: Parallel Speech (Scripted + Translated)
Domain: Healthcare
Languages: ENG, AFR, ZUL, SOT, XHO
Deliverables: Recordings + Transcripts + Translations

Objective: Create a high-quality parallel speech dataset with semantic alignment across five languages for multilingual, voice-to-voice healthcare communication tasks.

Option 4

Conversational Speech (Multi-speaker)

Code-Switching Conversational Corpus

Multi-speaker conversational dataset focused on natural code-switching across Swahili, Yoruba, and Hausa for robust multilingual dialogue modelling.

Type: Conversational Speech (Multi-speaker)
Domain: Customer Support, Healthcare, Agriculture
Languages: SWA, YOR, HAU
Deliverables: Recordings + Transcripts

Objective: Build a high-quality conversational corpus that captures realistic turn-taking and intra-utterance code-switching patterns across three widely used African languages.

Option 5

Conversational Speech (Multi-speaker)

Underrepresented SA Languages Conversational Corpus

Multi-speaker conversational dataset focused on underrepresented South African languages with natural code-switching patterns for robust multilingual speech modelling.

Type: Conversational Speech (Multi-speaker)
Domain: Customer Support, FinTech, Education
Languages: TSN, NSO, SSW, VEN, NBL, TSO
Deliverables: Recordings + Transcripts

Objective: Build a high-quality conversational corpus that captures natural turn-taking and code-switching behaviour across six underrepresented South African languages.

Cast your vote

One vote per email address.

Optional and no pressure - your vote matters equally either way.

Who voted and for what

Each vote is displayed as name, surname initial, organisation, and selected dataset.

Voter | Company / Institution | Dataset
No votes yet. Be the first to vote.

Have another dataset idea?

Suggest a new dataset for internal review. You can choose multiple languages and dataset types.