Tshivenda speech dataset
Part of Swivuriso (ZA-African Next Voices), a large-scale multilingual speech dataset for South African languages. This configuration contains high-quality, first-language Tshivenda speech: over 250 hours of scripted and unscripted audio, collected through ethical community-centred processes. Designed for ASR and inclusive speech technologies. Available free on Hugging Face under CC BY 4.0.
Key details
- Hours available: 250.9
- Speakers: 104
- Access: Available on Hugging Face
- Audio format: WAV (48 kHz mono)
- Accents: South African Tshivenda
Dataset details
- Hours available: 250.9
- Age range: 18 to 60+
- Download size: Available on Hugging Face
- Number of speakers: 104
- Audio format: WAV (48 kHz mono)
- Accents: South African Tshivenda
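The page does not state a download size, so the figures above can be turned into a rough estimate of the uncompressed audio. The sketch below is an approximation only: it assumes 16-bit PCM samples (the bit depth is not given here) and ignores WAV headers, transcripts, and any compression or repackaging applied on Hugging Face, so the actual download may differ.

```python
# Rough size estimate for the Tshivenda audio, based on the figures listed above.
HOURS = 250.9            # hours available (from the dataset details)
SAMPLE_RATE_HZ = 48_000  # 48 kHz (from the dataset details)
CHANNELS = 1             # mono (from the dataset details)
SPEAKERS = 104           # number of speakers (from the dataset details)
BYTES_PER_SAMPLE = 2     # ASSUMPTION: 16-bit PCM; the page does not state the bit depth


def wav_payload_bytes(hours: float, sample_rate_hz: int,
                      channels: int, bytes_per_sample: int) -> int:
    """Return the raw PCM payload size in bytes, excluding file headers."""
    seconds = hours * 3600
    return round(seconds * sample_rate_hz * channels * bytes_per_sample)


total_bytes = wav_payload_bytes(HOURS, SAMPLE_RATE_HZ, CHANNELS, BYTES_PER_SAMPLE)
hours_per_speaker = HOURS / SPEAKERS

print(f"Estimated raw audio: {total_bytes / 1e9:.1f} GB")             # 86.7 GB
print(f"Average per speaker: {hours_per_speaker:.2f} hours")          # 2.41 hours
```

Under these assumptions the raw audio comes to roughly 87 GB, or about 2.4 hours per speaker on average. Check the dataset card for the real download size.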
Additional information
What is Africa Next Voices?
Africa Next Voices (ANV) is a large-scale initiative supported by the Gates Foundation and a network of research and technology partners to expand high-quality speech datasets for African languages. In South Africa, the project was coordinated by the Data Science for Social Impact (DSFSI) group at the University of Pretoria. Way With Words acted as the data production and workflow partner — designing and running recording, transcription, proofing, and quality control to deliver the South African component.
How was this data collected?
The South African ANV dataset combines scripted and unscripted speech. Contributors were recruited from across the country and trained to record in their first language. Recordings were then transcribed, proofed, and quality-checked by language specialists. The result is thousands of hours of ethically collected, community-driven speech that reflects how people actually speak, rather than audio scraped from the web or generated synthetically.
How can I use this dataset?
The full multi-language dataset (Swivuriso) is available on Hugging Face, where you can load the data one language at a time (e.g. isiZulu, isiXhosa, Sesotho). Use restrictions apply: the data is not licensed for text-to-speech, voice cloning, or voice synthesis. For research, ASR, and language model training, see the dataset card and license on Hugging Face for the full terms.
Who contributed to this project?
Thousands of South Africans — recorders, proofreaders, and language assistants — gave their time and voices to build this resource. We honoured participants with personalised certificates and fair compensation. For a contributor’s perspective on what it meant to be part of ANV, read Beyond the Data; for how we recognised everyone involved, see Honouring the Individuals Who Made Africa Next Voices Possible.
More languages & resources
Swivuriso covers seven South African languages. On Hugging Face you can load each one by its language code (e.g. zul, xho, sot). The same use restrictions apply: not for TTS, voice cloning, or voice synthesis.
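Loading a single language by its code might look like the sketch below, which uses the Hugging Face `datasets` library. Several details are assumptions rather than facts from this page: the repository ID is a placeholder to be replaced with the one on the dataset card, `ven` is assumed to be the Tshivenda configuration name (following the `zul`, `xho`, `sot` pattern), and the `train` split and column names should be checked against the dataset card.

```python
from typing import Any

# ASSUMPTION: placeholder repository ID; copy the real one from the dataset card.
REPO_ID = "dsfsi-anv/za-african-next-voices"

# zul, xho and sot are the examples given on this page.
# ASSUMPTION: Tshivenda follows the same ISO 639-3 pattern and is published as "ven".
LANGUAGE_CODES = {
    "isiZulu": "zul",
    "isiXhosa": "xho",
    "Sesotho": "sot",
    "Tshivenda": "ven",
}


def config_for(language: str) -> str:
    """Map a language name to the configuration code used when loading."""
    try:
        return LANGUAGE_CODES[language]
    except KeyError:
        known = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(f"Unknown language {language!r}; expected one of: {known}") from None


def load_language(language: str, split: str = "train", streaming: bool = True) -> Any:
    """Load one language configuration (ASSUMPTION: a 'train' split exists).

    Streaming avoids downloading the whole configuration before the first example.
    """
    from datasets import load_dataset  # requires: pip install datasets

    return load_dataset(REPO_ID, config_for(language), split=split, streaming=streaming)


def preview_first_example(language: str = "Tshivenda") -> list[str]:
    """Return the column names of the first example; needs network access."""
    dataset = load_language(language)
    first = next(iter(dataset))
    return sorted(first.keys())
```

Calling `preview_first_example()` is a quick way to see which columns a configuration actually exposes before writing a training pipeline. The use restrictions above still apply to anything built from the loaded data.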