Tshivenda speech dataset
Part of Swivuriso (ZA-African Next Voices), a large-scale multilingual speech dataset for South African languages. This configuration contains high-quality, first-language Tshivenda speech: over 250 hours of scripted and unscripted audio, collected through ethical community-centred processes. Designed for ASR and inclusive speech technologies. Available free on Hugging Face under CC BY 4.0.
Key details
- Hours available: 250.9 hours
- Speakers: 104
- Access: Available on Hugging Face
- Audio format: WAV (48 kHz mono)
- Accents: South African Tshivenda
Dataset details
Hours available: 250.9 hours
Age range: 18 to 60+
Download size: Available on Hugging Face
Number of speakers: 104
Audio format: WAV (48 kHz mono)
Accents: South African Tshivenda
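The audio format listed above can be checked locally once files have been downloaded. The following is a minimal sketch using Python's standard `wave` module, assuming the files are PCM-encoded WAV. The function name `check_wav_format` and the keys it returns are illustrative and are not part of any tooling shipped with the dataset.

```python
import wave

# Values taken from the "Audio format" entry above: WAV, 48 kHz, mono.
EXPECTED_RATE_HZ = 48_000
EXPECTED_CHANNELS = 1


def check_wav_format(path: str) -> dict:
    """Read a WAV header and compare it with the listed 48 kHz mono spec.

    Assumption: the file is PCM-encoded, which is what the standard
    `wave` module is designed to read.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        frames = wav.getnframes()

    return {
        "sample_rate_hz": rate,
        "channels": channels,
        "sample_width_bytes": sample_width,
        "duration_s": frames / rate if rate else 0.0,
        "matches_spec": rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS,
    }
```

Calling `check_wav_format("clip.wav")` on a downloaded recording (the file name here is hypothetical) returns the header values alongside a `matches_spec` flag, which makes it easy to spot files that were resampled or converted to stereo along the way.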
Additional information
How are dataset recordings structured?
Our off-the-shelf dataset collections comprise unscripted, natural conversations conducted by call recorders who are recruited, trained, and approved to simulate real-world conversations in common domains. Recordings and transcripts include routine security verifications such as ID, email, and phone number validation.
How do you recruit for speech collection datasets?
Our priority is to create datasets that are unbiased and cover as wide a range of demographics as possible. This is the first consideration when we plan and recruit for any speech collection project.
What kind of agreement is in place for the purchase of this dataset?
A Licence Agreement governs the sale and usage of this speech collection dataset. Our off-the-shelf options let clients test and benchmark before considering larger, custom commitments that are better suited to their requirements and conventions.
More languages & resources
Swivuriso covers seven South African languages. On Hugging Face, you can load each language by its code (e.g. zul, xho, sot). Use restrictions apply: the data may not be used for TTS, voice cloning, or voice synthesis.
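Loading by language code can be sketched with the Hugging Face `datasets` library, as below. This page does not give the repository ID, so `repo_id` is a placeholder to be filled in from the dataset's Hugging Face listing. The code `ven` for Tshivenda is an assumption that follows the three-letter pattern of the examples above, and the `train` split name is likewise assumed. The helper names are illustrative only.

```python
from typing import Any


def build_load_kwargs(repo_id: str, lang_code: str, split: str = "train") -> dict[str, Any]:
    """Build keyword arguments for `datasets.load_dataset`.

    `repo_id` is a placeholder supplied by the caller; the page above
    does not state it. The `split` default is an assumption.
    """
    code = lang_code.strip().lower()
    # The examples in the text (zul, xho, sot) are three-letter codes.
    if len(code) != 3 or not code.isalpha():
        raise ValueError(f"expected a three-letter language code, got {lang_code!r}")
    return {
        "path": repo_id,
        "name": code,        # configuration name, one per language
        "split": split,
        "streaming": True,   # iterate without downloading all audio up front
    }


def load_language(repo_id: str, lang_code: str, split: str = "train"):
    """Stream one language configuration (requires `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily; third-party package

    return load_dataset(**build_load_kwargs(repo_id, lang_code, split))
```

For example, `load_language("<org>/<dataset>", "ven")` would stream the Tshivenda configuration once the real repository ID is substituted. Streaming is used here because the configuration holds over 250 hours of 48 kHz audio, which is large to download in full just to inspect a few samples.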