The Esethu Framework
A sustainable data curation framework designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. It reimagines how low-resource language datasets are created, licensed, and reinvested into future AI systems.
What is the Esethu Framework?
The Esethu Framework is a sustainable data curation and licensing model that supports repeatable, cost-aware dataset development while giving language communities clear governance over how their data is used. It is supported by the Esethu License, a novel community-centric data license.
Developed by Lelapa AI in collaboration with Way With Words and Data Science for Social Impact (DSFSI), the framework addresses structural inefficiencies in how low-resource language data is sourced and reused, and it aligns ethical governance with sustainable commercial pathways.
Our contribution
Way With Words contributed the speech data used to develop and validate the Esethu Framework. This data underpins the methodology and experiments described in the research presented at ACL 2025 and published on arXiv, and it supports the first proof-of-concept dataset released under the framework.
We are proud to partner with Lelapa AI and DSFSI to advance sustainable, community-centred practices for African language AI.
Key features
Sustainable licensing
The Esethu License introduces a community-aware commercial pathway that pairs responsible use of language data with reinvestment into future dataset creation, so high-quality data remains available without repeated extraction cycles.
Community-led development
Local linguists and native speakers lead dataset creation, ensuring authenticity and diversity. The framework safeguards the interests of data creators while helping to close resource gaps in automatic speech recognition (ASR) for African languages.
Scalability & replicability
The framework is designed to be applied across multiple low-resource languages, enabling consistent, repeatable dataset development that can scale across regions and use cases.
Proof of concept: ViXSD
The Vuk'uzenzele isiXhosa Speech Dataset (ViXSD) is the first dataset developed under the Esethu Framework and License. It is an open-source ASR corpus of read speech from native isiXhosa speakers, enriched with demographic and linguistic metadata. ViXSD demonstrates how community-driven licensing and curation can support voice-driven applications for isiXhosa while ensuring long-term, ethical data governance.
- 10 hours of high-quality isiXhosa speech data
- Diverse speakers across dialects, age groups, and regions
- Ethical licensing that supports future isiXhosa data growth
Resources & links
Explore the framework, papers, and dataset.
- Framework & research paper (arXiv): "The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages"
- ACL 2025 long paper: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Vienna.
- Lelapa AI announcement: "A Global First: How a New Sustainable Data Framework & License Are Transforming Language AI"
- ViXSD dataset (Hugging Face): Vuk'uzenzele isiXhosa Speech Dataset, the first dataset developed under the Esethu Framework.
- Esethu License: Community-centric data license supporting equitable benefit-sharing.
Work with us on sustainable data
Interested in datasets under the Esethu Framework or in building ethical, community-centred speech data for other African languages? We'd love to hear from you.
Get in touch