Sustainable data governance

The Esethu Framework

A sustainable data curation framework designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. It reimagines how low-resource language datasets are created, licensed, and reinvested into future AI systems.

What is the Esethu Framework?

The Esethu Framework is a sustainable data curation and licensing model that supports repeatable, cost-aware dataset development while giving language communities clear governance over how their data is used. It is supported by the Esethu License, a novel community-centric data license.

Developed by Lelapa AI in collaboration with Way With Words and Data Science for Social Impact (DSFSI), the framework addresses structural inefficiencies in how low-resource language data is sourced and reused, aligning ethical governance with sustainable commercial pathways.

Our contribution

Way With Words contributed the speech data used to develop and validate the Esethu Framework. This data underpins the methodology and experiments described in the research presented at ACL 2025 and published on arXiv, and it supports the first proof-of-concept dataset released under the framework.

We are proud to partner with Lelapa AI and DSFSI to advance sustainable, community-centred practices for African language AI.

Key features

Sustainable licensing

The Esethu License introduces a community-aware commercial pathway: responsible use of language data with reinvestment into future dataset creation, so high-quality data remains available without repeated extraction cycles.

Community-led development

Local linguists and native speakers lead dataset creation, ensuring authenticity and diversity. The framework safeguards the interests of data creators while bridging resource gaps in automatic speech recognition (ASR) for African languages.

Scalability & replicability

The framework is designed to be applied across multiple low-resource languages, enabling consistent, repeatable dataset development that can scale across regions and use cases.

Proof of concept: ViXSD

The Vuk'uzenzele isiXhosa Speech Dataset (ViXSD) is the first dataset developed under the Esethu Framework and License. It is an open-source ASR corpus of read speech from native isiXhosa speakers, enriched with demographic and linguistic metadata. ViXSD demonstrates how community-driven licensing and curation can support voice-driven applications for isiXhosa while ensuring long-term, ethical data governance.

  • 10 hours of high-quality isiXhosa speech data
  • Diverse speakers across dialects, age groups, and regions
  • Ethical licensing that supports future isiXhosa data growth

View ViXSD on Hugging Face

Work with us on sustainable data

Interested in datasets under the Esethu Framework or in building ethical, community-centred speech data for other African languages? We'd love to hear from you.

Get in touch