Introducing WAXAL: A New Open Dataset for African Speech Technology
For people across much of the world, talking to devices is second nature. We ask for directions, get news updates, or transcribe voice notes without a second thought. But this convenience disappears when technology doesn’t speak your language. This is the reality for hundreds of millions of people, especially in Sub-Saharan Africa, where over 2,000 distinct languages are spoken. The main barrier to creating helpful voice technologies for this region has been a lack of accessible, high-quality speech data.
That’s why we’re introducing WAXAL. Taking its name from the Wolof word for "speak," this dataset was developed over three years to empower researchers and drive the development of inclusive technology across Africa. The collection provides data for 21 languages, including Acholi, Hausa, Luganda, and Yoruba, and contains over 11,000 hours of speech data from nearly 2 million individual recordings. This includes approximately 1,250 hours of transcribed speech for automatic speech recognition (ASR) and over 20 hours of studio recordings for text-to-speech (TTS) voice synthesis.
A project built by and for the community
WAXAL is a collaborative achievement, powered by the expertise of leading African organizations that were essential partners in creating this dataset. Our partners at Makerere University in Uganda and the University of Ghana led data collection for a combined 13 languages, while Digital Umuganda in Rwanda headed the effort for five major languages. For the high-quality voice recordings, we worked with regional experts at Media Trust and Loud n Clear. We also partnered with the African Institute for Mathematical Sciences (AIMS) on multilingual data for future releases.
This partnership framework ensures that our partners retain ownership of the data they collected, while working with us toward the shared goal of making these resources available to the global research community.
Capturing authentic speech, ethically
We wanted to capture how people really talk, so we asked participants to describe different pictures in their native languages. We also recorded professional voice actors in the studio to create the high-quality audio needed for text-to-speech technology.
We hope WAXAL will not only fuel innovation but also aid in the digital preservation of African languages. The complete WAXAL collection is released under an open license and is available today on Hugging Face. For a deep dive into our methodology, you can read the full paper.
Languages included in the dataset are: Acholi, Akan, Dagaare, Dagbani, Dholuo, Ewe, Fante, Fulani (Fula), Hausa, Igbo, Ikposo (Kposo), Kikuyu, Lingala, Luganda, Malagasy, Masaaba, Nyankole, Rukiga, Shona, Soga (Lusoga), Swahili, and Yoruba.