Supporting Natural Language Processing (NLP) in Africa
Language is what connects us to each other and the world around us. While Africa is home to a third of the world's languages, technology is not yet available for many of its languages. This is an important challenge to tackle because language is more than a vehicle for communication. It is also a marker of identity, belonging, and opportunity. This is why we want to make sure you can understand and be understood, in any language of your choosing. It's a significant technical challenge to make this dream a reality, but we’re committed to and working towards this goal.
One of the challenges everyone faces in this space is the scarcity of machine-readable language data that can be used to build technology. For many languages, such data is difficult to find or simply does not exist. Diversity gaps in Natural Language Processing (NLP) education and academia also narrow representation among language technologists working on lesser-resourced languages. Democratizing access to data for underrepresented languages and increasing NLP education help drive NLP research and advance language technology.
As part of our continued commitment to and investment in digital transformation in Africa, Google teams have been working on programs to advance language technologies that serve the region. These include adding 24 new languages to Google Translate, as announced earlier at I/O (including Bambara, Ewe, Krio, Lingala, Luganda, Tsonga and Twi), researching how to build speech recognition in African languages, and supporting local researchers through initiatives like Lacuna Fund. Community initiatives launched in India have expanded to Africa, resulting in open-sourced, crowdsourced datasets for speech applications in Nigerian English and Yoruba, and new community initiatives and workshops like Explore ML with Crowdsource are gaining momentum in multiple African countries. We also hosted our first community workshop on NLP and African languages at our growing AI research center in Ghana, which is itself looking into how to advance NLP for African languages.
A more recent example of our language initiatives on the continent is a partnership with Africans to invest in African languages and NLP technology: in collaboration with Zindi, a social enterprise and professional network for data science, we organized a series of NLP hackathons in Africa. The series included an Africa Automatic Speech Recognition (ASR) workshop and three hackathon challenges centered on model training for speech recognition, sentiment analysis, and speech data collection.
The interactive workshop aimed to increase awareness of and skills in NLP in Africa, especially among researchers, students, and data scientists new to NLP. It provided a beginner-friendly introduction to NLP and ASR, including a step-by-step guide on how to train a speech model for a new language. Participants also learned about the challenges and progress of work in the African NLP space, as well as opportunities to get involved with data science and grow their careers.
In the Intro to Speech Recognition Africa Challenge, participants collected speech data for African languages and used it to train their own speech recognition models. The challenge generated new datasets in African languages, including the open-source datasets released by the challenge winners in Fongbe, Wolof, Swahili, Baule, Dendi, Chichewa and Khartoum Arabic. These datasets enable further research, collaboration, and development of technology for these languages.
We partnered with Data Scientists Network (DSN) to organize the West Africa Speech Recognition Challenge, which, according to Toyin Adekanmbi, the Executive Director of DSN, gave participants an “immersive experience to sharpen their skills as they learned to solve local problems”. Participants worked to train their own speech recognition models for Hausa, spoken by an estimated 72 million people, using open-source data from the Mozilla Common Voice platform.
In the Swahili Social Media Sentiment Analysis Challenge, held across Tanzania, Malawi, Kenya, Rwanda and Uganda, participants open-sourced models that classified whether the sentiment of a tweet was positive, negative, or neutral. These challenges allowed participants with similar interests to connect with each other in a supportive environment and improve their machine learning and NLP skills.
Our focus on empowering people to use technology in the language of their choice continues. Across many teams, we are on a mission to advance language technologies for African languages and increase NLP skills and education in the region, so that we can collectively build a world that is truly accessible for everyone, irrespective of the language they speak.