With Mozilla

By Pavan Rauch

Pavan’s build4good experience

This summer, I interned with the Mozilla Foundation to help automate the standardization of free-form accent descriptions in Mozilla's Common Voice dataset.

About the internship

Context

The Mozilla Foundation is a non-profit dedicated to making digital technology open and accessible. One of its projects is Common Voice, a platform that crowdsources datasets of transcribed speech in over 130 languages. The project is open to the public, meaning speakers can record audio clips on the platform and researchers can download the data for free. Common Voice's goal is to make speech technology accessible to those who do not speak a dominant language and to directly involve language communities in that process.

Common Voice allows contributors to define their accent on their account page. The accent label is then stored with each of their audio recordings in the dataset. This accent data is meant to help mitigate the accent bias of speech technologies, which currently perform poorly for speakers with nonstandard accents.

The Challenge

The Common Voice website allows users to either select their accent from a predefined list or describe it themselves in a free-form text box. The free-form text is necessary, as no list of accents can ever cover all the variation in real-world accents. However, letting users write their descriptions without any constraints means that two speakers who share an accent will usually describe it differently in the dataset. This makes it difficult to group speakers together by accent. Right now, Common Voice's accent data goes largely unused because the inconsistent free-form labels cannot be fed directly to machine learning models.

My goal was to investigate different ways of automatically grouping the free-form accent descriptions into a consistent format so that researchers can use accent labels as a feature.

My Solution

Before implementing any grouping algorithms, I had to define what their outputs would look like. I needed each algorithm to group the accent descriptions in its own way and then store that grouping in a format that was the same for all algorithms. This would make it easy to compare the algorithms' outputs against each other. My solution was to represent groupings as a system of tags, where each tag on an audio clip represents a single aspect of the accent in that clip and no two tags in the system have the same meaning. The tags are organized into a hierarchy of super- and sub-accents; for example, "Midwest United States" is a sub-accent of "United States". Using this model, I assigned tags to the English and Catalan datasets by hand to create a baseline set of labels. To consistently evaluate the quality of automatically generated tags, I developed an evaluation algorithm that estimates how similar an automatic result is to the baseline.
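The hierarchy and evaluation ideas above can be sketched in a few lines. This is my own illustration, not the project's actual code: the parent map, tag names, and the Jaccard-overlap similarity are all stand-ins for whatever representation and metric the real evaluation algorithm uses.

```python
# Illustrative sketch only: a tag hierarchy as a child -> parent map.
# Tag names and the similarity measure are hypothetical stand-ins.
PARENT = {
    "midwest_united_states": "united_states",
    "united_states": None,  # top-level accent, no super-accent
}

def with_ancestors(tags):
    """Expand a set of tags so each tag also implies its super-accents."""
    expanded = set()
    for tag in tags:
        while tag is not None:
            expanded.add(tag)
            tag = PARENT.get(tag)
    return expanded

def similarity(auto_tags, baseline_tags):
    """Jaccard overlap between ancestor-expanded tag sets, a simple
    proxy for comparing an automatic result against the hand-made baseline."""
    a, b = with_ancestors(auto_tags), with_ancestors(baseline_tags)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Expanding ancestors before comparing means an automatic label of "Midwest United States" still gets partial credit against a baseline label of "United States", which matches the intuition that a sub-accent is not simply wrong about its super-accent.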

I started by researching techniques for grouping short texts by their meaning. I identified three main approaches to explore: clustering, normalization, and classification.

Once the evaluation method was in place, I moved on to implementing clustering, normalization, and classification. I used a variety of techniques, including pre-trained language models and external knowledge sources. I programmed each algorithm as a series of interchangeable steps in a pipeline, which allowed me to easily adjust the way that text was being processed and grouped. I evaluated the different algorithms and their variations and presented the results to Mozilla. The project was conducted in Python and made use of various data mining techniques, libraries, and resources.
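The "interchangeable steps" design can be sketched as functions chained over a list of descriptions. The step functions below are hypothetical examples of simple preprocessing; the actual pipeline stages were the clustering, normalization, and classification components described above.

```python
# Sketch of a pipeline of interchangeable steps: each step maps a list
# of descriptions to a transformed list, so steps can be swapped freely.
import string

def lowercase(descriptions):
    return [d.lower() for d in descriptions]

def strip_punctuation(descriptions):
    table = str.maketrans("", "", string.punctuation)
    return [d.translate(table) for d in descriptions]

def run_pipeline(descriptions, steps):
    for step in steps:
        descriptions = step(descriptions)
    return descriptions

cleaned = run_pipeline(["Midwest, U.S."], [lowercase, strip_punctuation])
```

Because every step shares the same list-in, list-out signature, adjusting how text is processed and grouped is just a matter of reordering or replacing entries in the `steps` list.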

A group of young adults smiling for a photo

What stands out to me about my build4good project is the way it was uniquely tailored to my skills and interests. The project took advantage of my existing data and research skills and built on them, grounding my abilities in experience. I am more confident with data science after completing the internship, and I was especially able to develop the parts of my skill set that are unique to me. Improving these skills while working on a sociolinguistic problem, a domain I've always been interested in but never had access to, was an experience I am incredibly grateful for. I do not think I could have worked on that kind of project this early in my career without build4good's matching program.

Reflection

The summer also brought many opportunities to meet others who care about making technology that helps people. Working with the Mozilla Foundation and learning about the Common Voice project was incredible; talking to professionals whose work has real impact inspired me to develop goals for my own career path. build4good's guest speakers prepared me for that path by teaching me what public impact careers can look like in practice. Most of all, I enjoyed spending time with the other interns in Washington, D.C. It is rare to find other technologists with the same passions, and I really valued getting to know everyone's unique backgrounds.