The race to build conversational voice AI is speeding up. OpenAI, Google DeepMind, Microsoft AI, and dozens of startups are competing to make AI sound more natural, emotional, responsive, and human. We’re moving rapidly from typing at machines to speaking with them.
Hidden within the excitement is an important cultural question: what happens when the systems learning to ‘hear’ humanity are trained unevenly across humanity itself?
One of the most influential studies on speech-recognition bias analysed five leading automated speech-recognition systems developed by major technology companies. Researchers found substantially higher word error rates for Black speakers than for white speakers across all five systems: an average of 35% compared with 19%.
The researchers concluded that speech-recognition systems “may exhibit substantial racial disparities” because of differences in the data used to train them.
The findings highlighted a core issue in AI development: speech systems learn from data distributions. When certain accents, dialects, or speech communities show up less frequently in training datasets, performance drops for those groups in real-world use.
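For readers who want to see how a gap like this is measured, here is a minimal Python sketch of word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The function names, group labels, and sample transcripts are invented for illustration and are not taken from the study, and real evaluations usually apply more careful text normalisation than the simple lowercasing assumed here.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average WER per group for (group, reference, hypothesis) tuples."""
    scores = defaultdict(list)
    for group, reference, hypothesis in samples:
        scores[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Hypothetical data: made-up groups and transcripts, for illustration only.
samples = [
    ("group_a", "play the next song", "play the next song"),
    ("group_a", "call my sister after lunch", "call my sister after launch"),
    ("group_b", "play the next song", "play next song"),
    ("group_b", "call my sister after lunch", "call my system after launch"),
]

for group, wer in sorted(mean_wer_by_group(samples).items()):
    print(f"{group}: mean WER {wer:.1%}")
```

Comparing the per-group averages is the basic shape of the disparity analysis described above: the same system, the same kind of speech, and a different error rate depending on who is speaking.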
Recent research has also identified accent-related disparities in synthetic AI voice systems. A 2025 study examining services including Speechify and ElevenLabs found uneven representation across five regional English-language accents – and researchers warned that synthetic voices can contribute to forms of digital exclusion.
Voice AI is moving steadily into everyday digital services – from assistants and transcription tools to translation and accessibility products. As this happens, performance gaps make it more difficult for some people to access those services.
The next wave of research gets even more interesting – because AI is now generating speech, as well as recognising it.
Studies into synthetic AI voices have found that some participants felt generated voices failed to reflect their identity, community, or cultural background accurately. Several participants described AI-generated accents as generic, or emotionally disconnected from how people in their communities actually speak.
This opens up a much deeper conversation about language and identity in the AI era.
An accent carries so much: the way we speak gives clues about where we come from and who we feel connected to.
Historically, dominant accents were seen as more socially valuable than others. And AI systems can amplify those patterns globally when models optimise around the speech patterns most represented in training data.
Scale changes everything here.
Human bias tends to spread gradually through institutions and social systems. AI systems can distribute the same speech norms across billions of interactions every day.
Alongside the development of voice AI, the world is facing a wider linguistic challenge.
UNESCO’s multilingualism initiatives stress that low-resource, Indigenous, and endangered languages need stronger digital support to participate fully in the digital era. Many languages currently have limited archival material, and very little AI-compatible speech or text data.
AI models rely heavily on huge volumes of digital language data. So languages with limited online content are far less present in advanced AI systems – creating a new form of digital inequality.
Communities without strong digital language representation often receive:
And this is where the story gets more nuanced – because AI also offers extraordinary opportunities for preservation.
Google’s Universal Speech Model (USM) was designed to support speech recognition across more than 100 languages using a massive multilingual speech dataset. According to Google Research, the model was trained using more than 12 million hours of speech data and 28 billion text sentences as part of the company’s broader 1,000 Languages Initiative.
Mozilla’s open-source Common Voice project is pursuing a similar goal by crowdsourcing speech data from volunteers around the world to improve language diversity in AI training datasets.
Researchers are also exploring how AI can help:
It’s possible that for some communities, AI tools could support the creation of large-scale digital archives of spoken language and cultural history. We think that’s an incredible use case for AI.
The future of AI communication will be influenced by the voices that are heard clearly, by the speech patterns that become dominant, and by whether we treat linguistic diversity as something valuable enough to take care of.
The internet has always influenced how humans communicate – and voice AI could influence how future generations sound.
Open this newsletter on LinkedIn and tell us what you think: should AI systems adapt to human linguistic diversity, or will humans gradually adapt the way they speak to AI?