AI speech generator 'reaches human parity' — but it's too dangerous to release, scientists say
Microsoft's VALL-E 2 can convincingly recreate human voices using just a few seconds of audio, its creators claim.
Microsoft has developed a new artificial intelligence (AI) speech generator that is apparently so convincing it cannot be released to the public.
VALL-E 2 is a text-to-speech (TTS) generator that can reproduce the voice of a human speaker using just a few seconds of audio.
Microsoft researchers said VALL-E 2 was capable of generating "accurate, natural speech in the exact voice of the original speaker, comparable to human performance," in a paper that appeared June 17 on the pre-print server arXiv. In other words, the new AI voice generator is convincing enough to be mistaken for a real person — at least, according to its creators.
"VALL-E 2 is the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time," the researchers wrote in the paper. "Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases."
Human parity in this context means that speech generated by VALL-E 2 matched or exceeded the quality of human speech in benchmarks used by Microsoft.
The AI engine achieves this through two key features: "Repetition Aware Sampling" and "Grouped Code Modeling."
Repetition Aware Sampling improves the way the AI converts text into speech by addressing repetitions of "tokens" — small units of language, like words or parts of words — preventing infinite loops of sounds or phrases during the decoding process. In other words, this feature helps vary VALL-E 2's pattern of speech, making it sound more fluid and natural.
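The paper frames this in terms of how codec tokens are sampled during decoding. As a rough, simplified illustration of the underlying idea (not Microsoft's actual implementation), the Python sketch below prefers nucleus sampling but falls back to sampling from the full distribution when the chosen token keeps repeating in a recent window; the function names, window size and repetition threshold here are assumptions made for the example.

```python
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample a token id from the top-p (nucleus) subset of a distribution."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

def repetition_aware_sample(probs, history, window=10, max_repeats=3, rng=None):
    """Illustrative decoding step: use nucleus sampling by default, but if the
    chosen token already appears too often in the recent history, resample from
    the full distribution to break out of the repetition loop."""
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, rng=rng)
    if history[-window:].count(token) >= max_repeats:
        # Token is stuck repeating: draw from the full distribution instead.
        token = int(rng.choice(len(probs), p=probs))
    return token

# Toy usage: decode a few tokens from a fixed five-token distribution.
rng = np.random.default_rng(0)
probs = np.array([0.05, 0.05, 0.8, 0.05, 0.05])
history = []
for _ in range(8):
    history.append(repetition_aware_sample(probs, history, rng=rng))
print(history)
```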
Grouped Code Modeling, meanwhile, improves efficiency by reducing the sequence length — or the number of individual tokens that the model processes in a single input sequence. This speeds up how quickly VALL-E 2 generates speech and helps manage difficulties that come with processing long strings of sounds.
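Conceptually, grouping shortens the sequence by letting the model treat several consecutive codec tokens as a single position. The following Python sketch, again a simplified illustration rather than the paper's implementation, shows how a long token sequence could be partitioned into fixed-size groups; the group size and padding scheme are assumptions made for the example.

```python
import numpy as np

def group_codec_tokens(tokens, group_size=4, pad_id=0):
    """Partition a long sequence of codec tokens into fixed-size groups so a
    model attends over roughly len(tokens) / group_size positions instead of
    one position per token."""
    tokens = list(tokens)
    remainder = (-len(tokens)) % group_size
    tokens += [pad_id] * remainder               # pad so the length divides evenly
    return np.array(tokens).reshape(-1, group_size)  # each row is one model position

# Example: a 10-token sequence becomes 3 grouped positions (with padding).
print(group_codec_tokens(range(1, 11), group_size=4))
```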
The researchers used audio samples from speech libraries LibriSpeech and VCTK to assess how well VALL-E 2 matched recordings of human speakers. They also used ELLA-V — an evaluation framework designed to measure the accuracy and quality of generated speech — to determine how effectively VALL-E 2 handled more complex speech generation tasks.
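One way this kind of matching is commonly quantified in text-to-speech evaluation is speaker similarity: the cosine similarity between speaker-verification embeddings extracted from the original recording and from the generated audio. The sketch below illustrates that comparison; extract_speaker_embedding is a hypothetical stand-in for a pretrained speaker encoder and is not something defined in the paper.

```python
import numpy as np

def speaker_similarity(embedding_a, embedding_b):
    """Cosine similarity between two speaker embeddings; values closer to 1
    suggest the two recordings are more likely to come from the same speaker."""
    a = np.asarray(embedding_a, dtype=float)
    b = np.asarray(embedding_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, where extract_speaker_embedding stands in for any
# pretrained speaker-verification encoder:
# sim = speaker_similarity(extract_speaker_embedding(prompt_wav),
#                          extract_speaker_embedding(generated_wav))
```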
"Our experiments, conducted on the LibriSpeech and VCTK datasets, have shown that VALL-E 2 surpasses previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity," the researchers wrote. "It is the first of its kind to reach human parity on these benchmarks."
The researchers pointed out in the paper that the quality of VALL-E 2’s output depended on the length and quality of speech prompts — as well as environmental factors like background noise.
"Purely a research project"
Despite its capabilities, Microsoft will not release VALL-E 2 to the public because of the potential for misuse. The decision comes amid growing concerns about voice cloning and deepfake technology; other AI companies, including OpenAI, have placed similar restrictions on their own voice technologies.
"VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public," the researchers wrote in a blog post. "It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker."
That said, they did suggest AI speech tech could see practical applications in the future. "VALL-E 2 could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chatbot, and so on," the researchers added.
They continued: "If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model."
Owen Hughes is a freelance writer and editor specializing in data and digital technologies. Previously a senior editor at ZDNET, Owen has been writing about tech for more than a decade, during which time he has covered everything from AI, cybersecurity and supercomputers to programming languages and public sector IT. Owen is particularly interested in the intersection of technology, life and work – in his previous roles at ZDNET and TechRepublic, he wrote extensively about business leadership, digital transformation and the evolving dynamics of remote work.