Engineering voice impersonation from machine learning

July 23, 2021 by Dimitar Kostadinov

Text-to-speech (TTS) synthesis is the computer’s way of transforming text into audio. Most popular AI-driven personal assistants rely on TTS software to generate speech that sounds as natural as possible. The process can be automated once the computer performs TTS "fluently" by pulling together words and phrases from pre-recorded files.
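
The idea of pulling together pre-recorded words can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample rate, the two-word clip library and the sine tones that stand in for real recordings. Production TTS systems are far more elaborate.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def tone(freq_hz, millis=100):
    """Stand-in for a pre-recorded clip: a short sine tone."""
    n = SAMPLE_RATE * millis // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical clip library: one "recording" per known word.
CLIPS = {
    "hello": tone(440),
    "world": tone(660),
}

def synthesize(text, gap_millis=50):
    """Join the stored clip for each word, separated by short silences."""
    gap = [0.0] * (SAMPLE_RATE * gap_millis // 1000)
    audio = []
    for word in text.lower().split():
        if word not in CLIPS:
            raise KeyError(f"no recording for {word!r}")
        if audio:
            audio.extend(gap)  # silence between words, not before the first
        audio.extend(CLIPS[word])
    return audio

samples = synthesize("Hello world")
print(len(samples) / SAMPLE_RATE, "seconds of audio")
```

A system built this way can only say words it has recordings for, which is one reason neural approaches that generate audio directly have become attractive.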

How is voice impersonation technology used?

Google voice cloning and generative adversarial networks

Voice cloning is a line of AI research from Google that allows a computer to read messages out loud in any voice. The system requires two inputs:

  1. A text to be read
  2. A sample of the voice

Generative adversarial networks (GANs) can capture and modulate a voice signal's audio properties. GANs, together with other deep generative models such as WaveNet (an autoregressive audio model from Google's DeepMind, not itself a GAN), are used to create media that mimic voices and facial expressions to the point where they become almost indistinguishable from how the impersonated person sounds and looks.

As a rule of thumb, voice-modeling technology improves the more voice data you feed it. Nevertheless, advanced neural networks do not always need a large dataset of recorded audio to pre-train the model.

Lyrebird AI

Tech companies such as the Canadian Lyrebird strive to design an AI system that can mimic a human voice convincingly by analyzing related speech recordings and the corresponding text transcripts. Lyrebird's system relies on the deep learning capabilities of its artificial neural networks to transform bits of sound into speech. 

Once the system has learned how to generate speech, it can calibrate its settings to resemble any voice after processing a one-minute sample of someone's speech. At the moment, that speed comes with trade-offs: a buzzing noise accompanies the generated voice, and the speech has a slight but noticeable robotic quality, with no trace of the physical cues common in natural speaking, such as breathing and mouth movement.

Other voice impersonation technologies

The open-source Real-Time Voice Cloning toolbox, hosted on GitHub, promises to replicate anyone's voice from as little as five seconds of sample audio. Adobe already has a prototype platform called Project VoCo, which, after listening to 20 minutes of sample audio, can edit human speech the same way Photoshop modifies digital images.

Synthesized voices of Donald Trump, Barack Obama and Hillary Clinton, infused with emotion, are all the proof we need that this technology is heading in a direction that may exacerbate security and privacy problems before it delivers anything positive.

To understand how good the voice impersonation technology has become, you could check out a demo by an AI company called Dessa, in which they used text-to-speech deep learning techniques to recreate Joe Rogan's voice. Facebook engineers also created a machine learning system named "MelNet" that successfully replicated the voices of famous people, public speakers and other participants.

Security and privacy concerns related to voice impersonation

"Compared to text, voice is just much more natural and intimate to us," said Timo Baumann, a researcher who works on speech processing at the Language Technologies Institute at Carnegie Mellon University. We now live in a world where whoever has a digital imprint of your voice can master its impersonation at will.

Consider also that every device equipped with a voice assistant is pre-programmed to listen quietly for a "wake word" to emerge out of a continuous stream of audio. This process is also typical of how a voice is fed to a machine learning model.
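
The always-listening pattern can be sketched as a rolling window over the incoming stream. This is a simplified illustration, not how any particular assistant works: the frames here are already-recognized words rather than raw audio, and the wake phrase and function name are hypothetical.

```python
from collections import deque

WAKE_PHRASE = ("hey", "assistant")  # hypothetical wake phrase

def detect_wake_phrase(frames, phrase=WAKE_PHRASE):
    """Yield the index of each frame that completes the wake phrase."""
    # Rolling window: only the most recent frames are kept, older ones drop out.
    recent = deque(maxlen=len(phrase))
    for index, frame in enumerate(frames):
        recent.append(frame)
        if tuple(recent) == phrase:
            yield index

stream = ["what", "a", "nice", "day", "hey", "assistant", "play", "music"]
hits = list(detect_wake_phrase(stream))
print(hits)
```

Note that every frame passes through the window whether or not it belongs to the wake phrase, which is the root of the privacy concern: the device has to hear everything in order to recognize the one phrase it is waiting for.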

Ambient voice technology, a hands-free, AI-based approach, has practical applications in healthcare, where medical specialists record their verbal interactions with visiting patients. Doctors, among others, see it as an excellent tool for easing the daunting, bureaucratic task of typing up the information needed for documentation related to medical records.

However, the question of privacy remains a hot potato in these situations. Given that AI is being used in retail environments more often than ever, there is an ongoing discussion about transparency: Should customers be notified when they engage with AI?

In addition, voice impersonation may entail some serious negative consequences, which are for the most part security-sensitive:

  • May confuse voice-based verification systems
  • May bring into question the integrity of real-time video in live streams
  • May render audio and video recordings unusable as court evidence

While automatic speaker verification systems are good at detecting human imitation, they often fail to spot more advanced machine-generated voice impersonation attacks.
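
One common way to frame speaker verification is as a similarity check between voice "embeddings" (numeric fingerprints of a voice). The Python sketch below assumes that framing. The three-number vectors and the 0.8 threshold are invented for illustration, whereas real systems derive much longer embeddings from audio with a trained model. It shows why a sufficiently close synthetic voice can pass a check that a human imitator fails.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled, candidate, threshold=0.8):
    """Accept the candidate if its embedding is close enough to the enrolled one."""
    return cosine_similarity(enrolled, candidate) >= threshold

# Made-up embeddings, chosen only to illustrate the two outcomes.
enrolled_speaker = [0.9, 0.1, 0.4]
human_imitator = [0.2, 0.9, 0.3]     # sounds similar to a person, far apart numerically
cloned_voice = [0.85, 0.15, 0.45]    # synthetic voice modeled on the real speaker

print("imitator accepted:", verify(enrolled_speaker, human_imitator))
print("clone accepted:", verify(enrolled_speaker, cloned_voice))
```

Under these assumed numbers the imitator is rejected and the clone is accepted. The check measures closeness to the enrolled voice, not whether a live human produced the audio.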

No wonder that Lyrebird's founders (three university students from the University of Montréal) openly admitted, "This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else."

Real-life implications of ML-based voice impersonation

Impersonation is the largest scam category reported to the FTC, with more than 647,400 complaints in 2019 alone. Some of these cases are AI-related.

Voice fraud as a whole is becoming more common. One report stated that it has increased by 350% in the past few years. Another study predicts that as many as 50% of all mobile calls made on U.S. soil by next year will be fraudulent.

Social engineers have many sources to draw on: voicemail greetings, social media, data breaches, visited websites and more. Companies tend to release recordings of their high-ranking employees' actual voices, a practice that, unfortunately, gives attackers the material they need to create a fake recording of upper management. Seventy-five percent of targeted victims report that the bad actors already had some personal information about them. Scammers also use additional techniques to deceive the victim, such as spoofing area codes so that the call appears to come from the area the victim expects.

At least three recent attacks have taken advantage of deepfake voices to swindle companies out of millions of dollars. In one case, the loss was $10 million, according to Symantec CTO Hugh Thompson.

A common scenario: "Please transfer money to this person. It is urgent."

The story of Gary Schildhorn, a 67-year-old lawyer, is indicative of how voice impersonation is used for social engineering. While driving to work, Schildhorn received a phone call from his son, or at least from someone who sounded just like his son.

"It was his voice, his cadence, using words that he would use," said the lawyer. The crying voice on the phone explained he had been in an accident and needed $9,000 to pay for a public defender. Ten minutes later, Schildhorn received another call from someone who claimed to be his son's lawyer, a move meant to make the swindle more convincing. Schildhorn almost reached his bank to order the payment, but before that, he called his daughter-in-law, who alerted his son's workplace; eventually, his son called to tell him not to pay because it was a scam.

Perhaps the most famous case of an AI-based voice impersonation social engineering attack is when the CEO of a UK energy firm wired €220,000 ($243,000) because he thought he was speaking on the phone with his boss. By his own account, he recognized his boss's German accent and the melody of his voice.

Experts say that voice impersonation and deepfake attacks are the logical evolution of the business email compromise scam, in which scammers impersonate company executives via email. David Thomas, CEO of identity verification company Evident, told Threatpost that "it's no longer enough to just trust that someone is who they say they are. Individuals and businesses are just now beginning to understand how important identity verification is. Especially in the new era of deep fakes."

Although somewhat burdensome, dual custody is a measure that might help prevent these kinds of fraud. Whenever a transaction exceeds a specific size, two or three co-signatories should be required before it goes through, so that a single convincing phone call is not enough to move the money.
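
A dual-custody rule can be sketched in a few lines of Python. The threshold, the number of required approvers, the class name and the rule that requesters cannot approve their own transfers are all assumptions chosen for illustration; a real payment workflow would set these according to company policy.

```python
from dataclasses import dataclass, field

APPROVAL_THRESHOLD = 10_000  # hypothetical amount above which co-signing applies
REQUIRED_APPROVERS = 2       # hypothetical; the policy could also require three

@dataclass
class Transfer:
    amount: float
    requested_by: str
    approvals: set = field(default_factory=set)

    def approve(self, approver):
        """Record a co-signature; requesters may not approve their own transfer."""
        if approver == self.requested_by:
            raise ValueError("requester cannot approve their own transfer")
        self.approvals.add(approver)  # a set ignores repeated approvals

    def can_execute(self):
        """Small transfers pass; large ones need enough distinct co-signers."""
        if self.amount <= APPROVAL_THRESHOLD:
            return True
        return len(self.approvals) >= REQUIRED_APPROVERS

transfer = Transfer(amount=220_000, requested_by="ceo")
print(transfer.can_execute())   # no approvals yet
transfer.approve("cfo")
transfer.approve("controller")
print(transfer.can_execute())   # two distinct co-signers recorded
```

The point of the design is that an urgent voice on the phone can pressure one person, but the transfer still stalls until independent colleagues, who did not take the call, sign off.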

Voice engineering and its threat

Voice impersonation is another technology that blurs the line between the physical world and cyberspace.

It may not be pervasive yet, but it is a disturbing technology that raises ethical questions about misuse on a much larger scale in the near future.

Just as people cannot entirely trust that an image has not been doctored with programs such as Photoshop, they should learn not to entirely trust the voices they hear through their electronic devices.


Dimitar Kostadinov

Dimitar Kostadinov applied for a 6-year Master’s program in Bulgarian and European Law at the University of Ruse and was enrolled in 2002, following high school. He obtained a Master’s degree in 2009. From 2008 to 2012, Dimitar worked in data entry and research for the American company Law Seminars International and its Bulgarian-Slovenian business partner DATA LAB. In 2011, he was admitted to the Law and Politics of International Security program at Vrije Universiteit Amsterdam, the Netherlands, graduating in August 2012. Dimitar also holds an LL.M. diploma in Intellectual Property Rights & ICT Law from KU Leuven (Brussels, Belgium). Besides legal studies, he is particularly interested in the Internet of Things, Big Data, privacy & data protection, electronic contracts, electronic business, electronic media, telecoms, and cybercrime. Dimitar attended the 6th Annual Internet of Things European summit organized by Forum Europe in Brussels.