Engineering speech recognition from machine learning

July 30, 2021 by Dimitar Kostadinov

Speech recognition lets users communicate with machines (e.g., computers, smartphones and home assistants) by voice and lets those machines respond to spoken input.

To work correctly, a piece of software like this should be able to “transcribe” all complexities inherent in human speech, such as voice rhythm, length of speech and intonation.

Speech recognition is also known as “automatic speech recognition” (ASR), “computer speech recognition” or “speech to text” (STT). Voice recognition is a closely related technology that can be used for the biometric identification of specific users.

Emotions are what distinguish humans from robots, so a person's voice conveys both a semantic message and some emotion. Speech emotion recognition (SER) is a type of speech recognition whose purpose is to establish a speaker’s underlying emotional state by analyzing their voice. SER has many applications, including web-based e-learning, audio surveillance, call centers, computer games and clinical studies. Popular apps such as Amazon’s Alexa, Apple’s Siri and Google Maps employ speech recognition.

Machine learning (ML) software can make measurements of spoken words through a set of numbers that represent the speech signal.
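
As a rough illustration of what "a set of numbers" can mean, the sketch below synthesizes a short tone as a stand-in for recorded speech and summarizes each 25 ms frame by its mean energy. The sample rate, frame sizes and function names are assumptions chosen for the example, not values from any particular product.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed; a common rate for speech audio)

def synth_tone(freq_hz, duration_ms, amplitude=0.5):
    """Stand-in for recorded speech: a pure tone as a list of floats in [-1, 1]."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]

def frame_energies(samples, frame_len=400, hop=160):
    """Cut the signal into overlapping frames and return the mean energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

signal = synth_tone(440.0, 100)     # 100 ms of a 440 Hz tone: 1,600 numbers
energies = frame_energies(signal)   # one number per 25 ms frame, taken every 10 ms
```

Real systems compute richer per-frame descriptions than energy, but the principle is the same: the waveform becomes a sequence of numeric vectors that a model can learn from.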

Key challenges in automating speech recognition

Since emotions are subjective, emotion detection is a challenging task. Mel-frequency cepstral coefficients (MFCCs) are the most popular representation of a voice signal’s spectral properties since they take into account how sensitive human perception is to different frequencies.
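
To make the "human perception" part concrete, here is a minimal sketch of the mel scale that MFCCs are built on. It uses one widely used variant of the Hz-to-mel formula, which is an assumption on our part since the article does not specify one, and it covers only the frequency warping, not the full MFCC pipeline (spectrum, filterbank, logarithm and cosine transform). The function names are hypothetical.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` (>= 2) frequencies evenly spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (count - 1)
    return [mel_to_hz(low + i * step) for i in range(count)]

# Ten band centers between 0 and 8 kHz: dense at low frequencies, sparse at high ones.
centers = mel_spaced_frequencies(0.0, 8000.0, 10)
```

On this scale, the step from 100 Hz to 200 Hz counts for far more than the step from 4,000 Hz to 4,100 Hz, which mirrors how listeners tend to perceive pitch differences.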

Latency is a key challenge in speech recognition. ML would need to predict words correctly in real time to complete logically whole sentences. Bi-directional recurrent neural networks are deep learning models that benefit from having access to an entire sentence because of the added context. Limiting that context in the model structure, so that the network sees only a short amount of information following a specific word, can reduce latency.
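
The trade-off can be put into numbers with a small sketch. It assumes audio arrives as one frame every 10 ms and compares how long a model must wait for future audio before it can label a frame, with full-sentence context versus a fixed lookahead. The frame shift, the lookahead of five frames and the function names are illustrative assumptions.

```python
FRAME_SHIFT_MS = 10  # assumed time step between consecutive audio frames

def last_frame_needed(t, num_frames, lookahead=None):
    """Index of the last frame that must arrive before frame t can be labeled.

    lookahead=None models a bidirectional network that reads the whole utterance.
    """
    if lookahead is None:
        return num_frames - 1
    return min(num_frames - 1, t + lookahead)

def wait_ms(t, num_frames, lookahead=None):
    """Time spent waiting for future audio after frame t has been spoken."""
    return (last_frame_needed(t, num_frames, lookahead) - t) * FRAME_SHIFT_MS

num_frames = 300                                   # a 3-second utterance
full_context = wait_ms(0, num_frames)              # waits for the end of the utterance
limited = wait_ms(0, num_frames, lookahead=5)      # waits for only five more frames
```

Under these assumptions the first frame waits 2,990 ms with full context but only 50 ms with the limited lookahead, at the cost of less context per prediction.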

Speech speed is a significant issue in speech recognition since the same word can be pronounced very quickly or in a slow, drawling way.

Speech carries three essential types of features:

  • Lexical features (the vocabulary used): these require a transcript, that is, text extracted from the speech
  • Visual features (the expressions the speaker makes): these require access to video of the conversation
  • Acoustic features (sound properties like pitch, tone, jitter etc.)

Bias is also a well-known problem in artificial intelligence (AI). Stanford researchers found that automated speech recognition systems made twice as many errors when interpreting words spoken by African Americans as when interpreting the same words spoken by white speakers.
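
Comparisons like this are commonly expressed as a word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the system's output into the reference transcript, divided by the number of reference words. The sketch below is a plain edit-distance implementation; the function name and the example sentences are made up for illustration and are not data from the Stanford study.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

reference = "turn on the kitchen lights"
wer_a = word_error_rate(reference, "turn on the kitchen lights")  # perfect transcript
wer_b = word_error_rate(reference, "turn in the kitchen light")   # two wrong words
```

Computing the metric separately for each group of speakers is what makes a gap between groups visible.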

As always, data is the fuel

Feeding sound waves into a computer is the first step in speech recognition, and that means turning sounds into bits. For the engineering of speech recognition to be successful, the process should include the following steps:

  1. Access to a reliable speech database or voice sample collection.
  2. Extracting effective features: this improves the learning capabilities of the algorithm by reducing the number of features that characterize the dataset.
  3. Using ML algorithms to create reliable classifiers: ML algorithms can learn from training samples to classify new observations.
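
The three steps above can be sketched end to end on toy data. In this minimal example, synthetic tones stand in for a voice sample collection, each waveform is reduced to two features (mean energy and zero-crossing rate), and a nearest-centroid rule plays the role of the classifier. All names, labels and parameter values are hypothetical, and real systems use far richer features and models.

```python
import math

SAMPLE_RATE = 16_000  # assumed sampling rate for the toy recordings

def tone(freq_hz, n=1600, amplitude=0.5):
    """Step 1 stand-in: a synthetic 'recording' instead of a real voice sample."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]

def extract_features(samples):
    """Step 2: reduce thousands of samples to (mean energy, zero-crossing rate)."""
    energy = sum(x * x for x in samples) / len(samples)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(samples) - 1))

def train_centroids(labeled_recordings):
    """Step 3a: learn one average feature vector per label from training samples."""
    grouped = {}
    for samples, label in labeled_recordings:
        grouped.setdefault(label, []).append(extract_features(samples))
    return {label: tuple(sum(col) / len(col) for col in zip(*feats))
            for label, feats in grouped.items()}

def classify(samples, centroids):
    """Step 3b: assign a new observation to the label with the nearest centroid."""
    features = extract_features(samples)
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

training = [(tone(180), "low"), (tone(220), "low"),
            (tone(1900), "high"), (tone(2100), "high")]
centroids = train_centroids(training)
prediction_low = classify(tone(250), centroids)
prediction_high = classify(tone(1800), centroids)
```

The feature step is what keeps the classifier small: it compares two numbers per recording rather than 1,600 raw samples.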

Deep learning has finally made speech recognition accurate enough to use in everyday environments. Remember that a voice recognition system must operate smoothly, or it will not be viable.

In reality, everyone who wants to construct a workable voice recognition system needs to use a lot of training data. According to Soniox founder and CEO Klemen Simonic, “Google and Facebook have more than 50,000 hours of transcribed audio. One has to invest millions — more like tens of millions — of dollars into collecting transcribed data. Only then [can one] train a speech recognition AI on the transcribed data.”  

Speech recognition projects from Facebook and Microsoft used between 13,100 and 65,000 hours of labeled and transcribed speech data. Mozilla’s Common Voice, a public voice database, has some 9,000 hours of recordings.

Speech recognition software: For good and bad

Data is the new oil or gold. You’ve heard that, right? It is not surprising that Google Now or Siri comes cheap or even free: the goal is to make people use them as much as possible so that the speech data they submit, wittingly or not, is recorded forever. It could then be reused as training data for speech recognition algorithms. Indeed, access to tons of personal speech data is what separates world-class speech recognition systems from more specific private speech recognition projects.

In addition, a common assumption is that one can get a pile of data, feed it to an ML algorithm and nurture a top-notch AI system. That may work sometimes, but not for speech. Machines run into all kinds of trouble when recognizing speech: background noise, echo, poor-quality microphones, different accents and more.

Soniox claimed that by using large quantities of unlabeled audio and text to teach its algorithms to distinguish speech with accents from background noises, its speech recognition system could accurately convert 24% more words than other speech-to-text systems.

Cybercriminals could take Soniox as an example of how to use “unknown” data, for which no predefined label exists, to train their own speech recognition ML software to classify and process the data and learn its structure.

Semi-supervised learning is a technique where partially labeled data can be fed to ML to produce state-of-the-art results in speech recognition. However, to fine-tune a particular model on a specific dataset, social engineers would still need an actual transcribed audio dataset.
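
One simple form of semi-supervised learning is self-training, also called pseudo-labeling. The article does not say which method Soniox uses, so the sketch below is only a generic illustration: a tiny classifier trained on a few labeled values labels the unlabeled values it is confident about, then retrains on the enlarged set. The one-dimensional features, labels, margin and function names are all assumptions, and the sketch expects at least two distinct labels.

```python
def centroids(labeled):
    """Mean feature value per label; `labeled` is a list of (value, label) pairs."""
    groups = {}
    for value, label in labeled:
        groups.setdefault(label, []).append(value)
    return {label: sum(values) / len(values) for label, values in groups.items()}

def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    """Grow the labeled set with confident pseudo-labels; return (centroids, leftovers)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        confident, remaining = [], []
        for x in pool:
            ranked = sorted(cents, key=lambda label: abs(x - cents[label]))
            best, second = ranked[0], ranked[1]
            # Accept a pseudo-label only when the nearest centroid clearly wins.
            if abs(x - cents[second]) - abs(x - cents[best]) >= margin:
                confident.append((x, best))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = remaining
    return centroids(labeled), pool

seed = [(1.0, "speech"), (9.0, "noise")]   # the small hand-labeled set
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]      # the larger unlabeled set
model, leftovers = self_train(seed, unlabeled)
```

Here four of the five unlabeled values are absorbed into the training set, while the ambiguous one stays unlabeled; that leftover is the kind of case where transcribed data is still needed.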

Automated speech recognition security and privacy implications

Brett McDowell, executive director of the FIDO (Fast IDentity Online) Alliance, stated that “voice recognition is vulnerable to a presentation attack; where the adversary records a sample of the targeted user's physical characteristics and uses that to produce an imposter copy or ‘spoof’ of that user's biometrics. We leave fingerprints on most everything we touch, and both our images and voices are easily recorded without our knowledge or permission.”

An investigation by Raphael Satter for the Associated Press revealed how businesses, banks and even governmental institutions are quietly rolling out customer voiceprints. As the investigator explains, the term derives from the fact that every person’s voice is unique in a way similar to their fingerprints.

Security researchers from Zhejiang University in China invented a way to activate voice recognition systems without speaking a word, simply by employing high frequencies that are inaudible to humans but can be picked up by electronic microphones. You can prevent the attack by turning off wake phrases or restricting some functions on your devices.

Hello Barbie is a smart doll that utilizes the combined power of ML and AI to hold logical conversations with her users. Security researchers warned that Hello Barbie could act as a surveillance device by listening to a family's conversations. This is possible because speech is recorded via a microphone on the doll’s necklace and stored on the company ToyTalk’s servers, where natural language processing and advanced analytics make sense of it and select a reply from 8,000 lines of pre-recorded dialogue.

The ills of speech recognition

Speech recognition is ubiquitous; it is invading our lives because it is all around us: built into our phones, smartwatches, other smart objects and game consoles, and even automating our homes. These days, you could call a taxi by speaking the command out loud to your $50 Amazon Echo Dot.

Scholars see a great future ahead for speech recognition technologies in communications, health, security, tourism and more. For example, speech recognition programs are beneficial to people with visual and hearing disabilities; however, it remains to be seen whether the pros will outweigh the cons.


Dimitar Kostadinov

Dimitar Kostadinov applied for a 6-year Master’s program in Bulgarian and European Law at the University of Ruse, and was enrolled in 2002 following high school. He obtained a Master’s degree in 2009. From 2008 to 2012, Dimitar worked in data entry and research for the American company Law Seminars International and its Bulgarian-Slovenian business partner DATA LAB. In 2011, he was admitted to the Law and Politics of International Security program at Vrije Universiteit Amsterdam, the Netherlands, graduating in August 2012. Dimitar also holds an LL.M. diploma in Intellectual Property Rights & ICT Law from KU Leuven (Brussels, Belgium). Besides legal studies, he is particularly interested in Internet of Things, Big Data, privacy & data protection, electronic contracts, electronic business, electronic media, telecoms, and cybercrime. Dimitar attended the 6th Annual Internet of Things European summit organized by Forum Europe in Brussels.