Voice Recognition vs. Speech Recognition • Link Electronics

The terms speech recognition and voice recognition are popping up more and more frequently in news articles and social media. The development of these technologies has given us tools and digital assistants like Amazon’s Alexa, Microsoft’s Cortana, and Apple’s Siri, which have made content more accessible to everyone. While the two terms are often used interchangeably, there are key differences between them that are essential to understand.

Voice Recognition

With training, voice recognition software can identify one specific voice. The training process usually has the user speak a series of phrases, for example, “I went to the store to pick up apples, oranges, and bananas.” The software uses these phrases to learn to recognize the speaker, their inflections, and their tone of voice. This process is what most digital assistants and voice-to-text apps use.
The software works because:

There is only one speaker
The tasks it is asked to accomplish are limited in scope
The digital assistant can ask the user to repeat the phrase
It can infer meaning, even if it misses a few words

We’ve all seen or experienced firsthand the often funny, albeit frustrating, voice-to-text features and functions on our phones. While not all voice recognition systems are the same, there is still a long way to go before you don’t have to worry about calling your ex “Tom” when you meant to call “Mom.”

In captioning, voice recognition software can be used by a shadow speaker. A shadow speaker is someone who trains a voice recognition program to transcribe audio into captions. The shadow speaker listens to a live audio feed, dictates that audio using a specialized microphone, and then sends the transcribed audio back to an encoder for broadcast. There are some issues with this method of captioning:

If the shadow speaker is sick or their voice changes, the accuracy falls dramatically.
If the shadow speaker cannot keep up with the live audio feed or can’t understand what is being said, the audio isn’t captioned.

Speech Recognition

Where voice recognition learns a specific voice, speech recognition software identifies speech itself. Using speech pattern algorithms and language models, speech recognition can transcribe any speaker without the software being trained on their specific voice. For the highest accuracy, high-quality audio is necessary. To achieve this:

Only one speaker should speak at a time.
There should be minimal background noise.
A high-quality microphone is needed.

One application of automatic speech recognition, or ASR, is automatic captioning, such as the ACE series from Link Electronics. The system uses a state-of-the-art linguistics algorithm to turn speech into caption data. It typically achieves accuracy rates of 95% unless the audio quality is poor. Audio quality suffers when:

There is loud background noise.
The speaker is mumbling or covering their mouth.
There is more than one speaker at a time.
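
The article does not say how the 95% accuracy figure is measured. One common way to score a transcript is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the transcript into the reference text, divided by the number of words in the reference. The Python sketch below is only an illustration under that assumption; the function name word_error_rate and the sample sentences are hypothetical and are not part of the ACE series or any Link Electronics product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate of `hypothesis` against `reference`.

    Illustrative only: words are compared case-insensitively after
    splitting on whitespace, and punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word was dropped
                curr[j - 1] + 1,                  # extra word was inserted
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "i went to the store to pick up apples oranges and bananas"
    captioned = "i went to the store to pick up apples oranges and bandanas"
    wer = word_error_rate(spoken, captioned)
    print(f"WER: {wer:.3f}, accuracy: {1 - wer:.1%}")
```

Under this measure, accuracy is roughly one minus the WER, so a 95% accuracy rate would correspond to about one wrong word in every twenty.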

While ASR might not be 100% perfect, it allows the use of speech recognition technology without the user spending time training the software. There are ways of improving the accuracy of ASR. For example, with customization, the ACE series can automatically identify individual speakers such as anchors and reporters with a speaker ID. Link Electronics can also create a custom language model to more accurately caption specific words, phrases, names of places, or people unique to your locale.

The Future Is All Around Us

Voice recognition and speech recognition are making their mark in our technology-driven world. According to Gartner, 75% of households in the US will have smart speakers by 2020. That’s an estimated 68% increase since 2017. The market for this industry is estimated to reach nearly $32 billion by 2025. And it isn’t just in smart speakers or phones that this technology is used. Audio-to-text software is growing in the healthcare industry, educational facilities, the financial industry, the government, and television broadcast. The potential for this technology is exciting. It opens up access to what so many of us take for granted: the ability to hear and see. That can be as simple as following what is going on in your favorite TV show or as important as understanding the emergency weather reports for the storm heading your way. It may not seem like much to some, but speech and voice recognition have been and will continue to be an integral part of our future.