The Rise of Voice Assistants and Smart Speakers


This past decade saw explosive growth in voice assistant technology. Voice assistants are software, such as Alexa, Google Assistant, and Siri, that can respond to voice commands or perform actions based on them. Smart speakers are hardware, like the Amazon Echo, Google Home, and Apple HomePod, with built-in voice assistants plus a microphone and speaker to listen to and play back sound.

Voice assistants use text-to-speech, voice recognition, and natural language processing (NLP) to recognize and respond to voice commands. The technology is creating new opportunities in the voice industry, but it is also supplanting some voice work that was once done only by humans. In this blog, we will look at the history of voice assistants, how they work and are used, and some of the ethical and economic implications they may pose for society.

[Average read time: 4 minutes]

[Image: space gray iPhone X]

A Short History of Voice Assistants

As we discussed in a previous text-to-speech blog, the first modern predecessor of the voice assistant/smart speaker was Bell Labs' “Audrey” in 1952, which could only recognize numbers. Today's voice assistants can recognize up to 95% of human speech (in U.S. English) and are readily available to consumers for $50 to $400.

Compare this to 1990, when the first consumer voice assistant, Dragon Dictate, cost $6,000. In 2001, Microsoft incorporated speech recognition into its Office XP software; however, its speech-to-text feature was still unable to differentiate close-sounding phrases like “recognize speech” versus “wreck a nice beach.”

Ten years later, in 2011, Apple introduced Siri on the iPhone 4S, which could read and reply to messages, schedule appointments, and more. This revolutionized how humans and electronic devices interacted. Though voice recognition and artificial intelligence had been around for a while, they lacked daily ease of use. Siri changed that by being readily available on a mobile phone at the touch of a button. However, as many early iPhone users remember, Siri still had trouble accurately responding to voice commands. But Siri was just the start: a slew of competing voice assistants entered the market over the next decade, such as Google Now in 2012 and Microsoft's Cortana in 2014.

The smart speaker revolution kicked off in 2015, when Amazon officially launched the Amazon Echo with its built-in voice assistant, Alexa. The Echo gave users a home device they could command, solely through voice, to play music, set timers, schedule reminders, and more. The Amazon Echo was a huge hit, and other companies soon entered the smart speaker market: Google Home with Google Assistant (2016) and Apple HomePod with Siri (2018).

Demand is driving the innovation and development of voice assistants and smart speakers: in 2019, an estimated 111.8 million Americans (roughly 1 in 3) used voice assistants regularly. That number is expected to jump to 122.7 million people, or 36.6% of the US population, by 2021.

[Image: photo of Amazon Echo Dot]

Recording Voice Assistant Voices and How They’re Used

Siri was originally voiced by voice actor Susan Bennett and was created through “concatenation,” a process in which a human voice recording is cut up into snippets that are then pieced together to form sentences.
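As a rough illustration of the idea (using a made-up snippet library and clip names, not Apple's actual pipeline), concatenation can be sketched like this: each word maps to a pre-recorded clip, and a sentence is just those clips played back-to-back.

```python
# Toy sketch of concatenative synthesis. Real systems stitch together
# recorded audio units (diphones, syllables, words); here each "snippet"
# is just a hypothetical filename standing in for a recorded clip.
snippet_library = {
    "turn": "clip_0417.wav",
    "left": "clip_0098.wav",
    "in":   "clip_0003.wav",
    "one":  "clip_0152.wav",
    "mile": "clip_0881.wav",
}

def concatenate(words):
    """Look up a pre-recorded clip for each word and splice them in order."""
    missing = [w for w in words if w not in snippet_library]
    if missing:
        raise KeyError(f"No recording for: {missing}")
    return [snippet_library[w] for w in words]

playlist = concatenate(["turn", "left", "in", "one", "mile"])
print(playlist)  # these clips would be played in sequence
```

The limitation is also visible in the sketch: the system can only say what was recorded, and sentences stitched from isolated snippets tend to sound choppy, which is why later systems moved toward fully synthesized voices.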

When Microsoft launched their virtual assistant in 2014, they decided to use the name “Cortana,” based on the holographic assistant from the Halo video game series. Microsoft hired the game's voice actress, Jen Taylor, to help develop Cortana's voice as well as provide original voice-over for key “chit chat” responses to Halo-related questions.

To develop a voice assistant's voice, a voice actor is hired to read roughly 10-20 hours of text aloud while the audio is recorded (though the number of hours needed is steadily decreasing). Then, through natural language processing and deep learning, a computer program analyzes the phonemes (distinct units of sound) and prosody (rhythm of speech) in the human recordings so that it can mimic the voice and create its own sentences. The result is a human-like artificial voice. Some voice assistants can respond in other languages, such as Mandarin Chinese, Japanese, and Spanish, which means voice actors who are native speakers of those languages are hired to record voice-over so the program can learn the distinct characteristics of each language.
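To make the phoneme analysis above concrete, here is a loose sketch (with a tiny hand-made pronunciation dictionary; real systems learn pronunciations and prosody from the recorded audio itself) of how a transcript can be broken into phonemes and their frequencies tallied:

```python
# Toy sketch: map recorded words to phonemes via a small hand-made
# pronunciation dictionary, then count how often each distinct unit
# of sound appears across the transcript.
from collections import Counter

pronunciations = {               # word -> phonemes (ARPAbet-style labels)
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "low":   ["L", "OW"],
}

def phoneme_counts(transcript):
    """Tally phoneme frequencies over a list of transcribed words."""
    counts = Counter()
    for word in transcript:
        counts.update(pronunciations[word])
    return counts

counts = phoneme_counts(["hello", "world", "low"])
print(counts["L"])  # "L" appears once in each of the three words -> 3
```

Counts like these are one reason the recording scripts run to many hours: the text must cover every phoneme (and common phoneme transitions) enough times for the model to reproduce them naturally.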

As voice assistants become more global, they’re also taking on more responsibilities: from turning on the lights, locking the doors, and buying gifts to offering customer service support or self-checkout at food and retail stores. These devices are becoming more enmeshed in our lives, whether we like it or not, and it seems that regulations and ethical conversations about this relatively new technology are still catching up.

[Image: robot holding frame]

Concerns Over Voice Technology

As AI voices become more and more human-like, the question becomes: should a robot sound like a human? As mentioned in our deepfakes blog, being tricked by an AI voice into thinking you're listening to a politician or speaking with a loved one is a very real threat. So far, legislation regulating AI and its uses is lagging behind the speed of development.

Privacy concerns over the potential recording of voices, particularly children's voices, without consent have caused some consumers to avoid voice assistants or steer toward more privacy-focused devices. Companies are also learning along the way, developing their own policies for reviewing voice recordings.

For now, the vast majority of voice assistants still sound robot-like, and there is a large market for human voice-over in commercials, audiobooks, e-learning, and much more. Listening to an AI voice for hours on end can lead to listener fatigue, and humans still connect better with a human voice.

But as more and more “voiceprints” (similar to fingerprints, but for voices) are recorded, analyzed, and replicated, voice actors are in a way feeding a technology that is learning their job, along with the (formerly) human work of other industries. We have already seen this in automated customer service phone systems and voice-enabled self-checkout, and human personal assistants may be next.

This past decade has seen great developments in voice technology. What will the new decade bring? We'll see. Hopefully we humans can keep up.

See how text-to-speech can save your company money. Click below for your free e-book!

Download “7 Myths of Audio & Video Translation,” JBI Studios’ indispensable guide to audio translation and dubbing.