We’re in the midst of a technological revolution powered by artificial intelligence, and voice is at the center of this sea change. In the 1920s and ’30s, the introduction of radio allowed voices to reach every home and changed what we listened to. Now voice technology allows individuals to control a multitude of devices through voice commands and is changing what we speak to.
These devices (smart speakers, thermostats, lights, cameras, etc.) also speak back to us. This technology has created a unique market: recording human voices for robots. Many companies across industries work with localization studios to record prompts and responses for their devices in multiple languages. Some also use voice recordings to teach their voice software to sound more human.
Voice-assisted devices generally use a combination of recorded prompts from a voice actor and a synthesized voice built with machine learning. In this blog post, we’ll go over some common practices and methods for recording voice for both types.
Certain devices have a limited selection of voice prompts and therefore don’t need machine learning to generate new sentences (see the next section). Instead, voice actors are hired to record all the prompts the device speaks when specific events occur (e.g., if a car gets too close to another car: “Warning, front bumper is too close to an object”).
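For these fixed-prompt devices, the logic often amounts to a simple table mapping each device event to a pre-recorded audio file. Here’s a minimal Python sketch of that pattern; the event names, file paths, and the play_audio stand-in are all hypothetical, not any particular device’s API:

```python
# Minimal sketch of fixed-prompt playback: every device event maps to one
# pre-recorded file, so no speech synthesis is needed at runtime.
# Event names and file paths are hypothetical.
PROMPTS = {
    "front_proximity_warning": "audio/en_US/front_bumper_warning.wav",
    "low_battery": "audio/en_US/low_battery.wav",
}

def play_audio(path: str) -> None:
    # Stand-in for the device's audio driver; a real device would stream
    # the file to its speaker.
    print(f"[playing] {path}")

def handle_event(event: str) -> None:
    path = PROMPTS.get(event)
    if path is None:
        return  # Unknown events stay silent; there is no synthesis fallback.
    play_audio(path)

handle_event("front_proximity_warning")
```

One nice side effect of this design: localizing the device becomes a matter of swapping the audio directory (e.g., en_US for de_DE) rather than changing any logic.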
In their early years, voice-assisted devices used mostly female voices, but they have since expanded to include male and gender-neutral voices in various accents and languages. Some companies have also launched collaborations with celebrities like Samuel L. Jackson and John Legend, putting their voices in smart devices to appeal to those celebrities’ fan bases. Others may not have the budget to record celebrity voices, but it is still important that they record voices that will reach their intended local audiences.
To make sure a voice-assisted device is localized properly, voice prompts should be recorded by native speakers of the target language with the appropriate accent. For instance, if the device will be used in various English-speaking countries, it is important to record English speakers with the standard accent for each target market: General American for the U.S., standard British (Received Pronunciation) for England, General Australian for Australia, and an Indian English accent for India, where pronunciation is typically shaped by speakers’ first languages. To give customers more options, it helps to record both male and female voices in each of these accents.
Another thing to keep in mind when recording is the company’s voice brand. Just as thought is given to the visual representation of the company through posters, logos, and other graphics, similar consideration should be given to how the company wants its brand to sound to customers.
For instance, if the device is a children’s toy, the voices are generally bright and cheerful (like Woody from Toy Story). If the device is a smart doorbell, it might help for it to sound professional and firm (e.g., “Please leave the package at the front door”). Knowing the personality of each voice will play a big role in how voice actors are cast and directed in the studio.
In speech synthesis, a computer reads words aloud through a text-to-speech feature. Synthetic voices are available in multiple languages and, once created, are cheap and fast to use for new recordings. However, without learning from human voices, they still sound robotic. To mitigate that artificiality, voice actors are hired to record approximately 10-20 hours of audio, which is then analyzed by speech-learning software. The audio recorded by the voice actors can be a mix of book passages, news articles, and nonsensical phrases like “oil your mills jewel weed today,” which cover a wide variety of vowel-consonant sounds that a computer can rearrange to create sentences the voice actor never recorded.
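To hear what an untrained, off-the-shelf synthetic voice sounds like before any of this human-recording work, here is a minimal sketch using the open-source pyttsx3 Python library, which drives whichever TTS engine is installed locally (which voices are available, and how robotic they sound, depends entirely on your system):

```python
# Minimal off-the-shelf text-to-speech sketch using the open-source
# pyttsx3 library; it drives whichever TTS engine the OS provides.
import pyttsx3

engine = pyttsx3.init()

# List the synthetic voices the local engine exposes.
for voice in engine.getProperty("voices"):
    print(voice.id)

# Slow the speaking rate slightly; many drivers default to ~200 words/minute.
engine.setProperty("rate", 160)

engine.say("Warning, front bumper is too close to an object.")
engine.runAndWait()  # Blocks until the utterance finishes playing.
```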
Machines use natural language processing and speech models to analyze human voice recordings and learn the particular sounds (phonemes) of a language, like the “p” sound in words such as “pad, pitch, punch.” Voice actors recording audio for this purpose therefore need to enunciate especially clearly so that the software can pick up each distinct sound.
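For a concrete look at what those distinct sounds are, here’s a small sketch using the open-source pronouncing library, a Python wrapper around the CMU Pronouncing Dictionary of American English; the ARPAbet transcriptions it returns are exactly the kind of phoneme inventory the software is trying to learn from the actor’s recordings:

```python
# Look up phoneme transcriptions (ARPAbet notation) with the open-source
# `pronouncing` library, a wrapper around the CMU Pronouncing Dictionary.
import pronouncing

for word in ["pad", "pitch", "punch"]:
    # phones_for_word returns a list of known pronunciations;
    # each of these words has a single entry in the dictionary.
    phones = pronouncing.phones_for_word(word)[0]
    print(f"{word}: {phones}")

# Output: each word starts with the same "P" phoneme, but the vowels
# and final consonants differ:
#   pad:   P AE1 D
#   pitch: P IH1 CH
#   punch: P AH1 N CH
```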
Recording sessions for machine learning can be grueling for the voice actor, director, and sound engineer. It’s important to take breaks when needed to keep the recordings clear and fresh. Scheduling goals also have to be communicated: how much content must be read, and in what amount of time. This helps the voice actor keep a consistent pace, neither too slow nor too fast, to get through a large amount of material efficiently without sounding rushed. A major issue with machine voices is their artificial rhythm, so making sure voice actors record with a natural cadence helps the software learn the language’s prosody (the rhythm of speech). A director should generally be in the studio to make sure the actor sounds natural.
The recording process is repeated with different male and female voice actors, giving companies multiple voice options and the ability to mash up voices into entirely new ones. For machine learning, the more recordings available, the more human-sounding the voices will be. Machine learning is not fully automated, however. Linguists and programmers still have to fine-tune the speech to give it that extra human touch. Their job is to help the software learn context (to “wind” up a toy vs. the “wind” is blowing) as well as emotional and social cues.
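One common way that kind of context gets communicated to a synthesizer is SSML (Speech Synthesis Markup Language), a W3C standard accepted by most major TTS services. Here’s a minimal sketch with the markup built as a Python string; the sentences are illustrative, and the synthesize call in the comment is a hypothetical placeholder, not any specific vendor’s API:

```python
# SSML (a W3C standard) lets text mark which pronunciation of a heteronym
# the synthesizer should use. The IPA hints below disambiguate "wind"
# (to wind up a toy) from "wind" (the weather).
ssml = """
<speak>
  Please <phoneme alphabet="ipa" ph="waɪnd">wind</phoneme> up the toy.
  The <phoneme alphabet="ipa" ph="wɪnd">wind</phoneme> is blowing outside.
</speak>
"""

# A real request would send this markup to a TTS service, e.g. a
# hypothetical: audio = tts_client.synthesize(ssml)
print(ssml)
```

Because the markup travels with the text, the model doesn’t have to guess the pronunciation from context alone; the linguists’ decisions are encoded directly in the script.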
When recording voice prompts for devices, use native speakers of the target language with the local accent, and make sure the voices reflect the company’s brand. Recording different genders will also give customers more options.
For speech synthesis: through a wide range of human recordings, software (with the help of a team of linguists and programmers) can learn the phonemes, accent, and context of a particular region’s language. These devices can then create human-like speech that the original voice actors never recorded.
This type of technology brings up many ethical questions that we have mentioned in previous blogs: do we really want our devices to sound human? Can this technology be used against us?
One compromise: in situations where it’s unclear whether you’re speaking with a device or a human, such as automated voice response systems, some companies include a disclosure like “I’m an automated voice system.”
As the technology develops, how we navigate these ethical waters will have to develop as well.
Download our free e-book below to see how text-to-speech can help your company save money.