How to Record Multilingual Voice-Over Sets for Voice Recognition

October 25, 2017

Voice recognition is a key component of the next generation of technology, from virtual reality and artificial intelligence, to the home and mobile assistants that people use every day. And the surge in voice recognition has required creating large sets of voice-over audio recordings, which are used to “teach” systems to recognize speech. As voice recognition support in foreign languages has surged, so has the need for multilingual audio sets – and naturally, developers are turning to audio & video localization professionals to create them.

This post will list what you need to know to record the foreign-language audio sets used to calibrate voice recognition systems.

[Average read time: 4 minutes]

Voice recognition is ubiquitous

The technology is used on everything from Siri to Cortana, to automated phone systems, to the more cutting-edge virtual reality. In fact, it’s a kind of raison d’être for some products – without voice recognition, the home assistants put out by Google, Amazon and Apple would just be Bluetooth speakers, though very high-quality ones. The technology is advancing at a staggering pace, especially in English – in fact, Microsoft announced in August 2017 that its speech recognition technology can now transcribe conversational speech about as well as humans, with some caveats. (More on Microsoft’s Research Blog.)

How did they do this? In part, by calibrating their system against a massive bank of audio recordings of conversational human speech, gathered from a wide range of voices. Producing these recordings was crucial to the development of voice recognition systems in English. It is equally crucial to the development of high-quality systems in foreign languages, which is already well underway in some locales. That’s where multimedia localization professionals come in.

Recording the voice-over audio “training” sets

The workflow is quite different from recording professional foreign-language voice-over for corporate, e-Learning, marketing or entertainment projects. But some of the principles – like rigorously checking for non-native accents – are still the same. Let’s look at the process in depth.

1. Go non-pro.

For the recordings to be useful for voice recognition specifically, they have to be an accurate sampling of how people speak. A professional voice talent’s perfect diction, phrasing and microphone technique will actually be detrimental in this case. So it’s crucial to cast non-professionals – often people who have never been inside a recording booth.

This means that voice casting has to happen in publications and boards that may not be directed specifically at actors – for example, Craigslist instead of Backstage. It also means that the recordings will take longer than they would with professional voice actors.

2. Get native speakers – but also look at market segments.

It’s somewhat obvious, but bears repeating – you must source native speakers of each particular language. If you’re creating a Japanese voiceover audio set, for example, the presence of non-native accents may throw off the system calibration – meaning the recordings are useless.

3. Get accent spread for locale, but look at populations first.

These recordings should also reflect the overall speaking patterns of a population. If you’re creating a Spanish voiceover recording set for Mexico, for example, you won’t want all of your speakers to be from one region of the country. Rather, you’ll want audio that represents the linguistic variation of each locale as closely as possible – in some locales this will mean a wider accent spread, while in some a much narrower one.

4. That said – look at non-native populations if they are demographically significant.

Getting the proper accent spread may mean recording non-native speakers, especially if your locale has a significant number of them. For example, Los Angeles has very large populations of non-native speakers, and creating an English voiceover recording set for this region may benefit from including these accents. Again, it all depends on the linguistic variation of a particular locale, as well as the client’s user base, which you want to capture in all its diversity.
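
One way to turn the accent-spread guidance above into a casting plan is to split the available speaker slots in proportion to each accent group's share of the target population. The following is only a minimal sketch: the function name, the group labels and the weights are all made up for illustration, and the real proportions should come from the client's user base and demographic data for the locale. It uses the largest-remainder method so that the per-group counts always add up to the total.

```python
from math import floor


def allocate_speakers(shares: dict[str, float], total: int) -> dict[str, int]:
    """Split `total` speaker slots across accent groups in proportion to `shares`.

    Uses the largest-remainder method so the counts sum exactly to `total`.
    """
    weight_sum = sum(shares.values())
    exact = {group: total * weight / weight_sum for group, weight in shares.items()}
    counts = {group: floor(value) for group, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining slots to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for group in by_remainder[:leftover]:
        counts[group] += 1
    return counts


# Hypothetical weights for illustration only, not real demographic data.
example_shares = {"region_a": 5, "region_b": 3, "region_c": 2}
print(allocate_speakers(example_shares, 24))
# prints {'region_a': 12, 'region_b': 7, 'region_c': 5}
```

The same function works for non-native accent groups: add them as extra keys with weights that reflect how common they are in the locale.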


5. Hire a fully native & bilingual director to oversee the casting

You can’t do the previous three items without a native-speaker director from each locale, one who’s also fully fluent in English. They’ll be able to pinpoint not only the native accents, but also the variations in certain locales. Directors should be standard on any multimedia localization project – naturally, JBI Studios provides one on every recording.

6. Prepare for non-standard recording setups and requirements

Some clients will require you to replicate a particular scenario – for example, patching the recordings through a mic setup that mirrors how users interact with their products. Some may want you to record into the product itself – for example, into an app online – and still patch it through a professional recording studio. And many clients require data collection – getting biographical and demographic information from your subjects, and tying it to their voice recording as metadata.

No matter the client, be prepared for additional work during casting, studio setup and recording.
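
For the data-collection requirement, one common pattern is to store each speaker's details in a small "sidecar" file that shares its base name with the audio file. The sketch below assumes a JSON sidecar, and every field name in it is hypothetical. Clients typically specify their own schema and delivery format, so treat this as an illustration of the idea rather than a required layout.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class SpeakerRecord:
    # All field names are illustrative; use the schema your client specifies.
    speaker_id: str
    locale: str
    native_speaker: bool
    region: str
    age_range: str
    audio_file: str


def write_sidecar(record: SpeakerRecord, out_dir: Path) -> Path:
    """Write the record as JSON, named after the audio file it describes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sidecar = out_dir / (Path(record.audio_file).stem + ".json")
    text = json.dumps(asdict(record), indent=2, ensure_ascii=False)
    sidecar.write_text(text, encoding="utf-8")
    return sidecar
```

A call such as `write_sidecar(SpeakerRecord("spk_0001", "es-MX", True, "region_a", "25-34", "spk_0001_take01.wav"), Path("deliverables"))` would produce `deliverables/spk_0001_take01.json`. Using a speaker ID rather than a name keeps the biographical data separate from anything that identifies the person directly.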

Address client requirements & test your workflow

Take the time to understand your customer’s specific needs before you begin sourcing voices and recording – on everything from linguistic variation, to file format, to audio recording and post-production specs. Voice recognition technology is relatively recent, and while there are general best practices for the creation of voice sets, every developer calibrates their system differently. Moreover, each language will have unique requirements in terms of locale, accent spread, prevalence of non-native accents, and a host of other factors. It’s imperative to test workflows and recording setups, and get feedback from a client implementation run-through. Thorough preparation is the key to multimedia project success in general, but when recording audio sets for voice recognition, it’s the only way to make sure your clients will be able to use the audio that you’ve spent weeks, or even months, producing.
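
Part of testing the workflow can be automated. As a rough sketch, the script below uses Python's standard `wave` module to compare a WAV file's header against a delivery spec and report any mismatches. The `AudioSpec` class and the spec values in the usage note are assumptions for illustration, since the real sample rate, channel count and bit depth come from the client.

```python
import wave
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSpec:
    # Hypothetical delivery spec; real values come from the client.
    sample_rate_hz: int
    channels: int
    sample_width_bytes: int  # 2 bytes = 16-bit PCM


def check_wav(path: str, spec: AudioSpec) -> list[str]:
    """Compare a WAV file's header against the spec and list any mismatches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != spec.sample_rate_hz:
            problems.append(
                f"sample rate {wav.getframerate()} Hz, expected {spec.sample_rate_hz} Hz"
            )
        if wav.getnchannels() != spec.channels:
            problems.append(
                f"{wav.getnchannels()} channel(s), expected {spec.channels}"
            )
        if wav.getsampwidth() != spec.sample_width_bytes:
            problems.append(
                f"{wav.getsampwidth() * 8}-bit samples, expected {spec.sample_width_bytes * 8}-bit"
            )
    return problems
```

For example, `check_wav("take01.wav", AudioSpec(16000, 1, 2))` returns an empty list when the file is 16 kHz, mono, 16-bit, and a list of readable mismatch messages otherwise. The `wave` module reads uncompressed PCM WAV files, so other delivery formats would need a different tool.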
