Text-to-speech (or TTS) is one of the big news stories of 2017. It’s a crucial part of the success of Google Home and Amazon Echo, both voice-operated home assistants. Its maturation as a technology will be a key component of the development of artificial intelligence. And, of course, it’s going to radically change video translation services – in fact, it’s doing so already.
This post will look at how text-to-speech can be used for video translation, using video examples created by our team.
[Average read time: 4 minutes]
If you haven’t already, check out our previous post, “Can text-to-speech be used for video voice-over? Yes – we have proof.” It details the main challenge of using non-human TTS for video voice-over: TTS voices can’t adjust their performance and speed to match a video the way humans can. The post also discusses the production workflow required by TTS, which involves more post-production work but no studio sessions. We’ll use that post as a jumping-off point for today’s conversation, and the videos in today’s post are localized versions of the video in that post.
With that said, following are the localized videos. First, in Spanish for Latin America:
And second, in French for France.
Now that you’ve seen the videos, following are the special challenges of using TTS for video dubbing.
Part of the audio translation process is timing the foreign-language voice-over to the original English-language video. If you’re producing a Cantonese video translation project, the Cantonese voice-over has to line up with the original English video: it can’t run longer than the English video, for example, and it has to hit internal synchronization points.
Human VO talents have the ability to slow and speed up their delivery so that their audio lines up to video playback in a studio – they can literally watch the video as they record. Really great talents can even watch an actor’s lips move, and match them – this, of course, is lip-sync dubbing.
Needless to say, TTS voice fonts can’t do this, so any synchronization to video has to happen during post-production. In general, lines that run too long have to be sped up to fit to picture.
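To make the speed-up arithmetic concrete, here is a minimal Python sketch. It is not our production tooling. The `Line` fields, the function names, and the 1.15 review threshold are all hypothetical, chosen only for illustration. It computes how much a rendered TTS line would need to be sped up to fit the time available in the original video.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    tts_seconds: float   # duration of the rendered TTS audio (assumed known)
    slot_seconds: float  # time available for this line in the original video


def speed_factor(line: Line) -> float:
    """Playback-rate multiplier needed for the TTS audio to fit its slot.

    1.0 means the line already fits; values above 1.0 mean it must be sped up.
    """
    if line.slot_seconds <= 0:
        raise ValueError("slot_seconds must be positive")
    return max(1.0, line.tts_seconds / line.slot_seconds)


# Hypothetical threshold: above this, the speed-up may start to sound rushed.
MAX_COMFORTABLE_SPEEDUP = 1.15


def needs_review(line: Line, limit: float = MAX_COMFORTABLE_SPEEDUP) -> bool:
    """Flag lines whose required speed-up exceeds the chosen limit."""
    return speed_factor(line) > limit


if __name__ == "__main__":
    line = Line("Bienvenidos al curso.", tts_seconds=6.0, slot_seconds=5.0)
    print(f"speed factor: {speed_factor(line):.2f}, review: {needs_review(line)}")
```

A line that is shorter than its slot gets a factor of 1.0 here, since the sketch only models speeding up, not slowing down or padding with silence.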
TTS voices also have a limited emotional range, and this is true for all TTS. The voice fonts have gotten really good at expressing very general emotions – for example, their intonation rises at the ends of sentences, and they get louder when there’s an exclamation point in the text. However, they can’t express emotion beyond this. You’ll notice that in the videos above, the performances aren’t exactly Oscar-worthy.
This isn’t an issue, however, for most corporate or e-Learning content, or for any multimedia product whose main purpose is specifically to relay information.
TTS isn’t available in all languages, though most of the major world languages have some TTS voice fonts. However, while English has dozens of voice fonts, most languages have only one or two. For example, Czech video translation projects using TTS have more voices available to them than Farsi video translation projects. And a language with many speakers doesn’t necessarily have many voice font options. This is true, for example, of Chinese TTS.
This means that the casting in English-language videos can’t always be fully reproduced in other languages, even if they have TTS voices available to them. Fortunately, most corporate and e-Learning content doesn’t have large casting requirements, but again, it’s good to keep in mind. In the videos above, we sometimes had a slightly hard time matching French and Spanish TTS voices to the original English ones.
TTS is definitely not an ideal solution for online marketing video translation, TV or radio spots, or feature films – or any content that relies on performance, turn of phrase, or even regional accents. But if it’s workable for your content, TTS provides a cost-effective workflow that can be implemented incredibly quickly, especially for large amounts of content. The key is to explore TTS early on in the localization process (if possible, before developing the English-language source), to determine whether it’s right for your videos, and whether there is enough TTS support for your target languages. If so, it could dramatically lower your multimedia localization budget and shorten your timeline.