To clean or not to clean in voice-over audio translation


Cleaning is a must for all professional, high-quality voice-over productions. However, the elements that get cleaned for corporate and e-Learning voice-over are not the same as for entertainment or marketing applications. Even some e-Learning content has different cleaning requirements. So how do you know when to clean an element out of the audio – and when not to?

This post will list the three situations that require a different standard for cleaning in audio localization services.

[Average read time: 4 minutes]

First, what is cleaning?

Cleaning is part of the post-production phase, the studio editing component of voice-over translation services. A recorded audio file gets “cleaned” once the talent and director have gone home, leaving the studio engineer with a raw audio file. The engineer then reviews the audio that’s been recorded and removes elements that don’t sound good – like long pauses, or jarring cuts, or anything else in the track that’s not pure voice over.

[For a fuller explanation of what cleaning is – and for video examples – go to our previous post, What does it mean to clean files for audio & video translation?]


For most narration – whether for corporate, e-Learning or marketing applications – the engineer will clean out any audible breaths, either by deleting them or by silencing them. This is usually the bulk of the cleaning work, and one of the more time-consuming parts of post-production. In fact, really great talents develop breathing techniques that isolate their breaths, making it easier for engineers to get rid of them. The engineer will also perform other tasks during cleaning, including normalizing the audio wherever the level shifts, tightening up pauses, and getting rid of voice imperfections. There’s a lot more to it, but the gist is that the engineer takes raw voice-over recordings and turns them into a high-quality, professional audio deliverable.
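For readers who like to see things concretely, here is a minimal Python sketch of two of the tasks just described: silencing breaths and tightening pauses. It is an illustration only, not how any particular studio or audio editor works. It assumes the audio is a plain list of sample values between -1.0 and 1.0, and that the engineer has already marked where the breaths are. The function names, the threshold and the sample counts are all hypothetical.

```python
def silence_breaths(samples, breath_ranges):
    """Return a copy of samples with each marked breath set to silence.

    breath_ranges is a list of (start, end) sample indices, end exclusive.
    Assumption: the ranges were marked by hand by the engineer.
    """
    cleaned = list(samples)
    for start, end in breath_ranges:
        # Clamp the range so a sloppy marker cannot run past the audio.
        for i in range(max(start, 0), min(end, len(cleaned))):
            cleaned[i] = 0.0
    return cleaned


def tighten_pauses(samples, threshold=0.01, max_silence=3):
    """Shorten every run of near-silent samples to at most max_silence samples.

    A sample counts as near-silent when its absolute value is below threshold.
    Both defaults are illustrative, not recommended settings.
    """
    cleaned = []
    quiet_run = 0
    for sample in samples:
        if abs(sample) < threshold:
            quiet_run += 1
            if quiet_run > max_silence:
                continue  # drop the excess silence
        else:
            quiet_run = 0
        cleaned.append(sample)
    return cleaned


# A toy "recording": a word, a breath, a long pause, then another word.
raw = [0.5, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6]
without_breath = silence_breaths(raw, [(1, 3)])
final = tighten_pauses(without_breath, threshold=0.01, max_silence=3)
```

In real post-production the engineer makes these calls by ear, and the situations below are exactly the ones where a blanket rule like this would do more harm than good.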

So when does the engineer do “less” cleaning – or really, a different kind of cleaning?

1. When the audio is tied to visuals in which people breathe

If you’re standing in front of someone and he or she breathes in or out, you expect to hear a sound, even if it’s very slight. The same goes for speakers in a video – when they open their mouths to inhale, usually before speaking, the audience unconsciously expects to hear the sound of air moving. Therefore, cleaning breaths out of audio that’s matched to an on-screen presenter usually comes across as weird or unnatural. For this reason, the engineer has to take a completely different approach when cleaning videos with presenters or on-screen actors (like e-Learning scenarios). This can actually be quite difficult, especially when cutting between shots that have different microphone placements, since the audio then comes in at different levels. Sound engineers often spend hours just cutting and getting the breath levels to match.

This goes for dubbing as well – if a source speaker breathes, the dubbed voice must either breathe as well, or replace the original breath with a syllable or word that fits with the mouth sounds (usually open syllables fit best). Note that this isn’t the case for video translation services like UN-Style and Dialogue Replacement, since these services don’t line up the foreign voice-over with an English-language speaker’s lips, so cleaning breaths out of them isn’t jarring to the viewer.

2. When there is emotion or performance

Humans use breathing to denote emotion. We gasp when frightened, exhale sharply when frustrated, huff when angry, and sigh when relieved. In fact, different kinds of breaths are such a staple of acting technique that it’s a well-worn joke that actors who don’t have much experience will “exhale” their performances.

Because of this, engineers working on any voice-over application that requires performance – like video games, radio plays, book narration and audio e-Learning scenarios – leave breaths in when they convey emotion. This can sometimes get really tricky, since video game VO talents will also breathe normally during a session, or between lines. A skilled audio engineer will know which breaths to cut, and which to leave in – and sometimes even how to massage the audio to make the breaths themselves sound better.

Lip-sync dubbing is the most complex of all audio localization services in this regard, since its breaths convey emotion and also have to line up with any visible lip movements on-screen.

3. For technical applications

Some voiceover recording applications – for example, to develop voice recognition systems, or to create tests for different training programs – require leaving breaths and extraneous noises made by the VO talents in the recordings. Usually, the goal of these recordings is to replicate a real conversational situation – in which case, cleaning can adversely affect how “real” the recordings sound. While this category is less common, it’s always crucial to check with these kinds of clients when developing a cleaning workflow.

Client input is crucial for special applications

Ultimately, if you’re in doubt about how to clean files, get input from the client as early in the process as possible. While most audio localization services, from narration to dubbing, have well-defined standards for how to clean, there are sometimes very clear exceptions, especially for new media. It’s crucial for professional voice-over companies to discuss the final deliverables with clients – in particular how they’ll be used – to make sure the cleaning meets the requirements for those deliverables.