Captioning and subtitling are now a must for any video production project: many countries require captions or subtitles for the deaf and hard of hearing (SDH), and subtitle translation is incredibly cost-effective. Tab-delimited and coded text deliverables, in particular, have made distribution cost-effective, scalable, and easy to implement, though certainly not trouble-free. After all, every text deliverable is digital code, and a single character inserted or deleted during translation can render it unusable.
This post lists the three most common code errors introduced during translation, and what you can do to avoid them.
Text files are the most-requested deliverable for subs video translation. This is quite a change from even five years ago, when most projects required burning them to picture, or delivering them as graphics for overlay. There’s a simple reason for this change – online video streaming platforms like YouTube, Vimeo, Netflix, Amazon, and Hulu, as well as channel-based apps. As more content has moved online – including TV shows, movies, marketing spots, e-Learning and instructional videos – the subtitle deliverables have followed suit. So have dubbing and voice-over deliverables, by the way – most of those platforms also support multiple foreign-language audio channels.
Today multimedia localization professionals must have a working knowledge of text-based formats like SRT, STL, WebVTT, and SCC. If they’re working on e-Learning or corporate content, that adds DFXP and TTML to the mix, XML-based formats that became common with Flash-based video players. These formats vary widely in complexity, ranging from SRT, which displays time-codes and caption units in a relatively user-friendly way, to SCC, which encodes each character as a hexadecimal code, as you can see in the following side-by-side comparison:
No matter their complexity, all caption text files have one thing in common: strict structural requirements. A single change to the time-codes, or even to the number of tabs or spaces in a file, can have disastrous consequences. Naturally, this is an issue in translation because linguists often have access to this code as they do their work, and sometimes they make mistakes.
The following are the most common mistakes, the ones that can be real “code-killers.”
Can you spot the issues in this SRT file (it’s the same one as above)?
They’re circled in the following picture:
There were actually four issues – a sequence of two hyphens converted to an em-dash (#1), a space inserted in the middle of an SRT “arrow” (#2), a space inserted before an end time-code (#3), and a tab inserted at the end of a line (#4). Note that #3 is nearly invisible, and #4 completely so.
All of these mistakes would’ve caused issues when adding this SRT to a video on most online players. Most players do give an integration warning, but usually only for the first line with an issue. These issues can also be very difficult to fix, especially when they’re invisible, like the tab above. And since an SRT file for a feature film or TV show can have hundreds or even thousands of segments, a widespread issue can mean hours of frustrating labor.
These mistakes are also very easy to make: people lose their place in a long document and hit a stray key, or delete text and then fail to replace it correctly. Ultimately, simple human error will cause this kind of issue in the work of even the most diligent linguists.
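Because stray spaces, tabs, and converted dashes are so hard to see, it can help to run a small script over the translated file before delivery. What follows is only a minimal Python sketch, not a full SRT validator: the function name `lint_srt`, the `SAMPLE` text, and the assumption that timing lines use the strict `HH:MM:SS,mmm --> HH:MM:SS,mmm` form are all illustrative. Unlike most players, it reports every suspicious line rather than only the first.

```python
import re

# Strict form of an SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING_RE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
# Loose test for "this line was probably meant to be a timing line"
LOOKS_LIKE_TIMING = re.compile(r"^\s*\d{1,2}:\d{2}:\d{2}")


def lint_srt(text):
    """Return (line_number, message) for every suspicious line in the file."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if line != stripped:
            # repr() makes invisible characters such as tabs visible
            problems.append((number, "trailing whitespace: " + repr(line[len(stripped):])))
        if LOOKS_LIKE_TIMING.match(line):
            if "\u2014" in stripped or "\u2013" in stripped:
                problems.append((number, "dash character in place of the '-->' arrow"))
            elif not TIMING_RE.fullmatch(stripped):
                problems.append((number, "malformed timing line: " + repr(stripped)))
    return problems


# Hypothetical sample containing the four kinds of issue described above
SAMPLE = (
    "1\n"
    "00:00:01,000 \u2014> 00:00:03,000\n"   # "--" converted to an em-dash
    "First line\n"
    "\n"
    "2\n"
    "00:00:03,500 -- > 00:00:05,000\n"      # space inside the arrow
    "Second line\n"
    "\n"
    "3\n"
    "00:00:05,500 -->  00:00:07,000\n"      # extra space before the end time-code
    "Third line\t\n"                        # invisible trailing tab
)

if __name__ == "__main__":
    for number, message in lint_srt(SAMPLE):
        print(f"line {number}: {message}")
```

A subtitle line that happens to begin with a clock time would be flagged as a false positive, so treat the output as a list of lines to review rather than a verdict.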
Many text files support font formatting, screen placement, or various other special formats. Most of them, in fact, use XML-style tags, even in some formats that aren’t XML-based to begin with. If you’ve translated XML content, you know how easy it is to misplace those tags: even one <i> opening tag (for italics) without its corresponding </i> will throw off an entire string. That applies to tag hierarchies as well: just one tag out of place will invalidate the code. We see this issue regularly in German localization, for example, since German syntax differs so much from English that translation requires moving a lot of tags around.
If you’re using a format with tags, make sure that your linguists are familiar with XML in general or the file format specifically.
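A quick automated check can also catch unbalanced or mis-nested tags before a file goes to QA. The following is a minimal Python sketch under simple assumptions: tags look like `<i>`, `</i>`, or `<br/>`, and each subtitle string is checked on its own. The name `check_tags` is hypothetical, and for fully XML-based files such as DFXP or TTML a real XML parser would be the more reliable choice.

```python
import re

# Matches <name ...>, </name>, and self-closing <name ... />
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def check_tags(text):
    """Return a list of tag problems found in one subtitle string."""
    problems = []
    stack = []  # names of tags that are currently open, innermost last
    for match in TAG_RE.finditer(text):
        closing, name, self_closing = match.groups()
        name = name.lower()
        if self_closing:
            continue  # <br/> opens and closes itself
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    # Anything still on the stack was opened but never closed
    problems.extend(f"<{name}> is never closed" for name in stack)
    return problems


if __name__ == "__main__":
    for line in ["<i>Guten Tag</i>", "<i>Guten Tag", "<b><i>Hallo</b></i>"]:
        print(repr(line), "->", check_tags(line) or "ok")
```

The stack is what catches hierarchy mistakes: a closing tag is only accepted if it matches the most recently opened tag, so tags that are merely swapped during translation are reported too.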
Most of the time, time-code errors occur as simple human mistakes, much like the inserted spaces, characters or tabs in the first item. But sometimes they happen because linguists change the time-codes themselves, usually to combine two English-language segments, or to split them up when the translations are too long to fit. The mistakes fall into two main camps: first, time-code structure errors (like a missing reel number, a missing colon, a frame number that doesn’t fit within the frame rate, or just a missing decimal digit); and second, time-codes that overlap with the previous or next subtitle, which are common when translators try to lengthen the on-screen time of a particular segment.
These mistakes are particularly difficult to fix, especially for double-byte language projects like Japanese and Chinese subtitling, since fixing them requires both a linguist and a professional time-coder who can re-spot those sections of the video.
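Both kinds of time-code mistake, structural errors and overlaps, can be flagged automatically so that the time-coder knows exactly which segments to re-spot. Here is a minimal Python sketch, assuming SRT-style `HH:MM:SS,mmm` time-codes for the structure and overlap checks and frame-based `HH:MM:SS:FF` time-codes for the frame check. All function names are hypothetical, and the cue list is assumed to have been extracted from the file already.

```python
import re

SRT_TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_srt_time(stamp):
    """Convert 'HH:MM:SS,mmm' to milliseconds, or raise ValueError."""
    match = SRT_TIME_RE.fullmatch(stamp)
    if match is None:
        raise ValueError(f"malformed time-code: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes or seconds out of range: {stamp!r}")
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def frame_in_range(stamp, fps):
    """Check the frame field of a frame-based 'HH:MM:SS:FF' time-code."""
    frame = int(re.split(r"[:;]", stamp)[-1])  # ';' marks drop-frame time-codes
    return 0 <= frame < fps


def find_timing_problems(cues):
    """cues: list of (start, end) strings in playback order."""
    problems = []
    previous_end = None
    for index, (start, end) in enumerate(cues, start=1):
        try:
            start_ms, end_ms = parse_srt_time(start), parse_srt_time(end)
        except ValueError as error:
            problems.append(f"cue {index}: {error}")
            previous_end = None  # cannot compare the next cue against a broken one
            continue
        if end_ms <= start_ms:
            problems.append(f"cue {index}: ends before it starts")
        if previous_end is not None and start_ms < previous_end:
            problems.append(f"cue {index}: overlaps the previous cue")
        previous_end = end_ms
    return problems


if __name__ == "__main__":
    cues = [
        ("00:00:01,000", "00:00:04,000"),
        ("00:00:03,500", "00:00:06,000"),  # starts before the previous cue ends
        ("00:00:06,000", "00:00:08,00"),   # missing a decimal digit
        ("00:00:08,000", "00:00:09,000"),
    ]
    for problem in find_timing_problems(cues):
        print(problem)
```

A script like this only locates the problems. Deciding how to re-time an overlapping segment still needs a linguist and a time-coder working from the video.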
Fortunately, you can avoid these issues by doing the following:
That last point is good advice for audio and video translation projects in general. Rushing through caption & subtitle projects can lead to more human error, which means more bugs and longer QA cycles. Planning projects thoroughly, even during the original English-language project’s post-production, is the best way to make sure that audio & video translation projects run smoothly, release on time, and stay on budget.