A Few Words on Internationalization

By Pavel Doronin, translated by Julia I Oden.

Having read Habr for a while now, I have noticed that there are only a handful of intelligible articles on software localization aimed at developers. Based on my experience managing localization projects, I can say that localization is not just about translating strings one by one and adapting an application to this or that country, but is also a constant battle (in an ideal case, a productive collaboration) with its developers.

In this article, I will try to show, using real examples, how to create so-called “localization-friendly code”: that is, how to organize resources so as to substantially ease software localization and avoid unnecessary time and expense.

I need to clarify upfront – the primary focus will be internationalization: the process of accounting for all linguistic particulars during the development stage. If your project resources did not account for localization from the beginning, yet you decided to brave the waters at a later time, “honing” them to localization standards can be much more costly than setting them as a goal from the start.

Use Unicode

In most cases, the question of using a Unicode encoding such as UTF-8 (or UTF-16) comes up when planning localization into Asian languages, where the number of characters can reach several thousand. Even if localization into Korean or Chinese is not currently planned, it is worthwhile to adopt a universal encoding ahead of time. If the localization strategy of your product changes, it will be much more difficult to switch to another encoding mid-stream.

Tip: for all resources, use Unicode as your default, even if the project, for now, is only in Russian/English/any other language.

By the way, JSON and YAML specifications (these formats are often used for saving the localized resources) assume the use of Unicode.
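As a minimal sketch of the tip above, here is one way to write and read a JSON resource file with an explicit UTF-8 encoding in Python. The file name, the resource key and the structure of the dictionary are assumptions made up for the illustration.

```python
import json
import os
import tempfile

# Hypothetical resource strings; the key name is illustrative only.
resources = {
    "save_as": {"ru": "Сохранить как", "fi": "Tallenna nimellä", "zh": "另存为"}
}

path = os.path.join(tempfile.mkdtemp(), "strings.json")

# Always name the encoding explicitly instead of relying on the platform default.
# ensure_ascii=False keeps the text human-readable in the file
# (otherwise non-ASCII characters are written as \uXXXX escapes).
with open(path, "w", encoding="utf-8") as f:
    json.dump(resources, f, ensure_ascii=False, indent=2)

# Read the file back with the same explicit encoding.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == resources)
```

Naming the encoding on both the write and the read side keeps the result independent of the operating system's default encoding.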

Beware of fonts

This seemingly minor detail is often a critical factor stalling localization. Make sure that the fonts you use have glyphs for the languages you are localizing into (primarily, again, Asian languages, as well as Hebrew, Arabic, and the diacritical marks of European languages).

Remember that

ä, à or ą ≠ a

same as, in Russian, “е” does not always equal “ё.”

I worked on a project where the developers had drawn a font containing only English letters. When it came to localizing into German and Polish, they had to add the letters with diacritical marks.
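One way to find out early which letters a custom font must contain is to collect every non-ASCII letter that appears in the translated resources. The sketch below does only that; it does not inspect the font file itself, and the sample strings are invented for the illustration.

```python
import unicodedata


def required_characters(strings):
    """Return the sorted list of non-ASCII letters used in the given strings."""
    chars = set()
    for text in strings:
        # Normalize to NFC so that "a" followed by a combining diaeresis
        # is counted as the single letter "ä".
        for ch in unicodedata.normalize("NFC", text):
            if ord(ch) > 127 and ch.isalpha():
                chars.add(ch)
    return sorted(chars)


# Hypothetical German and Polish strings.
samples = ["Änderungen übernehmen", "Usunąć plik"]
print(required_characters(samples))
```

The resulting list can be handed to whoever draws or licenses the font before translation starts.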

Leave room for maneuvering

Besides fonts, translating an application’s text (strings) hides yet another obstacle.

Compare the translation of one menu item into different languages:

ru: Сохранить как
en: Save as
fi: Tallenna nimellä
zh: 另存为

The Chinese translation requires just three characters, while the Finnish translation requires a full sixteen! Besides the sheer number of characters, the specific characteristics of the font also matter.

Let’s compare the rendered length of the Finnish and Chinese lines (both set in Arial Unicode MS, 12): the Finnish text (114 px) is 2.5 times longer than the Chinese (45 px).

It is highly important to leave extra space in interface elements to avoid cutting off the text. If in certain cases there is not enough space, it is possible to use automatic text resizing. Yet this will most likely leave the final text at different sizes in different elements of the interface.
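A rough check of this kind can be automated. The sketch below flags translations that exceed a character budget for an interface element; the function name and the budget of 12 characters are assumptions. Character count is only a proxy, since (as the pixel measurements above show) the real width depends on the font.

```python
import unicodedata


def over_budget(translations, max_chars):
    """Return {locale: length} for every translation longer than max_chars."""
    result = {}
    for locale, text in translations.items():
        # Count composed characters, so "ä" is one character however it is stored.
        length = len(unicodedata.normalize("NFC", text))
        if length > max_chars:
            result[locale] = length
    return result


save_as = {
    "ru": "Сохранить как",
    "en": "Save as",
    "fi": "Tallenna nimellä",
    "zh": "另存为",
}

# Assume the menu item was designed with room for 12 characters.
print(over_budget(save_as, 12))
```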

Pseudolocalization

Pseudolocalization can help catch problem spots before translation begins. It is a way to test whether an application is ready for localization. In place of translated text, the development resources are filled with a pseudo-language generated by an algorithm (which depends on the software at hand). In the most primitive example, English text is replaced with its Cyrillic transliteration or transcription:

Save as -> Саве ас
Save as -> Сэйв аз

This method allows us to check for the following:

Are diacritical marks displayed correctly (e.g., German, Polish)?
Are languages with different scripts displayed correctly (e.g., Chinese, Russian)?
Are there any issues displaying interface elements for languages with right-to-left text direction (e.g., Arabic)?
Are there issues displaying non-standard characters (e.g., in usernames)?
Have all localizable resources been extracted into separate files (using text directly within code causes numerous issues; see the part on hardcoding below)?

Often, machine translation of the text into the target language is used for pseudolocalization. On the one hand, this is a simple solution when no special tool for generating pseudolocalized text is available. On the other hand, I have seen more than once how developers confused localized resources with pseudolocalized ones and even overwrote a proper translation with machine-translated files from previously saved versions. Moreover, machine translation does not always exercise all the characters of a language (for example, the letter œ is rare in texts, yet it is important to test that it displays correctly).

For example, this is what the pseudo-translation plugin interface of the MemoQ software looks like:

[MemoQ Screenshot]

And this is what the result looks like with those settings:

[Results Screenshot]
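When no dedicated tool is at hand, a very small pseudolocalizer can be sketched in a few lines. The version below is only an illustration and is not how MemoQ or any other product works: the character map, the 30% padding and the bracket markers are arbitrary assumptions. It replaces some ASCII letters with accented ones, pads the string to simulate longer translations, and leaves {placeholders} untouched.

```python
import re

# Arbitrary illustrative mapping from ASCII letters to accented look-alikes.
_ACCENTED = str.maketrans({
    "a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú",
    "A": "Á", "E": "É", "I": "Í", "O": "Ó", "U": "Ú",
    "c": "ć", "n": "ń", "s": "ś", "S": "Ś", "z": "ż",
})

# The capturing group makes re.split keep the placeholders in the result.
_PLACEHOLDER = re.compile(r"(\{[^{}]*\})")


def pseudolocalize(text, expansion=0.3):
    """Return a pseudolocalized copy of text, padded and wrapped in brackets."""
    parts = _PLACEHOLDER.split(text)
    # Odd indices are placeholders such as {file}; they must stay intact.
    converted = "".join(
        part if index % 2 else part.translate(_ACCENTED)
        for index, part in enumerate(parts)
    )
    # Padding simulates languages that need more room than the source text.
    padding = "~" * round(len(text) * expansion)
    # Brackets make truncated strings easy to spot in the interface.
    return f"[{converted}{padding}]"


print(pseudolocalize("Save as"))
print(pseudolocalize("Open {file}"))
```

Any string that appears in the interface without brackets and accents was probably hardcoded rather than loaded from resources.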

External resources

In order to get a full overview of the localization materials, it is necessary to extract all resources from the code base. Multimedia content containing text (most often images, but also video and audio, as in games) should be stored separately, sorted by locale. Firstly, this will significantly simplify the job of content creators, as they will not have to dig through code when they need to correct some system message. Secondly, it will allow the localization manager to correctly calculate timelines and budget for each language. Thirdly, it will give significantly more flexibility in working with multilingual content.

The most popular formats for exchanging localization data are XLIFF and .po files. Through a variety of filters, modern computer-assisted translation systems are capable of converting various files into formats usable by translators.

Google and Apple also insistently advise developers to extract all localization resources.

Hardcoding in internationalization

Localization assumes not only word translation, but also the adaptation of numbers, units of measurement, date and time formats, as well as punctuation marks to fit local standards.

Punctuation marks

Many developers like to “sew in” punctuation marks into code, thinking that surely periods and question marks are the same across languages. Yet compare the following:

ru: Вы уверены?
en: Are you sure?
fr: Êtes-vous sûr ?
es: ¿Está seguro?
ar: هل أنت متأكد؟

In French, the question mark is separated from the text by a space (incidentally, Habr insisted on removing the space before the question mark, so I had to get creative with tags). In Spanish, a question is framed by an inverted question mark at the beginning and a regular one at the end, whereas in Arabic the question mark sits at the left end of the phrase and is mirrored. If the question mark is generated from code, not all users will be comfortable reading the message (unless the code accounts for locale differences, but why resort to such contortions?).

Besides punctuation marks, it is important to be careful with spaces; trusting the code to insert them would be a mistake. There are languages that do not use spaces between words, as in Japanese. It is said that localization of Japanese and Chinese applications/programs into European languages can be pure hell if developers do not account for such a nuance as word spacing differences among languages.

Punctuation is part of the text, so it should be carried out into external resources.
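To make the contrast concrete, here is a small sketch that compares a hardcoded question mark with complete sentences stored as resources. The dictionary layout and function names are assumptions; the no-break space in the French string is my own choice, so that the question mark cannot wrap onto the next line.

```python
# Complete sentences, punctuation included, stored per locale.
CONFIRM = {
    "ru": "Вы уверены?",
    "en": "Are you sure?",
    "fr": "Êtes-vous sûr\u00a0?",  # assumption: no-break space before the mark
    "es": "¿Está seguro?",
    "ar": "هل أنت متأكد؟",
}

# Anti-pattern: the same sentences stored without punctuation.
CONFIRM_BARE = {"en": "Are you sure", "es": "Está seguro", "ar": "هل أنت متأكد"}


def confirm_prompt(locale):
    """Return the whole localized sentence, falling back to English."""
    return CONFIRM.get(locale, CONFIRM["en"])


def confirm_prompt_hardcoded(locale):
    """Anti-pattern: the question mark is appended in code for every language."""
    return CONFIRM_BARE[locale] + "?"


print(confirm_prompt("es"))            # opening and closing marks are both present
print(confirm_prompt_hardcoded("es"))  # the opening mark is missing
```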

Numbers

Numbers, like words, require translation. Many developers forget that and hard-code numbers in the format familiar to them. Let’s compare:

ru: 18 765,22
en: 18,765.22
de: 18.765,22
he: 18,765.22
el: 18.765,22
fa: 18٬765٫22

Notice which symbols serve as the decimal and thousands separators. In English and Hebrew, the period and the comma play the opposite roles from those they play in German and Greek. In Russian, a space is used to separate the thousands in numbers above 9999. Farsi has dedicated symbols: the decimal separator, called momayyez (U+066B), and the thousands separator (U+066C); yet usage is not fully standardized, so a comma or even a space can also serve as the thousands separator. These may seem like trifles: “those who need to understand this will understand it in any format.” However, such little things can sometimes lead to serious misunderstandings, especially when it comes to prices and important engineering calculations.
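The idea of keeping separators out of the code can be sketched as follows. The table holds only a few of the locales compared above and is hand-written for the illustration; a real project would normally take this data from a locale library rather than maintain it by hand. The function name is hypothetical, and the no-break space for Russian is my assumption (it keeps the number from being split across lines).

```python
from decimal import Decimal

# (thousands separator, decimal separator) per locale, for illustration only.
SEPARATORS = {
    "ru": ("\u00a0", ","),  # assumption: no-break space instead of a plain space
    "en": (",", "."),
    "de": (".", ","),
    "he": (",", "."),
    "el": (".", ","),
}


def format_number(value, locale):
    """Format value with two decimals using the separators of the given locale."""
    group_sep, decimal_sep = SEPARATORS[locale]
    # Python's format always produces "," for groups and "." for decimals;
    # split on the decimal point first, then swap in the locale's symbols.
    integer, _, fraction = f"{Decimal(value):,.2f}".partition(".")
    return integer.replace(",", group_sep) + decimal_sep + fraction


for code in ("ru", "en", "de"):
    print(code, format_number("18765.22", code))
```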

Speaking of prices, let’s compare:

ru: 2,25 €
en: €2.25
de-at: € 2,25
de-de: 2,25 €
lv: € 2,25
lt: 2,25 €

Monetary units are positioned differently in different languages, which means that it is better not to hard code these symbols, either. Especially since, as you can see, the norms differ not just among the languages, but also within the different versions of the same language (in Austria and Germany). Even the neighboring countries, like Latvia and Lithuania, have different norms.
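The same approach works for prices: the position of the currency symbol becomes part of a per-locale pattern instead of being concatenated in code. The patterns below simply restate the comparison above; the function name, the pattern syntax and the use of a no-break space are assumptions for the sketch.

```python
from decimal import Decimal

# Where the currency symbol goes, per locale (taken from the comparison above).
PRICE_PATTERNS = {
    "ru": "{amount}\u00a0{symbol}",
    "en": "{symbol}{amount}",
    "de-at": "{symbol}\u00a0{amount}",
    "de-de": "{amount}\u00a0{symbol}",
    "lv": "{symbol}\u00a0{amount}",
    "lt": "{amount}\u00a0{symbol}",
}

# In this small table only English uses a period as the decimal separator.
DECIMAL_SEPARATORS = {"en": "."}


def format_price(amount, locale, symbol="€"):
    """Format a price with two decimals according to the locale's pattern."""
    text = f"{Decimal(amount):.2f}"
    text = text.replace(".", DECIMAL_SEPARATORS.get(locale, ","))
    return PRICE_PATTERNS[locale].format(amount=text, symbol=symbol)


for code in ("en", "de-at", "de-de"):
    print(code, format_price("2.25", code))
```

Because the pattern is data, a translator or localization manager can correct it for one locale without touching the code.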

Units of measurement

Sometimes, it is necessary to adapt not only the outward appearance of a number to international standards but also the very number itself. I am talking about units of measurement. If they are used in your project, it is always good to find out which system of measurement is used in a particular country in order to report intelligibly to a user about speed, length, mass, temperature, etc.

A statement “You are moving at a speed of 62 miles per hour” will mean nothing to a driver from Pskov [Russia]. Similarly, “You are moving at a speed of 100 kilometers per hour” may put a Chicago driver into a stupor.

In such cases, it is not enough to simply present a different numeric variable; one needs to dig deeper and change the calculation formula depending on the location of the user. An ideal solution would still be to let the user change the setting within the software, independently of location. In either case, local units of measurement need to be accounted for.
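A minimal sketch of this logic, assuming the application stores speed in kilometers per hour internally: the unit system is chosen from the user's explicit setting if there is one, otherwise from a default per locale. The locale table and the function names are hypothetical, and the returned unit key is meant to be looked up in the translated resources, not shown as is.

```python
KM_PER_MILE = 1.609344

# Hypothetical defaults; any locale not listed is treated as metric.
DEFAULT_UNIT_SYSTEM = {"en-US": "imperial"}


def unit_system_for(locale, user_preference=None):
    """An explicit user setting always wins over the locale default."""
    if user_preference is not None:
        return user_preference
    return DEFAULT_UNIT_SYSTEM.get(locale, "metric")


def convert_speed(speed_kmh, unit_system):
    """Return (rounded value, unit key) for display in the given unit system."""
    if unit_system == "imperial":
        return round(speed_kmh / KM_PER_MILE), "mph"
    return round(speed_kmh), "km/h"


print(convert_speed(100, unit_system_for("en-US")))
print(convert_speed(100, unit_system_for("ru-RU")))
print(convert_speed(100, unit_system_for("en-US", user_preference="metric")))
```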

Not all languages have the same grammar principles

Dividing text into semantic segments

When organizing text strings, some developers do not take the grammatical structure of other languages into account and split each sentence into semantic fragments. As a result, texts are pieced together according to the rules of Russian syntax (or of the developer’s native language). An English translation can sometimes be squeezed into that formula (although not always), but with German, for example, with its rigid rules for word order and sentence structure, this way of assembling text yields complete nonsense. And with Arabic, which is written in the opposite direction, such a method of organizing content is completely useless.

Here is a rather well-known example. A Russian-speaking user sees this text: «До окончания тестового периода осталось 5 дней. Пожалуйста, введите действительный ключ.» (literal translation: “5 days remaining until the end of the trial period. Please enter the valid [activation?] key.”)

In source code, this message may look similar to this:

‘trialexpires_1’: “До окончания тестового периода ”
‘trialexpires_2sg’: “остался ”
‘trialexpires_2pl’: “осталось ”
‘trialexpires_4sg’: “ день.”
‘trialexpires_4pl2’: “ дня.”
‘trialexpires_4pl3’: “ дней.”
‘enterkey’: “Пожалуйста, введите действительный ключ.”

Admittedly, it is possible to get creative and “sew” these text “swatches” together in English so that the translation sounds quite natural. But in Arabic, where the text direction is different, this trick will not work. In German, separable verb prefixes tend to move to the end of the clause. Incidentally, pay attention again to the length of this phrase in different languages: the German version is 30% longer than the English one. Note the verbs (läuft … ab, geben … ein): in German they can consist of two parts, one of which can be quite a long way from its counterpart.

en: Your trial period expires in 5 days. Please enter the valid [activation] key.
de: Ihre Testversion läuft in 5 Tagen ab. Bitte geben Sie einen gültigen Produktschlüssel ein.

Another deficiency of this method is that, with such a presentation, the translator cannot always grasp the logic of a sentence and produce an accurate translation. Imagine how easy it is to get lost in these text strings when dealing with about five thousand of them. All this tells us that, if possible, it is best to put an entire sentence into resources as one string, so that it not only has a more universal format but is also understandable to the person responsible for translating it.

A possible solution to this situation is shown below:

‘trialexpires’: “До окончания тестового периода [count:остался|осталось] {%n} [count:день|дня|дней].”
‘enterkey’: “Пожалуйста, введите действительный ключ.”

The “count” operator (or whatever you wish to name it) inserts the appropriate word form depending on the numeric variable %n. An Arabic translator, who writes from right to left, would simply rearrange the variables.
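Here is one way such a “count” operator could be interpreted for Russian; it is a sketch under my own assumptions about the syntax, not the implementation of any particular framework. The plural rule is the standard Russian one (one, few, many). When an operator lists only two forms, the first is used for the “one” category and the second for everything else.

```python
import re

_COUNT = re.compile(r"\[count:([^\]]+)\]")


def russian_plural_index(n):
    """Return 0 (one), 1 (few) or 2 (many) for the integer n."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return 0  # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1  # 2-4, 22-24, ... but not 12-14
    return 2      # everything else: 0, 5-20, 25-30, ...


def render(template, n):
    """Resolve every [count:...] operator and substitute {%n} with the number."""
    index = russian_plural_index(n)

    def pick(match):
        forms = match.group(1).split("|")
        # With fewer than three forms, the last one covers the remaining cases.
        return forms[min(index, len(forms) - 1)]

    return _COUNT.sub(pick, template).replace("{%n}", str(n))


TEMPLATE = (
    "До окончания тестового периода "
    "[count:остался|осталось] {%n} [count:день|дня|дней]."
)

for days in (1, 2, 5, 11, 21):
    print(render(TEMPLATE, days))
```

Each language needs its own rule function, which is why the choice of forms belongs with the locale data rather than in the calling code.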

Page layout via a hard line break

Quite frequently, a problem arises when developers aim to achieve the necessary text layout via a hard line break.

Here is an example. The user sees the text in this way:

Этот текст такой большой,    This text is so large,
а окно такое маленькое,      And the window is so small,
что мне придётся разбить    That I have to break
его по строкам.                   It down into separate lines.

In resources, it can look like this:

‘menubox_string1’: “Этот текст такой большой,”

‘menubox_string2’: “а окно такое маленькое,”

‘menubox_string3’: “что мне придётся разбить”

‘menubox_string4’: “его по строкам.”

A translator looking at such horror will spend much more time trying to adapt the text to the target language. If the text is longer (as in German or French), four lines may not be enough. If the text is shorter (as in Japanese or Chinese), a couple of lines will be left empty. Not to mention that if translation memory software is being used (each line is added to a translation memory and then reused for identical and similar lines in the future), such line division prevents effective localization.

There are two ways around this. First, use automatic text sizing based on the window’s size. Second, if you do not wish to trust the machine to do your line breaks, use \n.

Then resource text will look like this:

‘menubox’: “Этот текст такой большой,\nа окно такое маленькое,\nчто мне придётся разбить\nего по строкам.”

English Translation:

‘menubox’: “This text is so large,\nand the window is so small,\nthat I have to break it down\ninto separate lines.”

In this case, line breaks are much more flexible. For example, a translator can be given the maximum number of characters per line and then be asked to create the most logical line breaks within that limit.
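Such a limit is easy to verify automatically once the whole text lives in one resource. The following sketch lists the lines of a translated string that exceed a given number of characters; the function name and the limits are assumptions for the example.

```python
def lines_too_long(text, max_chars):
    """Return the lines of text (split on \\n) that are longer than max_chars."""
    return [
        line.strip()
        for line in text.split("\n")
        if len(line.strip()) > max_chars
    ]


english = (
    "This text is so large,\n"
    "and the window is so small,\n"
    "that I have to break it down\n"
    "into separate lines."
)

# Assume the window fits 27 characters per line.
print(lines_too_long(english, 27))
```

A check like this can run on every delivered translation and send only the offending strings back to the translator.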

Excessive optimization

This mistake is often made by over-eager content managers, especially those who optimize English-language texts. In overly optimized resources, everything possible (all keywords and even whole phrases) is replaced with constants that are then substituted into the text with no regard for cases, articles and other grammatical aspects of a target language. Admittedly, this allows for better control over terminology consistency, as well as possibly significant savings in translation expenses. However, every optimization needs to be reasonable. Let’s look at this example:

The user sees the following text:

You can launch the application from the terminal. Press F2 to access the terminal.

In resources, it is pieced together with the following strings:

‘cmd’: “the terminal”
‘app’: “the application”
‘act_42’: “Press F2”
‘run_from_terminal’: “You can launch {app} from {cmd}. {act_42} to access {cmd}.”

Let’s suppose that the interface has many repeated words and phrases, which the content manager decides to replace with constants and then uses within the text wherever convenient. If the word “terminal” is suddenly deemed inappropriate and is replaced with “command line” (or even with, say, “menu”), there is no need to rework a large amount of text. It is sufficient to change the value of a single constant.

An added bonus would be the reduction of the overall word count. Most translation costs are based on the number of words (much less commonly on the number of lines), so you would think that the overall cost of localization can be reduced. Yet that is not the case. Remember, I mentioned earlier that not all languages follow the same grammatical rules. This is highly important here.

Let’s see what the resources would look like when translated this way into Russian.

‘cmd’: “терминал”
‘app’: “приложение”
‘act_42’: “Нажмите F2”
‘run_from_terminal’: “Вы можете запустить {app} из {cmd}. {act_42} для открытия {cmd}.”

The user will see the following:

«Вы можете запустить приложение из терминал. Нажмите F2 для открытия терминал».

If the word «приложение» (software program/application) is changed to «программа», things get worse, and the problem becomes more obvious.

«Вы можете запустить программа из терминал. Нажмите F2 для открытия терминал».

It is clear that case as a category is not considered in this approach. One does not need to look far for similar examples. It is enough to look at an awful localization in Foursquare:

Or look at filter names. Not all of them are options for continuing the phrase «Показать места…» (“show places”). Possibly, this is a reflection of constants being used elsewhere. Or just a thoughtless translation and lack of localization QA.

Facebook constantly improves its localization through volunteer users (not too long ago, they finally did post a vacancy for a localization manager, so hopefully things will get even better), yet this particular line does not sound very Russian since it is created based on the rules of the source language.

Translator’s Note: The problem here is that in Russian the object of the preposition “Perm” is not in agreement with the verb “studied”.

(Studied in Perm State University)

In the Russian version, it is better to write «Место учёбы: %ВУЗ%» (“Place of Study: %VUZ%”, where VUZ is the Russian acronym for a higher education institution).

Here is a similar example from another area:

2005 Graduation from Perm State University (same agreement issues here as well – translator’s note)

The conclusion: text constants are undoubtedly useful, yet they have to account for other grammatical systems. Constants are best suited to numeric values, units of measurement (taking into account the grammatical particulars of each language; Russian, for example, has two plural forms: 1 уровень, 2 уровня, 5 уровней [Translator’s Note: “1 level, 2 levels, 5 levels” all have different endings in Russian]), titles and names (such as names of software products), and keyboard key names.
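This conclusion can be turned into a simple resource check. The sketch below flags placeholders that are not on an allowlist of grammatically “safe” constants; the allowlist names (n, product_name, key) and the {name} placeholder syntax are hypothetical and would have to match the conventions of a real project.

```python
import re

# Hypothetical allowlist: numbers, product names and keyboard keys.
SAFE_PLACEHOLDERS = {"n", "product_name", "key"}

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def risky_placeholders(resource):
    """Return the placeholders in resource that are not on the allowlist."""
    return sorted(set(_PLACEHOLDER.findall(resource)) - SAFE_PLACEHOLDERS)


print(risky_placeholders(
    "You can launch {app} from {cmd}. {act_42} to access {cmd}."
))
print(risky_placeholders("Press {key} to open {product_name}."))
```

Strings flagged this way are candidates for being rewritten as complete sentences before they reach the translators.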

Conclusion

Traditionally, software localization has been separated from development. Moreover, many project managers think of localization as a simple substitution of the original text with foreign text. As a result, the whole product suffers, due to these issues:

Unoptimized resources increase localization workload;
Bugs found during the localization process increase the time it takes to bring the product to market and lead to yet more work to resolve them;
Localization budget increases constantly;
“Skewed” localization affects the number of software purchases/downloads in any particular region and gives competitors an extra advantage. My personal opinion – a poorly localized product is much worse than a non-localized one.

Even if a program has been designed for a local market, localization may still be necessary. It is possible that in another couple of years there will be a need for “Yandex.maps” in the Tajik language. (Translator’s note: The author is referring to a surge of migrants from Tajikistan in Russia over the last few years.)

Translator’s Note: Yandex is a popular Russian-language search engine, offering many services similar to Google.

Try to develop applications with internationalization in mind, and work with your localization manager and your translation agency from the development stage onward, in order to save yourself time, resources and money, as well as to ensure the highest-quality local versions of your products.