Sunday, January 16, 2005

Naming files in a multilingual site

If you plan on translating your website to several languages, one of your first issues will be devising a naming scheme for the translated files and folders. I have some advice for you.

Don't change the file names! Don't change the folder names!
Instead, put the translated set of files in a new parent folder named for the target language, e.g., /ger for German. All the original files and folders (now translated to German, of course) should branch out from here. That way, all of your relative links from file to file should work without modification.


The above illustration shows a well-structured trilingual website in English, German and Spanish. The site was originally designed in English to contain three folders: /images, /press and /products. At the time, no one thought the site would be translated, so a folder for English was never contemplated, but that's OK.

Later, German and Spanish were added under the folders /ger and /spa. Below these language folders, the structure of the original site is replicated. The names of the HTML and image files remain exactly the same so that the relative links don't need to be touched. That's it.
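
For readers who like to see the idea spelled out, here is a minimal Python sketch of why the relative links survive. The file names (index.html, logo.gif) and the helper functions are made up for illustration; only the folder names come from the example above.

```python
import posixpath

def translated_path(original: str, language_folder: str) -> str:
    """Place a site-relative path under a language folder such as 'ger'."""
    return posixpath.join("/", language_folder, original.lstrip("/"))

def resolve_relative(page: str, link: str) -> str:
    """Resolve a relative link as seen from the page that contains it."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(page), link))

# Hypothetical page and link, for illustration only.
english_page = "/press/index.html"
link = "../images/logo.gif"

german_page = translated_path(english_page, "ger")

# The same, untouched relative link resolves inside each language tree.
print(resolve_relative(english_page, link))
print(resolve_relative(german_page, link))
```

The link text never changes. Only the starting point moves, so the German page finds the German copy of the image.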

PDF files
I wrote this very basic posting to mention an exception that proves the rule: PDF file names. PDF files should incorporate their language in the file name itself, e.g., brochure_eng.pdf and brochure_spa.pdf. Why? Because PDF files have a life beyond the website where they are stored, as attachments to emails and as loose files on someone's hard disk. The file name should show the language that the file is written in. Furthermore, if the brochure is written in three languages under the same name and you want to attach all three versions to the same email message, you'll have a nasty little naming conflict.
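
If you generate your PDFs in bulk, the renaming is easy to automate. This is only a sketch: the function name is hypothetical, and the three-letter codes simply follow the examples above.

```python
from pathlib import PurePosixPath

def pdf_name_with_language(filename: str, language_code: str) -> str:
    """Insert a language code before the extension: brochure.pdf -> brochure_eng.pdf."""
    path = PurePosixPath(filename)
    return f"{path.stem}_{language_code}{path.suffix}"

# Three versions of the same brochure, safe to attach to one email.
attachments = [pdf_name_with_language("brochure.pdf", code)
               for code in ("eng", "ger", "spa")]
print(attachments)
```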

Tuesday, January 11, 2005

BOM -- Huh?

Every now and then, we wake up to an ugly technical surprise, usually just before we're supposed to deliver a project to a client. One of the more memorable ones, and one that has had a lot of repercussions, was the day we met our first BOM.

What's a BOM?
I'm embarrassed to admit it now, but there was a time when I didn't know what a BOM was. Had I Googled on it, I would have seen it just a few entries below the Australian Bureau of Meteorology, but Google isn't much help when you don't know the name of what you're looking for. (Aside: why did the intelligent scientists at this bureau incorrectly elevate the status of "of" to the "O" in BOM? Is it that they don't want to be known as the "BM?")

According to the BOM FAQ on the Unicode home page, BOM is short for byte order mark, a single Unicode character (U+FEFF) that is stored as three bytes at the beginning of a UTF-8 file, where it can show up as three fancy characters. (UTF-8 is a compact but universal character encoding that we use a lot for multilingual websites. This blog is published in UTF-8. More about that in another posting.) Ironically, UTF-8 comes in only one byte order, so a BOM doesn't matter in the way it does with full-bore Unicode such as UTF-16.

The problem with the BOM is that some text editors interpret and hide the characters. You don't know that they're there. Here's what they look like:



These babies mean "You are about to see some UTF-8!" Are they necessary? Usually not, because a well-formed web page should make that announcement anyway with a META tag. Are they dangerous? You bet!
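
If you want to see the bytes for yourself, here is a small Python sketch. It assumes that an editor unaware of UTF-8 falls back to the Windows-1252 code page; your editor's fallback may differ. The helper names are mine.

```python
import codecs

# U+FEFF is the character code named in the Unicode definition of a BOM.
BOM_CHARACTER = "\ufeff"

def utf8_bom_bytes() -> bytes:
    """Encode the BOM character as UTF-8."""
    return BOM_CHARACTER.encode("utf-8")

def as_seen_by_legacy_editor(data: bytes) -> str:
    """Decode bytes as Windows-1252 (an assumption about the editor)."""
    return data.decode("cp1252")

bom = utf8_bom_bytes()
print(bom.hex(" "))                   # ef bb bf
print(bom == codecs.BOM_UTF8)         # True
print(as_seen_by_legacy_editor(bom))  # ï»¿
```

One character, three bytes, three odd-looking glyphs in an editor that doesn't know better.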

Notepad for WindowsXP and the BOM
If you're like us and you want to review and approve every single byte that you deliver to a customer, an unexpected BOM can play havoc with your plans, with your own and your tools' idea of how long the file is, and so on.

We learned about the BOM at the same time that we were upgrading to WindowsXP from Win98. One of the many gotchas was seeing our humble Win98 Notepad replaced by Notepad for WindowsXP, a much more take-charge program. The old Notepad shows you the BOM characters, while the new one does not! With the new Notepad, if you save a file and select UTF-8 as the encoding in the Save As dialog box, the BOM will be prepended to your file, an unnecessary but seemingly harmless little detail.

But is it really that harmless?

Let's look closely at the definition of a BOM:

BOM. The character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.

The operative term is "beginning." How can we be sure that a given text file that has been saved in Notepad will be the beginning of a data stream? It may be the footer of a web page, deployed with an include statement to the middle of another file. Now, unfortunately, our BOM has gone from the beginning of one file to the middle of another, and in that position it can do a lot of damage.
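
Here is a rough Python sketch of the include scenario. The page fragments are invented, and plain byte concatenation stands in for whatever include mechanism your server uses. It also shows one way to strip a leading BOM before the files are combined.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a UTF-8 BOM from the start of the data, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical fragments: the footer was saved with a BOM, the body was not.
body = b"<p>Body</p>\n"
footer = codecs.BOM_UTF8 + b"<p>Footer</p>\n"

# A naive include lands the BOM in the middle of the combined page.
combined = body + footer
print(combined.find(codecs.BOM_UTF8) > 0)   # True: the BOM is no longer at the start

# Stripping the BOM from each fragment first keeps the page clean.
clean = strip_utf8_bom(body) + strip_utf8_bom(footer)
print(codecs.BOM_UTF8 in clean)             # False
```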

Another ugly place for a BOM to land is in a database entry, where it can ruin your best-laid plans for running exact searches and concatenating strings.
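
A quick Python illustration, with an ordinary string standing in for a database field (the value is made up):

```python
BOM = "\ufeff"

# Hypothetical field value that picked up a BOM on its way into the database.
stored_value = BOM + "Barcelona"

def exact_match(field: str, query: str) -> bool:
    """An exact search: the field must equal the query, character for character."""
    return field == query

print(exact_match(stored_value, "Barcelona"))   # False: the invisible BOM breaks the match
print(len(stored_value))                        # 10, not 9

# Concatenation carries the BOM into the middle of the new string.
sentence = "Founded in " + stored_value
print(BOM in sentence)                          # True

# Cleaning the value on the way in avoids both problems.
cleaned = stored_value.lstrip(BOM)
print(exact_match(cleaned, "Barcelona"))        # True
```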

Forewarned is forearmed. Hope this helps.

Wednesday, January 05, 2005

If you want your website to sound like a robot...

...then use a machine to translate it!

You can use software to translate any kind of text. There are several sites that will do it for free. Perhaps the best are run by Systran and Google.

As a demonstration of the state of the art of computational science, as a personal tool or as a novelty item, machine translation is wonderful, but the quality of the result isn't reliable enough to use for production purposes. It's one thing for an end user to say, "I need to understand the gist of this article about my company in a German newspaper." It's another thing altogether for a company to say, "Let's use a machine to translate our website." In the former case, the user knowingly takes the risk of reading a botched or awkward translation. In the latter, users read a corporate website, unaware of the circumstances behind its translation. Their critical defenses are lowered, and they're vulnerable to any kind of misinterpretation.

How good is machine translation?
Machine translation is surprisingly good, but in the end it's not good enough.

Look at what Systran's engine does to the following sentence, translated from English to Swedish and (for those of us who don't speak Swedish) back to English. I chose Swedish because it was at the top of their list.





English original: I took my daughter for a walk in the park.
Swedish translation: Jag tog den min dottern för en gå i parkera.
English back translation: I took that my daughter for a to go in parking.


Admittedly, this is not a fair test, because the back translation tends to magnify the errors in the first translation. The translation to Swedish is probably only half as bad as the back translation to English, but half of horrible is not good enough.

What good is machine translation?
We've already described the most popular use of MT -- to allow end users to selectively and instantly understand the gist of an otherwise unintelligible document, especially on the Web, where the content and the tool are both simultaneously available. But this is largely an underground activity. It can't be controlled by the publisher of a website and it doesn't appear on any ledger as a cost savings.

MT could save some serious money for a web publisher in one discrete area: the translation of a corporate knowledge base. KBs cost a lot to create and maintain in English, and are rarely translated to other languages. They use a constrained terminology and are written by a small group of expert authors who are trained to write in a precise and clear way. It wouldn't take much extra effort to condition their writing to be translated by a machine translation program armed with the corporate terminology.

The resulting translated KB should carry an armor-plated disclaimer to the effect that the translation is not guaranteed to be any good at all, and should allow the user to read the original English article. Why not give it a try? Many of the non-American readers of KBs are technically trained professionals who are nominally bilingual of necessity, but would find it much easier to scan several articles in their native language before finding and studying the one that matters.

That's all for now. I have to take my daughter for a to go in parking.

Saturday, January 01, 2005

Welcome to Knowledge Bits

I'm Robert Hopkins, the president of Weblations. Since early 1996, when I founded Weblations in Barcelona, my colleagues and I have been translating and localizing websites -- some 600 projects in all. We've learned a lot over the years, and developed some opinions as well. In this blog, we'll share some of the knowledge bits that we've picked up over the years. If you are running your own translation project, we hope that they make your life easier. If you're looking for an agency to outsource your project to, this blog will give you an insider's look at the issues we face on a daily basis, and hence the kind of skills you should look for.

To translate a website, you and your team need to know about culture and languages, copywriting, the technologies used by the website that you are translating, graphic design and the tools that you use to do your work. In our case, when we started the company, there were no adequate tools to assist a professional, high-volume website translation team, so we decided to develop our own. Fortunately, the Web wasn't very complicated back then, and we were able to put something together in relatively short order. Since then, as the Web has grown richer, we have enriched our two core applications, Weblations Cypher and the Weblations Workspace. You'll hear a lot about them in this blog, I bet, since developing them takes up a big part of my working day.

I didn't really start this blog on New Year's Day, but I will pretend I did to give me some breathing room and to remind me that this is a New Year's resolution to be honored.

Enough of introductions...on to the blog!