Tuesday, January 11, 2005

BOM -- Huh?

Every now and then, we wake up to an ugly technical surprise, usually just before we're supposed to deliver a project to a client. One of the more memorable ones, and one that has had a lot of repercussions, was the day we met our first BOM.

What's a BOM?
I'm embarrassed to admit it now, but there was a time when I didn't know what a BOM was. Had I Googled on it, I would have seen it just a few entries below the Australian Bureau of Meteorology, but Google isn't much help when you don't know the name of what you're looking for. (Aside: why did the intelligent scientists at this bureau incorrectly elevate the status of "of" to the "O" in BOM? Is it that they don't want to be known as the "BM?")

According to the BOM FAQ on the Unicode home page, BOM is short for byte order mark, a three-byte sequence (EF BB BF, the UTF-8 encoding of the single character U+FEFF) that you might find at the beginning of a UTF-8 file. (UTF-8 is a compact but universal character encoding that we use a lot for multilingual websites. This blog is published in UTF-8. More about that in another posting.) Ironically, UTF-8 comes in only one byte order, so a BOM doesn't matter in the way it does with full-bore Unicode encodings such as UTF-16.

The problem with the BOM is that some text editors interpret and hide the characters. You don't know that they're there. Here's what they look like:

ï»¿

These babies mean "You are about to see some UTF-8!" Are they necessary? Usually not, because a well-formed web page should make that announcement anyway with a META tag. Are they dangerous? You bet!
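For the curious, here is a minimal Python sketch of those three bytes and of why they show up as odd characters in an editor that reads them as Windows-1252. Python is our choice for illustration only, and the helper name has_utf8_bom is hypothetical.

```python
import codecs

# The UTF-8 BOM is the single character U+FEFF encoded as three bytes.
BOM_BYTES = codecs.BOM_UTF8  # b'\xef\xbb\xbf'


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(BOM_BYTES)


# Encoding U+FEFF as UTF-8 yields exactly those three bytes.
assert "\ufeff".encode("utf-8") == BOM_BYTES

# Misread as Windows-1252, the three bytes become three visible characters.
visible = BOM_BYTES.decode("cp1252")

# A file saved with a BOM starts with the marker, then the real content.
sample = BOM_BYTES + "hello".encode("utf-8")
```

Calling has_utf8_bom on the first few bytes of a file is a cheap check to run before a delivery.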

Notepad for Windows XP and the BOM
If you're like us and you want to review and approve every single byte that you deliver to a customer, an unexpected BOM can play havoc with your plans, with your own and your tools' idea of how long the file is, and so on.

We learned about the BOM at the same time that we were upgrading to Windows XP from Win98. One of the many gotchas was seeing our humble Win98 Notepad replaced by Notepad for Windows XP, a much more take-charge program. The old Notepad shows you the BOM characters, while the new one does not! With the new Notepad, if you save a file and select UTF-8 as the encoding in the Save As dialog box, the BOM will be prepended to your file, an unnecessary but seemingly harmless little detail.
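To see the effect on file length without Notepad, here is a rough Python sketch. It assumes that Python's "utf-8-sig" codec is a fair stand-in for what Notepad does on save (it writes the same three-byte marker); the file names are made up for the example.

```python
import os
import tempfile

text = "hello"

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.txt")  # hypothetical file names
    bom_path = os.path.join(tmp, "bom.txt")

    # Plain UTF-8: only the five content bytes are written.
    with open(plain_path, "w", encoding="utf-8") as f:
        f.write(text)

    # "utf-8-sig" prepends the BOM on write.
    with open(bom_path, "w", encoding="utf-8-sig") as f:
        f.write(text)

    plain_size = os.path.getsize(plain_path)
    bom_size = os.path.getsize(bom_path)

    # Reading as plain UTF-8 keeps U+FEFF as the first character...
    with open(bom_path, encoding="utf-8") as f:
        raw_read = f.read()

    # ...while "utf-8-sig" silently drops it on read.
    with open(bom_path, encoding="utf-8-sig") as f:
        clean_read = f.read()
```

The two files look identical in an editor that hides the BOM, yet they differ by three bytes on disk.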

But is it really that harmless?

Let's look closely at the definition of a BOM:

BOM. The character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.

The operative term is "beginning." How can we be sure that a given text file that has been saved in Notepad will be the beginning of a data stream? It may be the footer of a web page, deployed with an include statement to the middle of another file. Now, unfortunately, our BOM has gone from the beginning of one file to the middle of another, and in that position it can do a lot of damage.
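The include scenario can be mimicked in a few lines of Python. This is only a sketch: the functions include and include_clean are hypothetical stand-ins for whatever include mechanism a site actually uses, and the HTML fragments are invented.

```python
BOM = "\ufeff"


def include(*parts: str) -> str:
    """Naively concatenate text fragments, like a simple include."""
    return "".join(parts)


def include_clean(*parts: str) -> str:
    """Concatenate fragments, dropping one leading BOM from each."""
    return "".join(p[1:] if p.startswith(BOM) else p for p in parts)


header = "<html><body>"
# A footer fragment as it might look after being saved with a BOM.
footer = BOM + "<p>footer</p></body></html>"

page = include(header, footer)
bom_position = page.index(BOM)  # the BOM now sits right after the header

clean_page = include_clean(header, footer)
```

Stripping the marker from each fragment before joining is one possible defense; saving the fragments without a BOM in the first place is another.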

Another ugly place for a BOM to land is in a database entry, where it can ruin your best-laid plans for running exact searches and concatenating strings.
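Here is a small sketch of the database problem, using Python's built-in sqlite3 module with an in-memory database. The table, the column, the sample name and the strip_bom helper are all invented for illustration; other databases may behave differently.

```python
import sqlite3

BOM = "\ufeff"


def strip_bom(text: str) -> str:
    """Remove a single leading U+FEFF, if present."""
    return text[1:] if text.startswith(BOM) else text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

# A value that picked up a BOM on its way into the database.
dirty = BOM + "Smith"
conn.execute("INSERT INTO people VALUES (?)", (dirty,))

query = "SELECT COUNT(*) FROM people WHERE name = ?"

# The exact search misses the row because of the invisible character.
exact_hits = conn.execute(query, ("Smith",)).fetchone()[0]

# Concatenation carries the BOM into the middle of the new string.
joined = "Dr. " + dirty

# Storing the cleaned value makes the exact search work again.
conn.execute("UPDATE people SET name = ?", (strip_bom(dirty),))
clean_hits = conn.execute(query, ("Smith",)).fetchone()[0]
conn.close()
```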

Forewarned is forearmed. Hope this helps.

1 Comment:

koray said...

This entry sure did clear up a lot of problems in my head. It is interesting how frustrating BOMs can get, especially if they find their way anywhere but the beginning of your web pages (from an import statement).

Thank you.

9:46 AM  
