Friday, February 25, 2005

Sniff your web

Before you publish any web content in a fancy character set, be sure to sniff your web. No, it shouldn't smell. Sniff for your web server's HTTP response header.

The problem
If there is a clash between the HTTP header and the character set of the page you are publishing, most browsers will try to render the page according to the HTTP header. The result can be gibberish.



Here, for example, is a Japanese page, encoded in UTF-8, being delivered by a server that claims it is encoded in ISO-8859-1, the standard encoding for Western European languages. Not a pretty site!
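You can reproduce the same kind of gibberish in a few lines of Python. This is only a sketch: the sample text is made up (it is not the text of the page above), and Python's ISO-8859-1 codec is not identical to what every browser does with that label, so the exact garbage characters may differ.

```python
# Hypothetical sample text; not taken from the page shown above.
text = "日本語のページ"

# The author saves the page as UTF-8 ...
raw_bytes = text.encode("utf-8")

# ... but the reader decodes the bytes as ISO-8859-1, as the HTTP header asked.
garbled = raw_bytes.decode("iso-8859-1")

# Every UTF-8 byte becomes its own (wrong) character.
print(len(text), "characters became", len(garbled))
print(ascii(garbled))

# The bytes themselves are intact, so reading them as UTF-8 recovers the text.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == text)
```

The page was never damaged. It was only read with the wrong decoder.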


We were hoping to see something like this. What happened?

Background
When a browser renders a web page, it needs to know what character set the page is encoded in. There are three ways it can find out:
  1. Read the META tag in the header of the file,

  2. study the text on the page and guess, or

  3. read the web server's HTTP response header.
Before going any further, let's be clear: the file header mentioned in method 1 is the stuff between the <HEAD> and </HEAD> tags at the beginning of an HTML file. The HTTP response header of method 3 is the data that the server sends to your browser before it begins to send the HTML file. You can't see it without the aid of a web sniffer.

Method 3 is the least widely known and the most powerful: when the HTTP header names a charset, most browsers obey it even if the META tag says otherwise. It's the gotcha method.
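As a rough model of that pecking order, here is a small sketch. The function name and its parameters are my own invention, and real browsers apply more rules than this; it only captures the order described here: HTTP header first, META tag second, guessing last.

```python
from typing import Optional


def choose_charset(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    guessed_charset: Optional[str] = None,
) -> Optional[str]:
    """Return the first charset that is actually declared, in priority order."""
    for candidate in (http_charset, meta_charset, guessed_charset):
        if candidate:
            return candidate.strip().lower()
    # Nobody declared anything and the browser could not guess.
    return None


# The situation from the Japanese page: the header wins, the META tag loses.
print(choose_charset("ISO-8859-1", "utf-8"))

# With no charset in the header, the META tag gets its say.
print(choose_charset(None, "utf-8"))
```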

The <META> tag of the first method is the most popular, and with good reason: it allows you to specify (and change) your encoding from one file to the next. Here is the meta tag for UTF-8:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

I copied this tag from the header of the Japanese page above. Unfortunately, its server was singing a different tune.

Web sniffing
Here's where web sniffing comes in. Go to http://www.web-sniffer.net, enter the URL of the website you are sniffing, and look at the results...



The culprit is in the last line: "charset=ISO-8859-1". That's a direct conflict with the UTF-8 encoding that we actually intended.
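If you would rather sniff from a script, something like the following might do. It is a sketch, not a polished tool: the function names are mine, the META pattern is a loose regular expression rather than a real HTML parser, and sniff() needs a live URL, so only the offline helpers are exercised in the example at the bottom.

```python
import re
from email.message import Message
from typing import Optional
from urllib.request import urlopen

# Loose pattern for a charset declared inside a META tag (not a full HTML parser).
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE)


def header_charset(content_type: str) -> Optional[str]:
    """Charset named in a Content-Type header value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html: bytes) -> Optional[str]:
    """Charset named in a META tag near the top of the page, or None."""
    match = META_CHARSET.search(html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def charsets_clash(content_type: str, html: bytes) -> bool:
    """True only when header and page both name a charset and they differ."""
    from_server = header_charset(content_type)
    from_page = meta_charset(html)
    return bool(from_server and from_page and from_server != from_page)


def sniff(url: str) -> bool:
    """Fetch a live page and report whether its header and META tag disagree."""
    with urlopen(url) as response:
        return charsets_clash(response.headers.get("Content-Type", ""), response.read())


# Offline example using the META tag quoted earlier and the header found above.
page = b'<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"></head>'
print(charsets_clash("text/html; charset=ISO-8859-1", page))
print(charsets_clash("text/html", page))
```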

Solution
Should you run down the hall to tell your webmaster to change the HTTP header to your new encoding? No! To do so would render your pages correctly while breaking those of everyone else.

If your webmaster is extremely kind and patient (possibly true) and has a lot of extra time (definitely false), you could ask him or her to set a special encoding for your folders. But why bother? The easier and better practice is not to specify any charset at all in the HTTP header. It should say simply "text/html", so that the encoding of each page can vary according to the whim of its author.
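To make the idea concrete, here is a toy server built on Python's standard http.server module. It is only an illustration of the header, not a recommendation for a production setup; the page content, the handler name, and make_server are all invented for this sketch. The one line that matters is the Content-Type header, which carries no charset.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page: the META tag is the only place the encoding is declared.
PAGE = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html;charset=utf-8">'
    '</head><body>日本語</body></html>'
).encode("utf-8")


class NoCharsetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Plain "text/html": no charset parameter, so the page's META tag decides.
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        # Keep the sketch quiet; the default implementation logs to stderr.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the operating system pick a free port."""
    return HTTPServer(("127.0.0.1", port), NoCharsetHandler)
```

To try it, call make_server(8000).serve_forever() and sniff http://127.0.0.1:8000/ to see the bare "text/html" in the response header.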

15 Comments:

identy said...

Good job....

11:01 PM  