Switching to UTF-8

Howdy,

I’ll start my first post to the dev blog with a little something on UTF-8 in PunBB 1.3. For those of you who are not aware of what UTF-8 is, here’s the short story.

UTF-8, short for 8-bit Unicode Transformation Format, is a variable-length character encoding for Unicode. What this means is that one character can be represented by one, two, three or four bytes depending on what type of character it is. The first 128 characters (code points 0 through 127) are encoded “as themselves” and are identical to the characters of the character encoding US-ASCII. In other words, for regular ASCII encoded text (essentially all text written in English), there’s no difference. Show someone a piece of text that includes only those first 128 characters and they won’t be able to tell you if it’s ASCII or UTF-8. However, when we need other characters, for example characters with accent marks or characters from the Cyrillic alphabet, UTF-8 uses more than one byte. Although UTF-8 can use up to four bytes per character, four-byte sequences are relatively uncommon. Pretty much every character you’ve ever seen can be encoded using one to three bytes.
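To make the variable-length part concrete, here’s a quick sketch in Python (not PunBB code, and the sample characters and the utf8_length helper are just my picks for illustration) that counts how many bytes UTF-8 needs for a handful of characters:

```python
# Sample characters: ASCII, Swedish, Cyrillic, Chinese and an emoji.
samples = ["A", "Å", "Ж", "中", "😀"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} {ch}: {utf8_length(ch)} byte(s)")

# Plain ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))
```

The emoji is the only one of these that needs four bytes, which fits the point that most text gets by with one to three.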

So, what’s the point of using UTF-8 then? Well, in order for a computer program, for example a browser, to be able to correctly display characters, it needs to know how the text is encoded. We tell the browser how the text is encoded via an HTTP header or via the outdated use of a META tag. So, if we wanted to display Swedish text on a page, we’d tell the browser that the text is encoded using ISO-8859-1. This works just great, but what if we wanted to display Russian, Chinese and Swedish on the same page? In that case, we’d have to encode the Russian and Chinese characters using HTML numeric character references (not good) or we could switch to using a Unicode encoding, e.g. UTF-8. Problem solved. Once we’ve switched to UTF-8, we need not worry about character encodings ever again. A bold statement, but I believe it to be true.
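Here’s a small Python sketch of that mixed-language problem. The sample string and the can_encode helper are made up for the example; the point is only that ISO-8859-1 has no room for Russian or Chinese, so the fallback is numeric character references, while UTF-8 takes everything as-is:

```python
# Swedish, Russian and Chinese in one string (sample text for illustration).
text = "Åsnan, Жук, 中文"


def can_encode(s: str, encoding: str) -> bool:
    """Return True if every character in s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(text, "iso-8859-1"))
print(can_encode(text, "utf-8"))

# The workaround: anything ISO-8859-1 lacks becomes a numeric character reference.
with_refs = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(with_refs)
```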

After that rather lengthy intro, I thought I’d talk about how this relates to PunBB and some thoughts on how we should migrate PunBB to UTF-8. Up until now, PunBB has relied on non-Unicode character encodings to display for example Swedish, Russian and Chinese. This causes all kinds of problems and it’s something we definitely want to move away from in PunBB 1.3. Moving to UTF-8 is not trivial though. There are many things to consider and lots of caveats along the way.

The first step is to change the encoding. To be on the safe side, we’ll instruct the browser that the content is UTF-8 both by setting the META Content-Type tag and sending an HTTP header. What this means is that the browser will interpret the page as UTF-8 and consequently, any text sent back to the server will be UTF-8 encoded by the browser. This will make sure that any new content will be UTF-8. It won’t, however, automagically convert existing content to UTF-8 (people upgrading from 1.2). For these people, non-ASCII characters that are already in the database will display as question marks.
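A rough Python illustration of why old content breaks (the stored value is a made-up example, not something pulled from a real 1.2 database): text that was saved as ISO-8859-1 is not valid UTF-8, so a decoder told to expect UTF-8 has to substitute a placeholder character for the bytes it can’t make sense of.

```python
# How a 1.2 forum running ISO-8859-1 would have stored the name "Åsnan".
legacy_bytes = "Åsnan".encode("iso-8859-1")

# What happens when those same bytes are interpreted as UTF-8.
# errors="replace" swaps undecodable bytes for U+FFFD, the replacement character.
shown = legacy_bytes.decode("utf-8", errors="replace")
print(shown)

# Pure ASCII content is unaffected by the switch.
print(b"donkey".decode("utf-8"))
```

Depending on the browser and font, that placeholder shows up as a question mark or a question mark inside a diamond.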

In order to deal with this, we have to somehow convert all the content we have stored in the database from the character encoding it’s using right now to UTF-8. In most cases, the current encoding will be ISO-8859-1, but there are over 40 language packs for PunBB and some of them use other encodings (for example WINDOWS-1253 for Greek). I’ve been thinking about this these last few days and unfortunately, I don’t think there’s an easy way to do this. The simplest way is to take a dump of the database, run “iconv -f WINDOWS-1253 -t UTF-8 file1.sql > file2.sql” and then re-import the dump. For ISO-8859-1 users, we could provide a script that grabs each and every text and varchar field from the database, runs PHP’s utf8_encode() (which converts ISO-8859-1 to UTF-8) on it and then writes it back to the database. This might take a while on a big database though. I’m all ears for any suggestions you people might have regarding this. Just leave a comment.
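To show the shape of such a conversion, here’s a minimal sketch in Python rather than PHP. It is not the actual upgrade script; to_utf8 and convert_rows are hypothetical helpers, and the rows are stand-ins for whatever a real script would read from the text and varchar columns.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a legacy encoding to UTF-8, like iconv -f X -t UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_rows(rows, source_encoding: str):
    """Convert every bytes value in each row tuple; leave other values alone."""
    for row in rows:
        yield tuple(
            to_utf8(value, source_encoding) if isinstance(value, bytes) else value
            for value in row
        )


# Stand-in for rows fetched from a Greek forum stored as WINDOWS-1253.
greek_rows = [(1, "Καλημέρα".encode("cp1253")), (2, b"plain ascii")]
converted = list(convert_rows(greek_rows, "cp1253"))
print(converted[0][1].decode("utf-8"))
```

One thing a real script would have to guard against is running twice: converting text that is already UTF-8 as if it were ISO-8859-1 doesn’t fail, it just quietly double-encodes it.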

Provided we solve the problem with converting current content to UTF-8, there are still a few things to consider. Most web applications I’ve encountered (this blog software included) store UTF-8 text in the database without the database being aware that the text is UTF-8. In the case of this blog, the database thinks the text is ISO-8859-1 (or Latin-1 as the MySQL folks like to call it) because that’s the default character encoding in this particular MySQL installation. This isn’t a problem if we only store English text in the database. It does turn into a problem if we decide to store for example Swedish. Even worse if we start posting in Russian. The reason this is a problem is that for Russian, each character is represented using two bytes as opposed to one. If we post the Cyrillic character Ж in this blog, the database will believe we posted two separate ISO-8859-1 characters. Now, what if we want to have our database return the first 20 characters in a post that is written in Russian? Since it believes the text is encoded using ISO-8859-1, it will just fetch the first 20 bytes when in fact, it should have fetched the first 40 bytes (Cyrillic characters are two bytes in UTF-8).

Another example is if someone registered in the forums with the username Åsnan (Swedish for The donkey). The letter Å will be written to the database as two bytes. Now, if we go to the user list which by default sorts the list by username, we’ll notice that Åsnan won’t be positioned where he should be (between the letters Z and Ä in the Swedish alphabet). The problem is that since the database thinks the text is ISO-8859-1, it sorts it like ISO-8859-1. This isn’t the whole truth as there is something called collations to take into account, but you get the point.
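The byte-versus-character mismatch is easy to reproduce outside a database. This Python sketch only simulates it (as_latin1 is a hypothetical helper that mimics a database reading UTF-8 bytes as ISO-8859-1, and the 30-letter post is invented for the example):

```python
def as_latin1(text: str) -> str:
    """Show UTF-8 bytes the way a database that assumes ISO-8859-1 sees them."""
    return text.encode("utf-8").decode("iso-8859-1")


# One Cyrillic letter looks like two characters to the database.
print(len("Ж"), len(as_latin1("Ж")))

# Asking for the first 20 "characters" really returns the first 20 bytes,
# which covers only 10 Cyrillic letters.
post = "Ж" * 30
first_20 = post.encode("utf-8")[:20].decode("utf-8")
print(len(first_20))

# Åsnan is five letters to us, but the database counts one extra.
print(len("Åsnan"), len(as_latin1("Åsnan")))
```

It also hints at the sorting problem: what the database compares is not Å but the two unrelated ISO-8859-1 characters that Å’s bytes happen to spell.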

The ideal would be to instruct the database that the content we’re sending it is UTF-8. In some cases, this is doable. In others, it is not. For example, in MySQL, there was no UTF-8 support prior to version 4.1. PostgreSQL has had it for some time, but it didn’t work properly until 8.1 and there’s no simple way (for us) to convert a current ISO-8859-1 database into UTF-8. SQLite got its UTF-8 support in version 3 which PunBB doesn’t even support. In other words, it’s a mess. It is likely that if we decide to support UTF-8 on a database level, we will only do so for installs running MySQL 4.1 or later. It should be noted that even with MySQL 4.1 installs, there’s a whole song and dance thing we have to go through in order to do charset conversion in MySQL.

What it boils down to is this. PunBB 1.3 will support UTF-8 regardless of which database backend it’s running on, but it will only support it 100% on setups running MySQL 4.1 or later. This is still better than most web apps out there.

Cheers,
Rickard