Howdy,
I’ll start my first post to the dev blog with a little something on UTF-8 in PunBB 1.3. For those of you who are not aware of what UTF-8 is, here’s the short story.
UTF-8, short for 8-bit Unicode Transformation Format, is a variable-length character encoding for Unicode. What this means is that one character can be represented by one, two, three or four bytes depending on what type of character it is. The first 128 characters (code points 0 through 127) are encoded “as themselves” and are identical to the characters of the character encoding US-ASCII. What this means is that for regular ASCII encoded text (essentially all text written in English), there’s no difference. Show someone a piece of text that includes only those first 128 characters and they won’t be able to tell you if it’s ASCII or UTF-8. However, when we need to use other characters, for example characters with accent marks or characters from the Cyrillic alphabet, UTF-8 uses more than one byte. Although UTF-8 can use up to four bytes per character, four-byte sequences are relatively uncommon. Pretty much every character you’ve ever seen can be encoded using one to three bytes.
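To make the variable-length part concrete, here is a small Python sketch (not PunBB code; the sample characters and the helper name `utf8_length` are just illustrations) that reports how many bytes a character needs in UTF-8 and checks that plain ASCII text is unchanged.

```python
# Sample characters: ASCII, accented Latin, Cyrillic, a currency sign, an emoji.
samples = ["A", "Å", "Ж", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode ch as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
text = "Howdy"
ascii_and_utf8_match = text.encode("ascii") == text.encode("utf-8")
print(ascii_and_utf8_match)
```

Only the emoji in this list needs four bytes, which fits the point that most everyday text stays within one to three bytes per character.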
So, what’s the point of using UTF-8 then? Well, in order for a computer program, for example a browser, to be able to correctly display characters, it needs to know how the text is encoded. We tell the browser how the text is encoded via an HTTP header or via the outdated use of a META tag. So, if we wanted to display Swedish text on a page, we’d tell the browser that the text is encoded using ISO-8859-1. This works just great, but what if we wanted to display Russian, Chinese and Swedish on the same page? In that case, we’d have to encode the Russian and Chinese characters using HTML numeric character references (not good) or we could switch to using a Unicode encoding, e.g. UTF-8. Problem solved. Once we’ve switched to UTF-8, we need not worry about character encodings ever again. A bold statement, but I believe it to be true.
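As a rough illustration of why numeric character references are “not good”, the following Python sketch (the helper name `to_numeric_references` is made up for this example) rewrites every non-ASCII character as a reference and compares the size with plain UTF-8.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with an HTML numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


cyrillic = "Ж"
as_reference = to_numeric_references(cyrillic)
as_utf8 = cyrillic.encode("utf-8")

print(as_reference)                     # the reference form of the character
print(len(as_reference), len(as_utf8))  # size of the reference vs. size in UTF-8
print(to_numeric_references("Åsnan"))   # ASCII letters pass through untouched
```

The reference form is several times larger than the UTF-8 bytes and is unreadable in the page source and in the database, which is the main practical argument for switching encodings instead.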
After that rather lengthy intro, I thought I’d talk about how this relates to PunBB and some thoughts on how we should migrate PunBB to UTF-8. Up until now, PunBB has relied on non-Unicode character encodings to display for example Swedish, Russian and Chinese. This causes all kinds of problems and it’s something we definitely want to move away from in PunBB 1.3. Moving to UTF-8 is not trivial though. There are many things to consider and lots of caveats along the way.
The first step is to change the encoding. To be on the safe side, we’ll instruct the browser that the content is UTF-8 both by setting the META Content-Type tag and sending an HTTP header. What this means is that the browser will interpret the page as UTF-8 and consequently, any text sent back to the server will be UTF-8 encoded by the browser. This will make sure that any new content will be UTF-8. It won’t, however, automagically convert existing content to UTF-8 (people upgrading from 1.2). For these people, non-ASCII characters that are already in the database will display as question marks.
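The “belt and braces” approach described above can be sketched as follows. This is Python rather than PunBB’s PHP, and the function name `build_response` and the page content are assumptions made for illustration; the point is only that the charset is declared twice, once in the HTTP header and once in the META tag, and that the body is actually encoded as UTF-8.

```python
import html


def build_response(body_text: str) -> bytes:
    """Build a minimal HTTP response that declares UTF-8 in header and META tag."""
    page = (
        "<!DOCTYPE html><html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "<title>Demo</title></head>"
        f"<body><p>{html.escape(body_text)}</p></body></html>"
    )
    payload = page.encode("utf-8")  # the bytes must match the declared charset
    headers = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + payload


response = build_response("Åsnan Ж")
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("ascii"))
```

Note that Content-Length counts bytes, not characters, so it has to be computed after encoding.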
In order to deal with this, we have to somehow convert all the content we have stored in the database from the character encoding it’s using right now to UTF-8. In most cases, the current encoding will be ISO-8859-1, but there are over 40 language packs for PunBB and some of them use other encodings (for example WINDOWS-1253 for Greek). I’ve been thinking about this these last few days and unfortunately, I don’t think there’s an easy way to do this. The simplest way is to take a dump of the database, run “iconv -f WINDOWS-1253 -t UTF-8 file1.sql > file2.sql” and then re-import the dump. For ISO-8859-1 users, we could provide a script that grabs each and every text and varchar field from the database, runs PHP’s utf8_encode() on it (which converts ISO-8859-1 to UTF-8) and then writes it back to the database. This might take a while on a big database though. I’m all ears for any suggestions you people might have regarding this. Just leave a comment.
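What the iconv command does to the dump can be sketched in a few lines of Python: decode the bytes using the old encoding, then encode them again as UTF-8. The helper name `convert_to_utf8` and the sample strings are assumptions for illustration; `cp1253` is Python’s codec name for WINDOWS-1253.

```python
def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Simulate content as it would be stored by a Greek and a Swedish install.
greek_legacy = "Καλημέρα".encode("cp1253")
swedish_legacy = "Åsnan".encode("iso-8859-1")

greek_utf8 = convert_to_utf8(greek_legacy, "cp1253")
swedish_utf8 = convert_to_utf8(swedish_legacy, "iso-8859-1")

print(len(greek_legacy), len(greek_utf8))
print(len(swedish_legacy), len(swedish_utf8))
```

The conversion only gives sensible results if the source encoding is named correctly, and running it twice on the same data would garble it, so a real conversion script would need to track which rows have already been processed.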
Provided we solve the problem with converting current content to UTF-8, there are still a few things to consider. Most web applications I’ve encountered (this blog software included) store UTF-8 text in the database without the database being aware that the text is UTF-8. In the case of this blog, the database thinks the text is ISO-8859-1 (or Latin-1 as the MySQL folks like to call it) because that’s the default character encoding in this particular MySQL installation. This isn’t a problem if we only store English text in the database. It does turn into a problem if we decide to store for example Swedish. Even worse if we start posting in Russian. The reason this is a problem is that for Russian, each character is represented using two bytes as opposed to one. If we post the Cyrillic character Ж in this blog, the database will believe we posted two separate ISO-8859-1 characters. Now, what if we want to have our database return the first 20 characters in a post that is written in Russian? Since it believes the text is encoded using ISO-8859-1, it will just fetch the first 20 bytes when in fact it should have fetched the first 40 bytes (Cyrillic characters are two bytes in UTF-8). Another example is if someone registered in the forums with the username Åsnan (Swedish for “the donkey”). The letter Å will be written to the database as two bytes. Now, if we go to the user list, which by default sorts the list by username, we’ll notice that Åsnan won’t be positioned where he should be (between the letters Z and Ä in the Swedish alphabet). The problem is that since the database thinks the text is ISO-8859-1, it sorts it like ISO-8859-1. This isn’t the whole truth as there is something called collations to take into account, but you get the point.
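Both symptoms can be reproduced outside a database. The Python sketch below is only a simulation under assumed sample data: it misreads UTF-8 bytes as ISO-8859-1 the way an unaware database would, compares a 20-byte prefix with a 20-character prefix, and sorts three made-up usernames by their raw bytes.

```python
# A post made up purely of Cyrillic letters (two bytes each in UTF-8).
post = "Привет" * 5
utf8_bytes = post.encode("utf-8")

# What a database that assumes ISO-8859-1 sees: one "character" per byte.
misread = utf8_bytes.decode("iso-8859-1")
print(len(post), len(utf8_bytes), len(misread))

# "First 20 characters" taken as bytes vs. taken as real characters.
byte_prefix = utf8_bytes[:20].decode("utf-8")
char_prefix = post[:20]
print(len(byte_prefix), len(char_prefix))

# Sorting by raw bytes, with no Swedish collation applied.
names = ["Åsnan", "Zebra", "Ärlig"]
byte_order = sorted(names, key=lambda name: name.encode("utf-8"))
print(byte_order)
```

The byte-wise sort puts Åsnan after Ärlig, whereas the Swedish alphabet orders Å before Ä, which is where collations come in.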
The ideal would be to instruct the database that the content we’re sending it is UTF-8. In some cases, this is doable. In others, it is not. For example, in MySQL, there was no UTF-8 support prior to version 4.1. PostgreSQL has had it for some time, but it didn’t work properly until 8.1 and there’s no simple way (for us) to convert a current ISO-8859-1 database into UTF-8. SQLite got its UTF-8 support in version 3 which PunBB doesn’t even support. In other words, it’s a mess. It is likely that if we decide to support UTF-8 on a database level, we will only do so for installs running MySQL 4.1 or later. It should be noted that even with MySQL 4.1 installs, there’s a whole song and dance thing we have to go through in order to do charset conversion in MySQL.
What it breaks down to is this. PunBB 1.3 will support UTF-8 regardless of which database backend it’s running on, but it will support it 100% on setups running on MySQL 4.1 or later. This is still better than most web apps out there.
Cheers,
Rickard
Interesting post :)
February 12th, 2007, at 1:17 pm #But isn’t the UTF-8 problem also in having to use the mbstring functions? It isn’t an extension that’s installed standard into PHP, and would be yet another required prerequisite for PunBB.
Well, that isn’t so much of a problem because we can replicate the functions we need in PHP (for example strlen). We will not rely on the mbstring extension.
February 12th, 2007, at 1:45 pm #[...] We just started up the PunBB DevBlog. Check it out if you’re interested in the development of PunBB. To get thing started, Connor Dunn posted a bit on the new post moderation queue feature in PunBB 1.3. I joined in and talked about making the switch to UTF-8. [...]
February 12th, 2007, at 2:07 pm #Nice to see this moving forward. It’s a little bit of future proofing, and PHP6 should provide most of the UTF8 functions that you may need to emulate for now.
February 12th, 2007, at 3:41 pm #I was hoping to hear about this when I saw that you made a blog. Thanks for posting!
February 12th, 2007, at 6:32 pm #First, a comment on the blog. The comments fields icons are a little hard to understand, and quite hard to distinguish on a big screen. A text version would be nice.
Then, about the utf-8 thingie. First of all, it’s a *GREAT* thing. PunBB 1.2.x could work with utf-8 just fine (with very, very minor glitches), but an official and default support is great.
One thing to remember… utf-8 is all you describe, but it also has several advantages not covered in your blog post:
1. Even on a mono-language website, one might have to write “strange” characters. For example, in English, French, Farsi, Italian or German one might want to quote the musician Antonín Dvorak. Well, his name is not Dvorak, it’s Dvořák. If it was your name that was misspelled, would you be OK with that?
2. In some languages (French is one of them), *no* legacy charset can handle all the glyphs the language needs. Latin1 is obsolete and was never able to; its successor Latin9 is a little more up to date but still cannot write every character needed in French. Granted, those are characters most French people don’t use (because they don’t know how to write correctly, thanks to the typewriter and dumb old computer systems), but still, computers should help us, not hinder us.
3. utf-8 is good overall for developers, webmasters, and so on, because most people handle several pieces of software, and having one that doesn’t support utf-8 is a pain in the ass. Try to build a website based on a Latin1 CMS, a utf-8 Wiki, and a Latin9 forum… you’ll see what I mean. PunBB is an essential piece of the FLOSS web software universe; having it switch to Unicode (and join the ranks of Drupal, Textpattern, Mediawiki, and the others) is great news.
About the code side of things. In my humble opinion, the first thing to do is to have utf-8 supported, and as a default. Conversion can be done later (well, before the final release would be nice, but still).
MySQL >= 4.1 is not needed for practical utf-8 support. For example, both Textpattern and Mediawiki have handled utf-8 only for several years, on MySQL 3.x databases (I don’t know SQLite and PostgreSQL at all). Yes, the strings recorded were gibberish in SQL, but it worked fine on the web side, and without a huge code base to support it. Still, having 4.1 or newer as a requirement is not too much to ask nowadays; 4.1 has been out for years now, and web hosting companies should have it (good ones do, even cheap good ones).
For the conversion, I would call on the community. Maybe two (or one multi-OS) little tools (Windows and Unix—now that MacOS is a Unix-like) can handle the job for people without good shared hosting, or dedicated hosting, or technical skills. It will allow more control, more freedom, and it could be written and supported outside the PunBB codebase.
February 14th, 2007, at 8:40 am #One thing though. With full utf-8 support, some username sanitization may be needed to avoid user-phishing.
Or maybe just some hooks at username registration for 1.3.0, see what kind of extensions are done in that area, and then draw from them for 1.3.1?
February 15th, 2007, at 9:41 am #Excellent! I find it rather ridiculous that PHP and MySQL have taken such an extremely long time getting on the Unicode bandwagon. It was inevitable, but still they hesitated for years and years and even more years. And now when the need for UTF-8 is so screaming and gleaming it almost hurts, the problem of migrating to it lies with all developers using PHP and MySQL, like you PunBB developers. I don’t envy you! But I think you’re doing what’s right and a terrific job at it too.
I don’t have any genius idea on how you should handle this, but what you could do is create the SQL conversion script from the database (should go pretty quick) so that the user can execute this script in his MySQL admin interface (phpMyAdmin, command line or whatever). Executing this script might take a while, but at least you’d leave the option to the user on how he wants to execute it. You may of course offer to execute it for him, but you should probably warn that it is going to take a while. Remember to set a long timeout on the PHP script that’s going to do this! :-)
The actual conversion process can be done by exporting all of the data to an SQL script, dropping all of the database tables, running iconv on the export to get it all to UTF-8, then importing the converted file to create all tables and columns (in UTF-8 if possible) and inserting all the data.
February 21st, 2007, at 11:19 am #Jérémie: You’re probably right regarding usernames. There will be a hook somewhere in the registering process and we’ll let the community come up with a plan for dealing with that.
Asbjørn: I’m currently working on an extension of the db_update script that will process the database and do the conversion. It will work if the current encoding is iso-8859-1, or if the PHP extension iconv is available (PHP5 and later), or if the PHP extension mbstring is available. If none of these is true, we’ll instruct the user to dump the database and run iconv on it manually. The conversion script will work like the search re-indexing script (i.e. process X entries, redirect, process X more etc).
February 21st, 2007, at 12:01 pm #How about providing the iconv conversion through punbb.org somehow? You could even do it automatically and remotely by calling a PHP script on punbb.org from the db_update script. If you do it in a similar way to how the conversion script works, by splitting up the whole task into several smaller tasks, you don’t have to think about timeouts either. To protect against DoS attacks on this punbb.org-hosted iconv script, you should probably do some kind of authentication and implement some repeat-attack protection. Nothing fancy, but it’s better to be on the safe side. :)
February 21st, 2007, at 11:20 pm #I’m not sure that’s such a good idea. If anything, we’ll link to iconv binaries for various platforms with very clear instructions on how to use it.
February 22nd, 2007, at 8:31 am #Another option is to use the following pure PHP conversion library if iconv isn’t available:
http://mikolajj.republika.pl/
February 22nd, 2007, at 10:10 am #Yeah, maybe. I have never heard of this particular library though. Not sure how well it works.
February 22nd, 2007, at 10:47 am #I’ve heard good things about it. Worth a shot, perhaps?
February 23rd, 2007, at 7:45 am #[...] Switching to UTF-8 [...]
September 6th, 2007, at 3:09 am #The truth is, you aren’t as committed to UTF-8 and internationalization as you make it seem in this blog post. Especially you Jérémie.
March 5th, 2008, at 3:53 am #Reality: Please inform us how we can commit more to UTF-8 and internationalization. I’m dying to know.
March 5th, 2008, at 12:34 pm #