The Umlaut Problem - How To Successfully Back Up And Restore MySQL Databases With Special Characters Using MySQLDumper

Introduction

This article was written by Daniel Schlichtholz, the developer of MySQLDumper - a free and useful script for backing up and restoring databases. The original language was German and I have translated it to English, because there are people using umlauts and other special characters in other languages who would benefit from reading this article. The original article, in German, can be found here.

The Umlaut-Problem:

There are lots of forums with heaps of reports of umlauts or other special characters being displayed erroneously. Attempts to help come from many sides, but hardly anyone seems to have a really complete picture of the problem. Even web hosts' support hotlines seem to reach their limit of knowledge with this one. There are incredible amounts of half-truths. They prove correct in certain circumstances, but they do not help all people. Simply because setups vary so much.

In my role as developer of MySQLDumper I naturally spent a lot of time evaluating this situation. And I now believe to be able to offer a complete overview which covers the different aspects of this problem. A lot, and I mean a lot of research has gone into the explanations in this article.

Be warned though: The matter is not simple and cannot be explained in 2 minutes!

There's a reason why the cries for help in many forums are piling up while hardly anyone can really help. If you're looking for a simple explanation ("Where do I click to make it work?"), don't bother reading any further. This article is long and requires receptivity. What you get in return is, understanding the problem thoroughly.

Finding proper information has been a very difficult task. And so it took some time before I myself finally understood the matter. I want to save you this odyssey - anyone that has dealt with this subject, knows what I'm talking about.
It goes without saying, that I am not the master of all wisdom - the aspects that I will explain here are largely based on experiences I gained by examining all kinds of different servers of many MySQLDumper users (Thank you you for your trust). If you know (of) an aspect that is not mentioned here, I kindly ask you to let me know.

Theory - Why does the umlaut problem occur?

Put simply, it is caused by different MySQL-Server-versions using differing standard character sets.

In the past, when a script (any script) sent or received data to or from a MySQL-Server, that data was always encoded in latin1 (at least it was in Germany). A database-backup-program (e.g. MySQLDumper) receives that data and saves it as an SQL-command in a text file.

So far so good.

But: a text file that was written on a server by a script does not have a usable coding. What this means is, that a script has no way of knowing what character set the data that it puts in the file is encoded in. There are many different character-sets, and so the exact same data can be transferred in many different ways, depending on the character-set that it's encoded in.

Text editors can mark utf8-encoded text files as such by inserting a special identifier at the beginning of the file - a so called BOM (Byte Order Mark - see the wikipedia article on BOM and the official Unicode explanation).

A web server or the PHP interpreter can't read that mark and so it passes the data on to the web browser unchanged - which of course is not what you want it to do. So this way of identifying the character-set falls through and an "unmarked" text file is created. So it is unknown which character-set the file is encoded with. And that is just what the whole problem is all about.

MySQL version 3.x saved and served data in latin1 (because, at least in Germany, usually that is how it was installed). There was nothing a script or user could do about it - the output was always latin1. So a backup file from a MySQL-Server version 3.x always contains data encoded in latin1. Equally it always expects data it receives to be encoded in latin1. During a database restore process of a backup file MySQLDumper reads a file line by line. It recognizes the beginning and end of a MySQL-command, extracts it and sends it to the MySQL-Server unchanged. So in this scenario everything fits well.
The backup file is latin1 and MySQL expects latin1 -> working! In this scenario there are no umlaut problems.

So, if everything worked so wonderfully, why did they start changing things?

There are a lot of language-specific characters which cannot be displayed using a 1-byte-character-set. German umlauts öüä are not found in other character-sets, so that a user (for example) from Japan is going to have trouble with a German website. This may result in umlauts being displayed incorrectly.

This leads to a lot of programming related problems. Just think of multi-lingual programs or multi-lingual websites. Each language variation of the website would have to output all text in the language's respective character-set. Different headers would have to be sent, so that the browser even realizes what language it's supposed to be displaying. In addition the relevant character-set would have to be installed in the browser already. The list of problems associated with that would be long and would present the developers with many complicated problems.

So it would be desirable to be able to output all characters of all languages with only one character-set, to avoid all those problems. That is why the Unicode character-set was developed: See here for the Wikipedia article on Unicode

However, you need more memory-space in order to display the immense amount of possible characters in only one character-set, of course. In order to compensate for the extra memory-load, different storage and display formats were developed. The best known of these is UTF8. UTF stands for "Unicode Transformation Format" and is capable of reducing the necessary memory-space, in letting only some characters require more than one byte to be displayed. Many common characters still only require 1 Byte.

Now we're getting to the practical part!

Since version 4.0 MySQL uses utf8 to store data internally. This, however, does not mean that developers must only create programs that encode data in utf8. Because MySQL can deliver data in any format you want.

All we need to do is tell the server, what format we want it to serve the data in, and of course what format the data we send it is encoded in! To achieve that, programs can send the command SET NAMES latin1 after connecting to the MySQL-Server. Thus the program and the MySQL-Server agree on exchanging data in latin1. When a program sends data to the MySQL-Server MySQL receives the data as latin1, converts it internally to utf8 and stores it correctly. When the script requires data, MySQL loads the data and converts it to latin1 before serving it.

So the communication works perfectly when the two agree on a character-set, because they quite literally speak the same language.

If the program that is connecting to the MySQL-Server does not say what character-set the server's supposed to use, the default character-set is used (which is declared in the MySQL-System-Variable character_set_connection).
And this is exactly where things go wrong, if the data is sent in a format other than the one that the program expects it to.
This is a two-way street:
If the program assumes that it's receiving latin1-encoded data but the server's default character-set is set to utf8, it obviously doesn't quite fit. The same applies when the script sends data to the MySQL-Server.

Most scripts don't (yet) account for this situation of having to declare the character set (including MySQLDumper in the versions up to and including 1.21b6).
And this is where things go wrong, because with some servers the standard character-set is latin1. With others it's utf8 and with others again it's something completely different. Let's look at the practice using MySQLDumper Version 1.21b6:

Scenario 1:

- Transfer from MySQL 3.x to 4.x, with the newer one using latin1 as the standard character-set. - The backup of the MySQL 3.x database is encoded latin1.
- During the database restoration process Dumper 1.21 (unknowingly) sends the data to the MySQL-Server encoded in latin1.
- MySQL receives the data and expects it to be latin1 because nothing else was agreed upon. It then converts the data to utf8 and stores it correctly because, coincidentally the standard character-set matches the character-set of the data it received from Dumper 1.21.
- All is ok. Umlauts and other special characters are working.

Scenario 2:

- Transfer from MySQL 3.x to 4.x, with the newer one using utf8 as the standard character-set.
- The backup of the MySQL 3.x database is encoded latin1.
- During the database restoration process Dumper 1.21 (unknowingly) sends the data to the MySQL-Server encoded in latin1
- MySQL receives the data and thinks "Hey, cool, no special character-set was declared, so I'm getting utf8 encoded data. I don't need to convert it, so I can just store it in the database right away."
- Now MySQL is saving latin1-encoded data as utf8!
- Seeing as the special chars which are encoded latin1 do not have a counterpart in utf8 if they are not converted correctly, they are displayed as question-marks or something equally annoying.
- So it goes wrong here, and special chars are displayed as question marks or a rectangle or whatever.

Scenario 3*:

- Transfer from MySQL >=4.1 to >=4.1, with the old server using utf8 as the standard character-set, and the new one using latin1 as standard character-set.
- So the backup file from the old server is encoded utf8, because when dumping the backup file, no other format was specified, and the old MySQL-Server sent the data in its standard character-set (utf8).
- During the database restoration process Dumper 1.21 (unknowingly) sends the data to the MySQL-Server encoded in utf8.
- MySQL receives and saves the utf8 encoded data as latin1 encoded data.
- Now utf8 requires multiple bytes for special characters, but those same characters are evaluated as 1-byte characters in latin1.
As a result, one special character gets transformed into 2 (or more) letters/characters when displayed.
- So when retrieving the data from the new database and displaying it, umlauts might look like this: Ã¤Ã¶Ã¼ (this equals äöü). The Ä is actually the code which precedes some special characters in utf8, but is not interpreted that way, because the MySQL-Server assumes the data to be encoded latin1. And again it goes wrong here.

*(Edit: for cases such as scenario 3 I have developed a correction program, see: http://www.mysqldumper.de/board/viewtopic.php?p=19187#19187)