The Umlaut Problem - How To Successfully Back Up And Restore MySQL Databases With Special Characters Using MySQLDumper - Page 2

Practical help

The attentive reader will have noticed that the restore process can only be completed successfully if the backup file uses the same character-set that the new server expects it to. But how do I know what a MySQL-Server's standard character-set is? Thankfully, MySQLDumper gives us comprehensive information! If you check Home / MySQL-Variables / Variables you'll find an array of MySQL-settings:
e.g.:

Of these settings the two that are interesting for us are character_set_client and character_set_connection which are usually identical.

These settings tell us what character-set is expected of the "client" when nothing else is declared. The "client" is in this case the program/script which connects to the MySQL-Server.
We can safely ignore the other settings - no matter what they are.

MySQL converts all data to the proper character-set if that's necessary. We don't need to worry about what standard character-set the database has or how MySQL saves the data internally (character_set_system). What MySQL goes by, when receiving or serving data, is the character-set that was agreed upon connection (or the server's standard character-set). It then converts the data internally if it needs to. So we needn't worry about anything else. If character_set_connection is set to utf8, then we get utf8 encoded data and we have to send utf8 encoded data.
That's it. Done.

Let me take this opportunity to set another thing straight: the collation setting does not have any effect at all when sending data to the server. It only has a bearing when receiving data from the server. When a query is sorted using the "ORDER BY" - clause, the collation setting controls what criteria are applied. Check out the official MySQL documentation for more information. Was this too theoretical? Okay, here's an example:

Let's assume we have a table with 3 rows (or data sets)

ID Name (this is not a row, just the name of the columns)
1 Zapp, Frank
2 Zäppel, Ernst
3 Zander, Bernd

and we output these columns using SELECT * FROM `table` ORDER BY `Name`

Collation latin1_german1_ci (dictionary sorting - umlauts are sorted alphabetically) returns:

3 Zander, Bernd
1 Zappa, Frank
2 Zäppel, Ernst

Collation latin1_german2_ci (phonebook sorting - umlauts first) return:

2 Zäppel, Ernst
3 Zander, Bernd
1 Zappa, Frank

So you see you can control the country-specific collation in order to adjust sorting of data sets.
All of that has nothing to do with the character-sets or the umlaut-problem. So if you read a message in a forum telling you to adjust the collation when having umlaut problems, you can ignore that. It's a matter of the software you're using and you needn't concern yourself with that. But let's get back to topic.

So what do I do if MySQLDumper tells me that the character-sets of the old and new server don't match? You already have an important advantage: You know what character-set the new MySQL-Server expects!
So all you need to do is send it the data using that character-set. In order to do this, you need a text editor, that can save text files in different formats (without using BOM - see above). My favorite editor for doing this is TextPad (non-free): (Translator's comment: There are free alternatives to TextPad, such as NotePad++.)

So, what's to be done?
Download the backup file you have just created with MySQLDumper. If the backup file is compressed using GZ compression (file-extension is *.sql.gz), unpack it. Make sure you use a program that can unpack the data correctly. You may run into problems if you use some older versions of WinRar for example.

Open the file in TextPad (or NotePad++) and click File / Save As. In the lower part of the menu you can change the character-set by changing the Encoding setting. Default equals latin1. Having saved the file in the correct character-set, you can (for example) use 7Zip (free) to gzip it, upload it to the server and then use MySQLDumper to feed the data into the (new) database.

And that's it.

So what do we do in order to improve MySQLDumper in regards to this?
Well, I've been thinking about this long and hard. There's another database management tool, that requires the user to manually select the character-set / encoding. This was not a solution that I would consider, because in 99% of cases the user wouldn't know what to choose or what the consequences of a wrong choice may be. (Well, maybe they do know now after reading this article. *g*)

MySQLDumper has gained a reputation as being easy and safe to use and I want it to stay that way.
Starting with version 1.21b12 I'm beginning to use utf8 as the standard character-set / encoding. This setting works with MySQL-Servers version 4.1 and newer if the backup was made using MySQLDumper as well. If the backup was done using another program and was not saved in the utf8-format, the restore process won't work out because of the reasons explained above.
Version 1.21b13 leaves it up to the user to explicitly set the character-set / encoding to utf8. But again, it all depends on what character-set / encoding the MySQL-Server expects and what character-set / encoding the backup file uses. If the standard character-set is utf8 anyway, then it doesn't matter whether the box is ticked or not.

So in this case MySQLDumper does not offer a perfect and always-working solution. And that's something we want to change of course.