The Umlaut Problem - How To Successfully Back Up And Restore MySQL Databases With Special Characters Using MySQLDumper - Page 3

More information about MySQLDumper

Based on the described experiences and facts, I hope to achieve the following solution:

  • MySQLDumper (as well as you) needs to know, what character-set the backup file was saved as. Only then can it tell whether the backup is encoded in the same character-set as the new MySQL-Server expects it to be. It would then calibrate the connection character-set.
  • In order to achieve this, MySQLDumper will (in the future) write the character-set name as a comment in the backup file while creating the backup.- When restoring the backup file to a database, it will then check for the character-set that was used and will compare that to the character-set of the new MySQL-Server.
  • With MySQL 4.1 and newer it's possible to tell the server, what character-set the data which it's receiving is encoded in.
  • With MySQL versions prior to 4.1 that is not possible and so in such cases it's up to the user to save the backup file in the required character-set.
    And anyway, there's more than only that reason why I strongly recommend to use at least version 4.1. Anything older than 4.1 is too old and any web host offers you the possibility to switch to a newer MySQL server. (If yours does not, then you still should switch. Switch to another web host.)
  • This applies to backups made with other programs too - the user has to specify the character-set because it's impossible for the program to know.
  • I will attempt to implement some kind of automatic check for umlauts and character-set. Whether that is going to work out, I can't say yet.
  • PHP-modules such as mbstring, which would be able to recognize the character-set of a file, do exist. However, they don't always work reliably. And expecting a server to have certain modules in order for MySQLDumper to work, does not fit in our concept of MySQLDumper. So this solution can't be considered.

This way the Dumper would remain pretty much automatic (the user needn't make any setting changes with the Dumper knowing what to do already), and it would still be very flexible and compatible with backup files created with other scripts / programs.

I hope that this helps make things easier for the user. I will take on this problem and add these functions to MySQLDumper during my next holiday. At the moment the only solution is to manually adjust the character-set of the file.

 

One (second to) last thing

I have ignored MySQL-Version 4.0.x in this article. Utf8 was introduced in this version, but only rudimentarily and it was in an out and out test-phase. I have experienced an incredible amount of problems with MySQL 4.0.x - Servers. Some of them didn't behave in accordance with the documentation and caused me to grow quite a few gray hairs. The official website states:

Because version 4.0.* of MySQL Server are in such low demand we have decided to stop hosting binaries of these older versions.

And there doesn't seem to be any documentation for these versions any more. I however believe that these versions were simply still too buggy and those problems were solved in version 4.1 and later. So MySQL deliberately do not offer this version for download, because it just caused too many problems in regards to character-sets / encoding.

So, should you be on a server still running MySQL 4.0.x, pester your web host until they upgrade to at least version 4.1. Version 4.0.x carries a whole lot of problems, that in some cases cannot really be solved by the user when developing a script / program. This at least has been my experience.

 

Closing words

Hopefully now you have learned that, as a responsible admin you should know the character-set of your backups. Only then do you know for sure that your data can be restored properly if need be. Considering the amount of information needed to correctly judge the cause of a problem (What was the version of the source server? What format was used when saving the backup file?), it becomes obvious why so many attempts to help (without knowing that information) are doomed to fail. It's the same for me too. (Just for the sake of fun, check out the help-threads in some forums - virtually nobody asks what character-set the backup file is saved in. Everyone's trying to help, but they can't without that information. Everyone's trying around, but usually without getting anywhere.)

But if you have read all of this carefully you are now armed with the necessary background knowledge and tools to accomplish most any kind of server-change. Having spent hours, typing all of this to a point of aching fingers, I hope you know the facts now. Do something with that knowledge.

I wish you forever-successful backups!

Share this page:

2 Comment(s)

Add comment

Please register in our forum first to comment.

Comments

By: Daniel Schlichtholz (DSB)

Hi Stefan,

<<There's a wrong assumption right at the beginning:
<<"Since version 4.0 MySQL uses utf8 to store data internally."
<<This is not true. What's true is this:
<<"mysqldump from MySQL 4.1.2 or later uses utf8, and earlier versions use latin1."

Yes, there you got me. This information is wrong and I corrected my article on my homepage. Here I can't do this because I didn't post it here. Thank you for your correction.

<<There are more wrong assumptions in this article, and generally I find it more confusing than helpful.

Well, I wrote this article 2 years ago. At this time there was nearly no useful information about this topic available. I tried to point out that scipts now have to "talk to MySQL" in a different way and need to specify the charset they expect by sending "SET NAMES", right after the connection is made. Many, many programms didn't and still don't do this and therefor don't get the control about what encoding is delivered by MySQL and don't know which strings are placed on the html site. In many boards there are heaps of cries for help because so many users do have problems with their board or shop when they move from one host to another.

I tried to describe what is the reason for that and what one can do to get things sorted. From the very postive feedback I got, I don't think that my article is confusing.

 <<The mysqldump client program takes into consideration a whole bunch of character set-related settings

In theory yes, but in practise there were a lot of situations where even mysqldump itself was not able to restore its own backups - sometimes even on the same server where the backup was made - because e.x. vars like @SQL-MODE are set to null. I don't want to get here in detail because actually I don't have an case with which I could proof this. But I can tell you, that I had heaps of help calls where this was the case. I often could fix this by deleting the condition-lines like /*40101 SET .. */.

About the documentation:
I was looking for the information where I could find a listing what charset was added in version 4.0 and what encoding was removed in version 4.1. E.x. in version 4.0 there is a charset called "german1". I can't find this charset in version 4.1. So MySQl 4.1. is not backwards compatible to version 4.0. Try to restore a backup encoded in german1 on a MySQL server of version 4.1  and you will know what I mean. :)

<<By the way, looking at that page will make you aware that there is in fact documentation available for MySQL 4.0, and even 3.23, while the article says there isn't.

The documentation now mixes all these versions together. When reading it you often don't know for which version the documentation is valid.

Or try the german link for the documentation of "set names": http://dev.mysql.com/doc/refman/5.1/de/
Although there is a link on this page for the "MySQL 4.0 Reference Manual" the link also points to the 5.1-site. My conclusion was that these version are no more supported. Now I see, that the english documenation is much more detailed. But still version 3.23, 4.0 and 4.1 are mixed. There are so many differences between version 4.0. and 4.1 from the view of a programmer, that this kind of documentaion is not sufficient.

Nevertheless: don't get me wrong! I love MySQL and its possibilities and I do my very best to help others to understand how to do working backups on shared hostings without having access to mysqldump.

But my conclusion still is: I want to do everything I can do, to make people update their MySQL verion 4.0 server to at least version 4.1 or greater.

Maybe this is the wrong place to discuss this. I invite you to come to the english section of my support board of MySQLDumper to continue this technical talk (I like :) ). http://forum.mysqldumper.de/index.php?c=6

Best regrads from Germany,
Daniel Schlichtholz (DSB)

By: Stefan Hinz

There's a wrong assumption right at the beginning:

"Since version 4.0 MySQL uses utf8 to store data internally."

This is not true. What's true is this:

"mysqldump from MySQL 4.1.2 or later uses utf8, and earlier versions use latin1."

But that's from the MySQL Manual. ;-) See: http://dev.mysql.com/doc/refman/4.1/en/mysqldump.html

There are more wrong assumptions in this article, and generally I find it more confusing than helpful.

The mysqldump client program takes into consideration a whole bunch of character set-related settings when backing up data, and it does that by default, as can be seen in this example:

aphrodite~> mysqldump test > testdump.sql
aphrodite~> cat testdump.sql
-- MySQL dump 10.13  Distrib 5.1.26-rc, for redhat-linux-gnu (i686)
--
-- Host: localhost    Database: test
-- ------------------------------------------------------
-- Server version       5.1.26-rc-log

/*!40101 SET @[email protected]@CHARACTER_SET_CLIENT */;
/*!40101 SET @[email protected]@CHARACTER_SET_RESULTS */;
/*!40101 SET @[email protected]@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
...

When restoring data backed up this way to a MySQL server running 4.1.01 or later, this should work with no issues "by default", so to speak.

Of course, there are more things that could go wrong. For example, the original data might be from a table that has a latin1 character set, or no explicit character set at all (MySQL prior to 4.1), and the target table might have UTF-8. This would still work one-way for many characters (including German umlauts, probably), but would certainly fail the other way around (when restoring to a MySQL server < 4.1).

For more information (including many examples), please refer to the Character Set Support section in the Internationalization and Localization chapter of the MySQL Reference Manual: http://dev.mysql.com/doc/refman/4.1/en/charset.html

By the way, looking at that page will make you aware that there is in fact documentation available for MySQL 4.0, and even 3.23, while the article says there isn't. Also, the article says that MySQL 4.0 charset behavior differs from what's documented -- if that's true, please say what those differences are, and file a bug against the documentation. Thanks!

Regards,

Stefan (Sun Database Group documentation team)