When the master dies, the slave will time out in server.sh trying to connect to the master's MySQL

Discussion in 'General' started by Gwyneth Llewelyn, Jul 10, 2017.

  1. Gwyneth Llewelyn

    Gwyneth Llewelyn New Member

    I've got an ISPConfig3 3.1.5 master/slave solution, configured using this HowToForge tutorial (adapted for Ubuntu 16.04.2 LTS, with php (cli) version 7.0.18-0ubuntu), using MariaDB/MySQL master-master replication (MariaDB is Ver 15.1 Distrib 10.0.29-MariaDB) and unison (2.48.4) to keep the pertinent files up to date. ISPConfig3 was configured with the Web option, so that when the master fails, the whole admin backoffice is still able to function. This is not a 'hot standby' solution but a manually configured 'cold standby' one, where I will only intervene if necessary by changing the relevant DNS entries on Cloudflare to point sites to the slave instead of the master.

    However, as I understand it, this is not a complete/perfect 'fallback' solution, even though there are claims to the contrary. Although the whole point of having MySQL master-master replication is to be able to change anything on one server and have it instantly replicated on the other, the truth is that the Web backoffice will not work on the slave, at least not without some modifications. To be more precise: when any modification is executed on the slave and server.sh picks it up via cron, it will try to contact the master's MySQL to get access to the server configuration; this happens in server.php, around line 62, after the comment which starts with:

    Try to Load the server configuration from the master-db

    ... when it fails, as far as I understand the code, it will assume that the slave and master are out of sync and will try to go through all the changes made via the Web backoffice to 'recover' the database (I'm not quite sure I understand where exactly those changes are recorded).
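    If I read that flow correctly, each cron run on the slave behaves roughly like this (a Python sketch of my understanding, not ISPConfig's actual PHP; all the function names here are mine):

```python
# Toy model of one server.sh run on the slave (my own sketch, not ISPConfig
# code): connect to the master DB first; on a timeout, nothing gets applied
# and the recorded changes stay pending for the next run.
def cron_run(connect_to_master_db, pending_changes, apply_change):
    try:
        connect_to_master_db()          # the mysqli_connect() that times out
    except TimeoutError:
        return pending_changes          # master down: queue left untouched
    while pending_changes:              # master reachable: replay the queue
        apply_change(pending_changes.pop(0))
    return pending_changes
```

    With the master down, every single run leaves the queue untouched, which is exactly the endless-timeout behaviour I'm seeing.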

    Now, when the master is really down for a long time and the slave is effectively running on its own, this creates a problem: server.sh will time out over and over again, unable to contact the master:

    PHP Warning: mysqli_connect(): (HY000/2002): Connection timed out in /usr/local/ispconfig/server/lib/classes/db_mysql.inc.php on line 85
    [... repeated several times ...]
    Zugriff auf Datenbankserver fehlgeschlagen! / Database server not accessible!
    As far as I can tell from what server.php does, there is a temporary file at /usr/local/ispconfig/server/temp/rescue_module_serverconfig.ser.txt, but I don't think it's pertinent here; its content, in my case, is just:

    a:1:{s:12:"serverconfig";a:2:{s:6:"server";a:1:{s:8:"loglevel";i:2;}s:6:"rescue";a:4:{s:10:"try_rescue";s:1:"n";s:23:"do_not_try_rescue_httpd";s:1:"n";s:23:"do_not_try_rescue_mysql";s:1:"n";s:22:"do_not_try_rescue_mail";s:1:"n";}}}
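    That is a PHP-serialized array. To inspect it, a tiny throwaway parser for just the subset of PHP's serialize() format that appears here (strings, integers, arrays) is enough; this is my own sketch, not anything from ISPConfig:

```python
# Minimal decoder for the s:len:"...";, i:n; and a:count:{...} tokens that
# make up rescue_module_serverconfig.ser.txt (illustrative sketch only).
def php_unserialize(data):
    value, _rest = _parse(data)
    return value

def _parse(s):
    tag = s[0]
    if tag == "i":                          # i:2;
        body, rest = s[2:].split(";", 1)
        return int(body), rest
    if tag == "s":                          # s:6:"server";
        length, rest = s[2:].split(":", 1)
        n = int(length)
        return rest[1:1 + n], rest[n + 3:]  # skip the quotes and ';'
    if tag == "a":                          # a:2:{...}
        count, rest = s[2:].split(":", 1)
        rest = rest[1:]                     # skip '{'
        items = {}
        for _ in range(int(count)):
            key, rest = _parse(rest)
            val, rest = _parse(rest)
            items[key] = val
        return items, rest[1:]              # skip '}'
    raise ValueError("unsupported token: " + tag)

# The blob from the temp file, verbatim:
blob = ('a:1:{s:12:"serverconfig";a:2:{s:6:"server";a:1:{s:8:"loglevel";i:2;}'
        's:6:"rescue";a:4:{s:10:"try_rescue";s:1:"n";'
        's:23:"do_not_try_rescue_httpd";s:1:"n";'
        's:23:"do_not_try_rescue_mysql";s:1:"n";'
        's:22:"do_not_try_rescue_mail";s:1:"n";}}}')

config = php_unserialize(blob)
print(config["serverconfig"]["rescue"]["try_rescue"])   # -> n
```

    So the file only caches the loglevel and the rescue-module switches, which is why I don't think it matters for the master-connection problem.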

    What this means for me, as far as I understand, is the following: using the slave server as a full backup while the master is down certainly works from the perspective of the already well-configured users/websites. But if there is any problem in the configuration, there is no easy (manual) way to make changes; in effect, the configuration on the slave is 'read-only'. We can certainly change it via the slave's Web backoffice, but because the slave is unable to contact the master server, such changes will never be applied locally.

    Am I correctly understanding the issue here? Am I right in assuming that a master/slave configuration is meant to have both servers active, or at the very least that the master must be active if one wants to change the ISPConfig configuration? The slave, by itself, is unable to do so.

    I can understand that creating a two-way configuration change queue on both master and slave might overly complicate the whole issue, and that's why it doesn't work that way (although I'm curious to see whether changes made on the slave will be reflected on the master once it comes back up and MySQL gets synced; maybe I will need to force a resync in ISPConfig for that to happen). But on the other hand I might be missing something: after all, I would at least expect some backoffice warning saying something like: 'Warning! You're currently making changes to the ISPConfig3 configuration on the slave server! Please make sure to log in to the correct backoffice before making any changes!' or at least 'Since you are making changes to the ISPConfig3 configuration on the slave while the master is down, you will need to wait until the master is up again for the changes to be applied'.
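    To illustrate why merging two queues after a split is awkward (a toy example of my own, nothing to do with ISPConfig's internals): if both sides queued a change to the same setting while separated, the merged result depends purely on replay order.

```python
# Toy illustration of the two-way-queue conflict (not ISPConfig code):
# each queued change overwrites the key it touches, so the side whose
# queue is replayed last silently wins.
def replay(config, queue):
    for key, value in queue:
        config[key] = value             # last write wins
    return config

master_queue = [("php_version", "7.0")]  # changed on the master while split
slave_queue = [("php_version", "7.1")]   # changed on the slave meanwhile

slave_last = replay(replay({}, master_queue), slave_queue)
master_last = replay(replay({}, slave_queue), master_queue)
print(slave_last, master_last)           # the two merge orders disagree
```

    So a one-way queue (slave waits for the master) at least avoids having to resolve conflicts like this.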

    Note that in my current situation I fortunately have the vast majority of websites working flawlessly on the slave, and I'm very confident that any changes made on the websites (not the ISPConfig3 overall configuration) will be reflected back on the master once it comes back (I have tested that before, and it works :) ). But this is actually the first time that the master server is really inaccessible for several hours and the first 'real world' test (as opposed to a 'lab experiment') where I can see why some configurations are not working as they ought to, but I cannot fix them (I could do that manually, of course, and let the master overwrite everything once it comes up again, but that defeats the whole purpose of having a redundant, automatic configuration).
     
  2. till

    till Super Moderator Staff Member ISPConfig Developer

    When the master server is down, changes are stored in the queue until it is up again, and then they get applied automatically. ISPConfig has to retry the connection to find out when MySQL is up again, so the timeouts you get are expected and not a problem. ISPConfig shows you that the changes you made in the interface on the slave in the meantime have not been written to disk yet. The live configuration of the sites is not affected, so they will keep functioning as before.

    Such messages would suggest to the user that something is wrong, as if he should not log in to the other interface while the master is down. And the fact that changes are stored in the queue is already shown in ISPConfig (the red blinking icon which shows the number of pending changes, and also in the jobqueue).
     
