Xen Cluster Management With Ganeti On Debian Lenny
11 A Failover Example
Now let's assume you want to take down node2.example.com for maintenance, and you therefore want to fail over inst1.example.com to node1 (please note that inst1.example.com will be shut down during the failover, but will be started again immediately afterwards).
First, let's find out about our instances:
node1:
gnt-instance list
As you see, node2 is the primary node:
node1:~# gnt-instance list
Instance OS Primary_node Status Memory
inst1.example.com debootstrap node2.example.com running 256
node1:~#
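If you want a bit more detail than gnt-instance list provides (for example which node currently acts as the secondary for the DRBD mirror), you can also run the following command - this is just an optional check, and the exact output layout may differ between Ganeti versions:
gnt-instance info inst1.example.com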
To failover inst1.example.com to node1, we run the following command (again on node1):
gnt-instance failover inst1.example.com
node1:~# gnt-instance failover inst1.example.com
Failover will happen to image inst1.example.com. This requires a
shutdown of the instance. Continue?
y/[n]/?: <-- y
* checking disk consistency between source and target
* shutting down instance on source node
* deactivating the instance's disks on source node
* activating the instance's disks on target node
* starting the instance on the target node
node1:~#
Afterwards, we run
gnt-instance list
again. node1 should now be the primary node:
node1:~# gnt-instance list
Instance OS Primary_node Status Memory
inst1.example.com debootstrap node1.example.com running 256
node1:~#
As inst1.example.com has started again immediately after the failover, we need to fix the console problem again (see chapter 9):
gnt-instance shutdown inst1.example.com
gnt-instance startup --extra "xencons=tty1 console=tty1" inst1.example.com
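To verify that the console works again, you can try to attach to the instance's console (just a quick check; press CTRL+] to detach from a Xen console again):
gnt-instance console inst1.example.com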
Now you can take down node2:
node2:
shutdown -h now
After node2 has gone down, you can try to connect to inst1.example.com - it should still be running.
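For example, a quick reachability check from another machine (assuming inst1.example.com resolves to the instance's IP address and SSH is running inside the instance):
ping inst1.example.com
ssh root@inst1.example.com uptime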
Now that the maintenance on node2 is finished and we have booted it again, we'd like to make it the primary node again.
Therefore we try another failover on node1:
node1:
gnt-instance failover inst1.example.com
This time, however, the failover fails:
node1:~# gnt-instance failover inst1.example.com
Failover will happen to image inst1.example.com. This requires a
shutdown of the instance. Continue?
y/[n]/?: <-- y
* checking disk consistency between source and target
Node node2.example.com: Disk degraded, not found or node down
Failure: command execution error:
Disk sda is degraded on target node, aborting failover.
node1:~#
The failover doesn't work because inst1.example.com's disk on node2 is degraded (i.e., no longer in sync with the copy on node1).
To fix this, we can replace inst1.example.com's disks on node2 by mirroring the disks from the current primary node, node1, to node2:
node1:
gnt-instance replace-disks -s inst1.example.com
During this process (which can take some time), inst1.example.com can stay up.
node1:~# gnt-instance replace-disks -s inst1.example.com
STEP 1/6 check device existence
- INFO: checking volume groups
- INFO: checking sda on node2.example.com
- INFO: checking sda on node1.example.com
- INFO: checking sdb on node2.example.com
- INFO: checking sdb on node1.example.com
STEP 2/6 check peer consistency
- INFO: checking sda consistency on node1.example.com
- INFO: checking sdb consistency on node1.example.com
STEP 3/6 allocate new storage
- INFO: creating new local storage on node2.example.com for sda
- INFO: creating new local storage on node2.example.com for sdb
STEP 4/6 change drbd configuration
- INFO: detaching sda drbd from local storage
- INFO: renaming the old LVs on the target node
- INFO: renaming the new LVs on the target node
- INFO: adding new mirror component on node2.example.com
- INFO: detaching sdb drbd from local storage
- INFO: renaming the old LVs on the target node
- INFO: renaming the new LVs on the target node
- INFO: adding new mirror component on node2.example.com
STEP 5/6 sync devices
- INFO: Waiting for instance inst1.example.com to sync disks.
- INFO: - device sda: 1.80% done, 560 estimated seconds remaining
- INFO: - device sdb: 12.40% done, 35 estimated seconds remaining
- INFO: - device sda: 5.80% done, 832 estimated seconds remaining
- INFO: - device sdb: 89.30% done, 3 estimated seconds remaining
- INFO: - device sda: 6.40% done, 664 estimated seconds remaining
- INFO: - device sdb: 98.50% done, 0 estimated seconds remaining
- INFO: - device sda: 6.50% done, 767 estimated seconds remaining
- INFO: - device sdb: 100.00% done, 0 estimated seconds remaining
- INFO: - device sda: 6.50% done, 818 estimated seconds remaining
- INFO: - device sda: 19.30% done, 387 estimated seconds remaining
- INFO: - device sda: 32.00% done, 281 estimated seconds remaining
- INFO: - device sda: 44.70% done, 242 estimated seconds remaining
- INFO: - device sda: 57.30% done, 195 estimated seconds remaining
- INFO: - device sda: 70.00% done, 143 estimated seconds remaining
- INFO: - device sda: 82.70% done, 74 estimated seconds remaining
- INFO: - device sda: 95.40% done, 20 estimated seconds remaining
- INFO: - device sda: 99.80% done, 3 estimated seconds remaining
- INFO: Instance inst1.example.com's disks are in sync.
STEP 6/6 removing old storage
- INFO: remove logical volumes for sda
- INFO: remove logical volumes for sdb
node1:~#
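By the way, if you'd like to follow such a resynchronisation outside of Ganeti as well, you can watch DRBD's status on one of the nodes while replace-disks is running (just a quick sketch; press CTRL+C to stop watching):
node2:
watch -n 5 cat /proc/drbd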
Afterwards, we can fail over inst1.example.com to node2:
gnt-instance failover inst1.example.com
node2 should now be the primary again:
gnt-instance list
node1:~# gnt-instance list
Instance OS Primary_node Status Memory
inst1.example.com debootstrap node2.example.com running 256
node1:~#
(Now fix the console problem again, as in chapter 9:
gnt-instance shutdown inst1.example.com
gnt-instance startup --extra "xencons=tty1 console=tty1" inst1.example.com
)
12 A Live Migration Example
One of the great Ganeti features is that you can do live migrations of instances, i.e., you can move them from one node to the other without taking them down (live migration works only if you're using DRBD 0.8; it doesn't work with DRBD 0.7).
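If you're not sure which DRBD version your nodes are running, a quick check is to look at /proc/drbd on a node - its first line contains the version string (e.g. version: 8.0.x):
cat /proc/drbd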
To migrate inst1.example.com from node2 to node1, we run:
node1:
gnt-instance migrate inst1.example.com
node1:~# gnt-instance migrate inst1.example.com
Instance inst1.example.com will be migrated. Note that migration is
**experimental** in this version. This might impact the instance if
anything goes wrong. Continue?
y/[n]/?: <-- y
* checking disk consistency between source and target
* identifying disks
* switching node node1.example.com to secondary mode
* changing into standalone mode
* changing disks into dual-master mode
* wait until resync is done
* migrating instance to node1.example.com
* switching node node2.example.com to secondary mode
* wait until resync is done
* changing into standalone mode
* changing disks into single-master mode
* wait until resync is done
* done
node1:~#
The command
gnt-instance list
should now show that inst1.example.com is running on node1:
node1:~# gnt-instance list
Instance OS Primary_node Status Memory
inst1.example.com debootstrap node1.example.com running 256
node1:~#
Let's migrate it back to node2:
gnt-instance migrate inst1.example.com
node1:~# gnt-instance migrate inst1.example.com
Instance inst1.example.com will be migrated. Note that migration is
**experimental** in this version. This might impact the instance if
anything goes wrong. Continue?
y/[n]/?: <-- y
* checking disk consistency between source and target
* identifying disks
* switching node node2.example.com to secondary mode
* changing into standalone mode
* changing disks into dual-master mode
* wait until resync is done
* migrating instance to node2.example.com
* switching node node1.example.com to secondary mode
* wait until resync is done
* changing into standalone mode
* changing disks into single-master mode
* wait until resync is done
* done
node1:~#
gnt-instance list
node1:~# gnt-instance list
Instance OS Primary_node Status Memory
inst1.example.com debootstrap node2.example.com running 256
node1:~#
13 Creating A Backup Of An Instance
To create a backup of inst1.example.com on node1, we run (the instance will be shut down during this operation!):
node1:
gnt-backup export -n node1.example.com inst1.example.com
The backup will be stored in the /var/lib/ganeti/export/inst1.example.com/ directory:
ls -l /var/lib/ganeti/export/inst1.example.com/
node1:~# ls -l /var/lib/ganeti/export/inst1.example.com/
total 108788
-rw-r--r-- 1 root root 111279899 2009-02-26 17:30 9c923acc-14b4-460d-946e-3b0d4d2e18e6.sda_data.snap
-rw------- 1 root root 391 2009-02-26 17:30 config.ini
node1:~#
To import the backup on another cluster node, e.g. node3, we run
gnt-backup import -n node3.example.com -t drbd --src-node=node1.example.com --src-dir=/var/lib/ganeti/export/inst1.example.com/ inst1.example.com
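You can then check with gnt-instance list that the imported instance shows up with node3.example.com as its primary node (assuming the import succeeded and node3 is part of the cluster):
gnt-instance list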
14 Masterfailover
Now let's assume our cluster master, node1, has gone down for whatever reason. Therefore we need a new master. To make node2 the new cluster master, we run the following command on node2:
node2:
gnt-cluster masterfailover
node2:~# gnt-cluster masterfailover
caller_connect: could not connect to remote host node1.example.com, reason [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 113: No route to host.
]
could disable the master role on the old master node1.example.com, please disable manually
caller_connect: could not connect to remote host node1.example.com, reason [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 113: No route to host.
]
caller_connect: could not connect to remote host node1.example.com, reason [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 113: No route to host.
]
node2:~#
Now run
gnt-cluster getmaster
to verify that node2 is the new master:
node2:~# gnt-cluster getmaster
node2.example.com
node2:~#
Now when node1 comes up again, we have a split-brain situation - node1 thinks it is the master...
node1:
gnt-cluster getmaster
node1:~# gnt-cluster getmaster
node1.example.com
node1:~#
... while in fact node2 is the master.
To fix this, we edit /var/lib/ganeti/ssconf_master_node on node1:
node1:
chmod 600 /var/lib/ganeti/ssconf_master_node
vi /var/lib/ganeti/ssconf_master_node
node2.example.com
chmod 400 /var/lib/ganeti/ssconf_master_node
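If you prefer not to use vi, the following should achieve the same (just a sketch; it simply overwrites the file with the name of the new master):
node1:
chmod 600 /var/lib/ganeti/ssconf_master_node
echo "node2.example.com" > /var/lib/ganeti/ssconf_master_node
chmod 400 /var/lib/ganeti/ssconf_master_node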
Afterwards,...
gnt-cluster getmaster
... shows the right master:
node1:~# gnt-cluster getmaster
node2.example.com
node1:~#
To make node1 the master again, just run
gnt-cluster masterfailover
on node1 - if both node1 and node2 are running during this operation, both will know that node1 is the new master afterwards.
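To verify, you can run the following command on both nodes afterwards; both should now report node1.example.com:
gnt-cluster getmaster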
15 Links
- Ganeti: http://code.google.com/p/ganeti
- Xen: http://xen.xensource.com
- DRBD: http://www.drbd.org
- LVM: http://sourceware.org/lvm2
- Debian: http://www.debian.org