Xen Cluster Management With Ganeti On Debian Lenny - Page 4

11 A Failover Example

Now let's assume you want to take down node2.example.com for maintenance, and therefore want to fail over inst1.example.com to node1 (please note that inst1.example.com will be shut down during the failover, but will be started again immediately afterwards).

First, let's find out about our instances:

node1:

gnt-instance list

As you can see, node2 is the primary node:

node1:~# gnt-instance list
Instance          OS          Primary_node      Status  Memory
inst1.example.com debootstrap node2.example.com running    256
node1:~#
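
If you'd like to see the same picture from the node side, gnt-node list gives a per-node overview (total and free memory and disk, plus how many primary and secondary instances each node currently holds; the exact columns can vary slightly between Ganeti versions):

node1:

gnt-node list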

To fail over inst1.example.com to node1, we run the following command (again on node1):

gnt-instance failover inst1.example.com

node1:~# gnt-instance failover inst1.example.com
Failover will happen to image inst1.example.com. This requires a
shutdown of the instance. Continue?
y/[n]/?:
<-- y
* checking disk consistency between source and target
* shutting down instance on source node
* deactivating the instance's disks on source node
* activating the instance's disks on target node
* starting the instance on the target node
node1:~#

Afterwards, we run

gnt-instance list

again. node1 should now be the primary node:

node1:~# gnt-instance list
Instance          OS          Primary_node      Status  Memory
inst1.example.com debootstrap node1.example.com running    256
node1:~#

As inst1.example.com was started again immediately after the failover, we need to apply the console fix from chapter 9 again:

gnt-instance shutdown inst1.example.com
gnt-instance startup --extra "xencons=tty1 console=tty1" inst1.example.com

Now you can take down node2:

node2:

shutdown -h now

After node2 has gone down, you can try to connect to inst1.example.com - it should still be running.
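
For example, assuming inst1.example.com's hostname resolves from your workstation and you have SSH access to the instance, a quick check could look like this:

ping -c 3 inst1.example.com
ssh root@inst1.example.com uptime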

Now that the maintenance on node2 is finished and we have booted it again, we'd like to make it the primary node again.
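
Whenever a node rejoins the cluster after downtime, it can't hurt to run gnt-cluster verify first; it performs a number of cluster-wide sanity checks and may already point you to problems such as out-of-sync disks (which is exactly what we are about to run into):

node1:

gnt-cluster verify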

To make node2 the primary again, we run the failover command on node1 once more:

node1:

gnt-instance failover inst1.example.com

This time we get this:

node1:~# gnt-instance failover inst1.example.com
Failover will happen to image inst1.example.com. This requires a
shutdown of the instance. Continue?
y/[n]/?:
<-- y
* checking disk consistency between source and target
Node node2.example.com: Disk degraded, not found or node down
Failure: command execution error:
Disk sda is degraded on target node, aborting failover.
node1:~#

The failover doesn't work because inst1.example.com's disk on node2 is degraded (i.e., not in sync).
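
You can confirm this before repairing anything, for example with gnt-instance info (which shows the state of each of the instance's disks) or by looking at the DRBD status on node2 directly (assuming DRBD's /proc interface is available there):

node1:

gnt-instance info inst1.example.com

node2:

cat /proc/drbd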

To fix this, we can replace inst1.example.com's disks on node2 by mirroring the disks from the current primary node, node1, to node2:

node1:

gnt-instance replace-disks -s inst1.example.com

During this process (which can take some time) inst1.example.com can stay up.

node1:~# gnt-instance replace-disks -s inst1.example.com
STEP 1/6 check device existence
 - INFO: checking volume groups
 - INFO: checking sda on node2.example.com
 - INFO: checking sda on node1.example.com
 - INFO: checking sdb on node2.example.com
 - INFO: checking sdb on node1.example.com
STEP 2/6 check peer consistency
 - INFO: checking sda consistency on node1.example.com
 - INFO: checking sdb consistency on node1.example.com
STEP 3/6 allocate new storage
 - INFO: creating new local storage on node2.example.com for sda
 - INFO: creating new local storage on node2.example.com for sdb
STEP 4/6 change drbd configuration
 - INFO: detaching sda drbd from local storage
 - INFO: renaming the old LVs on the target node
 - INFO: renaming the new LVs on the target node
 - INFO: adding new mirror component on node2.example.com
 - INFO: detaching sdb drbd from local storage
 - INFO: renaming the old LVs on the target node
 - INFO: renaming the new LVs on the target node
 - INFO: adding new mirror component on node2.example.com
STEP 5/6 sync devices
 - INFO: Waiting for instance inst1.example.com to sync disks.
 - INFO: - device sda:  1.80% done, 560 estimated seconds remaining
 - INFO: - device sdb: 12.40% done, 35 estimated seconds remaining
 - INFO: - device sda:  5.80% done, 832 estimated seconds remaining
 - INFO: - device sdb: 89.30% done, 3 estimated seconds remaining
 - INFO: - device sda:  6.40% done, 664 estimated seconds remaining
 - INFO: - device sdb: 98.50% done, 0 estimated seconds remaining
 - INFO: - device sda:  6.50% done, 767 estimated seconds remaining
 - INFO: - device sdb: 100.00% done, 0 estimated seconds remaining
 - INFO: - device sda:  6.50% done, 818 estimated seconds remaining
 - INFO: - device sda: 19.30% done, 387 estimated seconds remaining
 - INFO: - device sda: 32.00% done, 281 estimated seconds remaining
 - INFO: - device sda: 44.70% done, 242 estimated seconds remaining
 - INFO: - device sda: 57.30% done, 195 estimated seconds remaining
 - INFO: - device sda: 70.00% done, 143 estimated seconds remaining
 - INFO: - device sda: 82.70% done, 74 estimated seconds remaining
 - INFO: - device sda: 95.40% done, 20 estimated seconds remaining
 - INFO: - device sda: 99.80% done, 3 estimated seconds remaining
 - INFO: Instance inst1.example.com's disks are in sync.
STEP 6/6 removing old storage
 - INFO: remove logical volumes for sda
 - INFO: remove logical volumes for sdb
node1:~#
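
If you want to follow the resynchronization from the DRBD side as well, you can watch /proc/drbd on node2 while the replace-disks job is running; this is purely optional, as the output above already reports the progress:

node2:

watch -n 5 cat /proc/drbd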

Afterwards, we can fail over inst1.example.com back to node2:

gnt-instance failover inst1.example.com

node2 should now be the primary again:

gnt-instance list
node1:~# gnt-instance list
Instance          OS          Primary_node      Status  Memory
inst1.example.com debootstrap node2.example.com running    256
node1:~#

As the instance has been restarted once more, apply the console fix from chapter 9 again:

gnt-instance shutdown inst1.example.com
gnt-instance startup --extra "xencons=tty1 console=tty1" inst1.example.com

 

12 A Live Migration Example

One of the great Ganeti features is that you can do live migrations of instances, i.e., you can move them from one node to the other without taking them down (live migration only works if you're using DRBD 0.8; it doesn't work with DRBD 0.7).
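
If you're not sure which DRBD version your nodes are running, the version line of /proc/drbd (or the DRBD kernel module information) will tell you; run this on both nodes:

cat /proc/drbd | head -1
modinfo drbd | grep -i version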

To migrate inst1.example.com from node2 to node1, we run:

node1:

gnt-instance migrate inst1.example.com

node1:~# gnt-instance migrate inst1.example.com
Instance inst1.example.com will be migrated. Note that migration is
**experimental** in this version. This might impact the instance if
anything goes wrong. Continue?
y/[n]/?:
<-- y
* checking disk consistency between source and target
* identifying disks
* switching node node1.example.com to secondary mode
* changing into standalone mode
* changing disks into dual-master mode
* wait until resync is done
* migrating instance to node1.example.com
* switching node node2.example.com to secondary mode
* wait until resync is done
* changing into standalone mode
* changing disks into single-master mode
* wait until resync is done
* done
node1:~#

The command

gnt-instance list

should now show that inst1.example.com is running on node1:

node1:~# gnt-instance list
Instance          OS          Primary_node      Status  Memory
inst1.example.com debootstrap node1.example.com running    256
node1:~#

Let's migrate it back to node2:

gnt-instance migrate inst1.example.com

node1:~# gnt-instance migrate inst1.example.com
Instance inst1.example.com will be migrated. Note that migration is
**experimental** in this version. This might impact the instance if
anything goes wrong. Continue?
y/[n]/?:
<-- y
* checking disk consistency between source and target
* identifying disks
* switching node node2.example.com to secondary mode
* changing into standalone mode
* changing disks into dual-master mode
* wait until resync is done
* migrating instance to node2.example.com
* switching node node1.example.com to secondary mode
* wait until resync is done
* changing into standalone mode
* changing disks into single-master mode
* wait until resync is done
* done
node1:~#

Afterwards, gnt-instance list should show node2 as the primary node again:

gnt-instance list
node1:~# gnt-instance list
Instance          OS          Primary_node      Status  Memory
inst1.example.com debootstrap node2.example.com running    256
node1:~#

 

13 Creating A Backup Of An Instance

To create a backup of inst1.example.com on node1, we run (the instance will be shut down during this operation!):

node1:

gnt-backup export -n node1.example.com inst1.example.com

The backup will be stored in the /var/lib/ganeti/export/inst1.example.com/ directory:

ls -l /var/lib/ganeti/export/inst1.example.com/
node1:~# ls -l /var/lib/ganeti/export/inst1.example.com/
total 108788
-rw-r--r-- 1 root root 111279899 2009-02-26 17:30 9c923acc-14b4-460d-946e-3b0d4d2e18e6.sda_data.snap
-rw------- 1 root root       391 2009-02-26 17:30 config.ini
node1:~#
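
The instance was shut down for the export, so check with gnt-instance list whether inst1.example.com is running again. Once it is up, you will most likely have to repeat the console workaround from chapter 9 (the same two commands as before):

gnt-instance shutdown inst1.example.com
gnt-instance startup --extra "xencons=tty1 console=tty1" inst1.example.com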

To restore inst1.example.com from this backup on another node, e.g. node3, we use gnt-backup import:

gnt-backup import -n node3.example.com -t drbd --src-node=node1.example.com --src-dir=/var/lib/ganeti/export/inst1.example.com/ inst1.example.com

 

14 Master Failover

Now let's assume our cluster master, node1, has gone down for whatever reason. Therefore we need a new master. To make node2 the new cluster master, we run the following command on node2:

node2:

gnt-cluster masterfailover
node2:~# gnt-cluster masterfailover
caller_connect: could not connect to remote host node1.example.com, reason [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 113: No route to host.
]
could not disable the master role on the old master node1.example.com, please disable manually
caller_connect: could not connect to remote host node1.example.com, reason [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 113: No route to host.
]
caller_connect: could not connect to remote host node1.example.com, reason [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectError'>: An error occurred while connecting: 113: No route to host.
]
node2:~#

Now run

gnt-cluster getmaster

to verify that node2 is the new master:

node2:~# gnt-cluster getmaster
node2.example.com
node2:~#

Now when node1 comes up again, we have a split-brain situation - node1 thinks it is the master...

node1:

gnt-cluster getmaster
node1:~# gnt-cluster getmaster
node1.example.com
node1:~#

... while in fact node2 is the master.

To fix this, we edit /var/lib/ganeti/ssconf_master_node on node1 so that it contains node2.example.com (the file is read-only, so we temporarily make it writable first):

node1:

chmod 600 /var/lib/ganeti/ssconf_master_node
vi /var/lib/ganeti/ssconf_master_node
node2.example.com
chmod 400 /var/lib/ganeti/ssconf_master_node
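
If you prefer to do this non-interactively (simply writing the new master's name into the file instead of opening it in vi), the following should be equivalent:

node1:

chmod 600 /var/lib/ganeti/ssconf_master_node
echo node2.example.com > /var/lib/ganeti/ssconf_master_node
chmod 400 /var/lib/ganeti/ssconf_master_node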

Afterwards,...

gnt-cluster getmaster

... shows the right master:

node1:~# gnt-cluster getmaster
node2.example.com
node1:~#

To make node1 the master again, just run

gnt-cluster masterfailover

on node1 - if both node1 and node2 are running during this operation, both will know that node1 is the new master afterwards.
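
To double-check, run gnt-cluster getmaster on both nodes afterwards; both should now report node1.example.com:

node1:

gnt-cluster getmaster

node2:

gnt-cluster getmaster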

 
