HowtoForge

High Availability NFS With DRBD + Heartbeat

 Ryan Babchishin - http://win2ix.ca 

This document describes information collected during research and development of a clustered DRBD NFS solution. This project had two purposes:

 

Operating System

The standard operating system at Win2ix is Ubuntu 12.04, so all testing was done with it as the target platform.

 

Hardware

Because of the upcoming project with Media-X, computer hardware was chosen based on low cost and low power consumption.

A pair of identical systems was used for the cluster:

 

Partitioning and disk format

The disks were partitioned according to Win2ix standards.

 

Networking

Ethernet bonding was tested; see the Bonding section below for more information.

 

Bonding

Although it was not used in the final build due to the lack of an extra PCI-E slot, Ethernet bonding was originally tested.

Important notes when using bonding:

Sample working /etc/network/interfaces configuration segment:

auto bond0
iface bond0 inet static
	# Static address used for the direct DRBD/heartbeat link
	address 192.168.0.1
	netmask 255.255.255.0
	# Mode 0 = balance-rr (round-robin)
	bond-mode 0
	# Check link state every 100 ms
	bond-miimon 100
	bond-slaves eth0 eth1
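Once the bond is up, its state can be checked through /proc as a quick sanity check (the interface name bond0 matches the sample above):

# Show the bonding mode, link status and enslaved interfaces
cat /proc/net/bonding/bond0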

 

Kernel Tuning

These sysctl changes seemed to make a small improvement, so I left them in place. They need to be added to /etc/sysctl.conf.

# drbd tuning
net.ipv4.tcp_no_metrics_save = 1
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
vm.dirty_ratio = 10
vm.dirty_background_ratio = 4
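To apply the settings without rebooting, reload /etc/sysctl.conf:

# Re-read /etc/sysctl.conf and apply the values immediately
sysctl -p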

 

DRBD

DRBD is the tricky part. It doesn't always perform well, despite what its developers would like you to think.

 

DRBD Performance

 

Working Example

A working and well-performing DRBD resource configuration:

resource r0 {
	net {
		#on-congestion pull-ahead;
		#congestion-fill 1G;
		#congestion-extents 3000;
		#sndbuf-size 1024k;
		sndbuf-size 0;
		max-buffers 8000;
		max-epoch-size 8000;
	}
	disk {
		#no-disk-barrier;
		#no-disk-flushes;
		no-md-flushes;
	}
	syncer {
		c-plan-ahead 20;
		c-fill-target 50k;
		c-min-rate 10M;
		al-extents 3833;
		rate 35M;
		use-rle;
	}
	startup { become-primary-on nfs01; }
	protocol C;
	device minor 1;
	meta-disk internal;

	on nfs01 {
		address 192.168.0.1:7801;
		disk /dev/sda6;
	}

	on nfs02 {
		address 192.168.0.2:7801;
		disk /dev/sda6;
	}
}
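Once both nodes have this configuration, the resource has to be initialised before the initial sync can start. A minimal sketch, assuming the drbd package is installed and the module loaded, and that nfs01 is to become the first primary (these are standard drbdadm calls for the DRBD 8.3 series shipped with Ubuntu 12.04, run as root):

# On both nodes: create the metadata and bring the resource up
drbdadm create-md r0
drbdadm up r0

# On nfs01 only: make it primary and start the initial sync
drbdadm -- --overwrite-data-of-peer primary r0

# Watch the sync progress
cat /proc/drbd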

Relevant section of /etc/fstab used with this configuration:

# DRBD, mounted by heartbeat
/dev/drbd1	/mnt		ext4	noatime,noauto,nobarrier	0	0

 

Benchmarking

Before bothering with NFS or anything else, it is a good idea to make sure DRBD is performing well.

Benchmark tools

 

DD

There are some simple tests you can do with dd to check the performance of a storage device. However, other tools should be used afterwards for more accurate, real-world results. When I'm benchmarking or trying to identify bottlenecks, I run atop on the same system in a separate terminal while dd is transferring data.

# Write to a file, bypassing the page cache (direct I/O)
dd if=/dev/zero of=testfile bs=100M count=20 oflag=direct
# Write to a file and flush everything to disk before finishing
dd if=/dev/zero of=testfile bs=100M count=20 conv=fsync
# Read the file back with direct I/O
dd if=testfile of=/dev/null bs=100M iflag=direct
# Read the file back through the page cache
dd if=testfile of=/dev/null bs=100M

# Flush dirty data and drop the caches between tests
sync
echo 3 > /proc/sys/vm/drop_caches

# Raw device tests (the write test is destructive - only use an unused device/partition)
dd if=/dev/zero of=/dev/sdXX bs=100M count=20 oflag=direct
dd if=/dev/sdXX of=/dev/null bs=100M count=20 iflag=direct
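To check whether DRBD itself is the bottleneck, the same direct-I/O read test can be run against the DRBD device and its backing partition and the results compared (device names match the resource configuration above; reads are safe on a live resource):

# Read from the DRBD device, bypassing the page cache
dd if=/dev/drbd1 of=/dev/null bs=100M count=20 iflag=direct
# Read from the backing partition for comparison
dd if=/dev/sda6 of=/dev/null bs=100M count=20 iflag=direct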

 

NFS Server

The only configuration needed was in /etc/exports:

/mnt 192.168.3.0/24(rw,async,no_subtree_check,fsid=0)
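After editing /etc/exports, the export list can be re-applied and verified without restarting the NFS server:

# Re-export everything in /etc/exports and show the active exports
exportfs -ra
exportfs -v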

'/etc/idmapd.conf' may need to be adjusted to match your domain when using NFSv4 (on the client and/or the server).
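The relevant setting is the Domain entry; a minimal example, assuming a hypothetical domain of example.com (replace with your own):

[General]
# Must match on client and server for NFSv4 ID mapping to work
Domain = example.com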

 

NFS Client

In testing, I chose to use this command to mount NFS:

mount nfs:/mnt /testnfs -o rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime

Explanation of the options used:

- rsize=32768, wsize=32768 - read and write transfer sizes of 32KB
- hard - retry requests indefinitely instead of returning I/O errors
- timeo=50 - RPC timeout of 5 seconds (the value is in tenths of a second)
- bg - if the mount fails, keep retrying it in the background
- actimeo=3 - cache file attributes for 3 seconds
- noatime, nodiratime - don't update access times on files and directories
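For a permanent mount, the equivalent entry could go in /etc/fstab on the client (the server name nfs and mount point /testnfs are the ones used above; bg is mainly useful in fstab so boot doesn't hang if the server is down):

nfs:/mnt	/testnfs	nfs	rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime	0	0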

 

Clustering

Heartbeat without Pacemaker was chosen. Pacemaker seemed too complex and difficult to manage for what was needed.

 

Heartbeat

In this test setup, heartbeat has two Ethernet connections to communicate between the nodes. The first is the regular network/subnet/LAN connection and the other is the direct DRBD crossover link. Having multiple communication paths is important so that one heartbeat node doesn't lose contact with the other. If that happens, neither node knows which one is master and the cluster becomes 'split-brained'.

 

STONITH

S.T.O.N.I.T.H = Shoot The Other Node In The Head

STONITH is the facility that heartbeat uses to reboot a cluster node that is not responding. This is very important, because heartbeat needs to be certain that the other node is no longer using DRBD (or other corruptible resources). If a node stops responding completely, the other node reboots it via STONITH (which uses IPMI in the examples below) and then takes over its resources.

When two nodes both believe they are master (own the resources), it is called split-brain. This can lead to problems and sometimes data corruption. STONITH with IPMI can protect against this.
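Before relying on STONITH, it is worth confirming that each node can actually reach the other's IPMI interface. A quick check with ipmitool, using the address and credentials from the ha.cf example below (the IP/credentials are the sample values, not real ones):

# From nfs01: query the power state of nfs02's IPMI interface
ipmitool -I lan -H 192.168.3.33 -U ADMIN -P somepwd chassis power status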

 

Working Example

The ha.cf file defines the cluster and how its nodes interact.

/etc/ha.d/ha.cf:

# Give cluster 30 seconds to start
initdead 30
# Keep alive packets every 1 second
keepalive 1
# Misc settings
traditional_compression off
deadtime 10
deadping 10
warntime 5
# Nodes in cluster
node nfs01 nfs02
# Use ipmi to check power status and reboot nodes
stonith_host    nfs01 external/ipmi nfs02 192.168.3.33 ADMIN somepwd lan
stonith_host    nfs02 external/ipmi nfs01 192.168.3.34 ADMIN somepwd lan
# Use logd, configure /etc/logd.cf
use_logd on
# Don't move service back to preferred host when it comes up
auto_failback off
# The ping group is up as long as at least one of these hosts responds
ping_group lan_ping 192.168.3.1 192.168.3.13
# Take over if the pings (above) fail
respawn hacluster /usr/lib/heartbeat/ipfail

##### Use unicast instead of default multicast so firewall rules are easier
# nfs01
ucast eth0 192.168.3.32
ucast eth1 192.168.0.1
# nfs02
ucast eth0 192.168.3.31
ucast eth1 192.168.0.2

The haresources file describes the resources provided by the cluster. Its format is: [preferred node] [1st service] [2nd service] ... Services are started in the order they are listed and stopped in the reverse order. They will run on the preferred node when possible.

/etc/ha.d/haresources:

nfs01 drbddisk::r0 Filesystem::/dev/drbd1::/mnt::ext4 IPaddr2::192.168.3.30/24/eth0 nfs-kernel-server

The logd.conf file defines logging for heartbeat.

/etc/logd.conf:

debugfile /var/log/ha-debug
logfile	/var/log/ha-log
syslogprefix linux-ha

 

Testing Fail-over

There are numerous tests you can perform. Try pinging the floating IP address while pulling cables, initiating a heartbeat takeover (see below), killing heartbeat with SIGKILL, etc. But my favourite test is of the NFS service itself, the part that matters most. /var/log/ha-debug will have lots of detail about what heartbeat is doing during your tests.
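A controlled failover can be triggered with the helper scripts shipped with heartbeat (the path may vary by distribution; on Ubuntu they are usually under /usr/share/heartbeat/):

# On the active node: give up the resources and let the other node take over
/usr/share/heartbeat/hb_standby

# Or, on the passive node: take the resources over
/usr/share/heartbeat/hb_takeover

# Watch what heartbeat is doing while the resources move
tail -f /var/log/ha-debug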

 

Testing NFS Fail-over

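One way to exercise the NFS service, as a minimal sketch using the client mount point /testnfs from above (the file name is arbitrary):

# On the NFS client: append a timestamp every second and watch for stalls
while true; do date >> /testnfs/failover-test.log; sleep 1; done

While this loop runs, trigger a failover (pull the power on the active node, or run hb_standby on it) and note how long writes stall before they resume. With the hard,bg mount options used above, the client should recover without I/O errors once the other node has taken over.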