#1
13th October 2011, 17:02
Mark_NL (Senior Member)

Cisco, Solaris and LACP

Hi there,

Don't know if this is the right place to ask, but let's just give it a try.

So I have this network .. 8 racks with servers (web/Red5/MySQL/Nexenta etc.) and 14 switches, all Cisco:

2x 3750: stacked
12x 2960: each has 2 or 4 Cat6 cables, evenly spread over the 3750s, trunking the internal and external VLANs.

One of our file servers has, besides its two onboard network adapters, an Intel e1000 server adapter; the file server is currently running on a single internal interface (1 Gbit).

Across the racks are ~50 web servers, which all have an NFS (UDP) mount to the file server.

The e1000 in the file server has both ports connected to a 2960, because I wanted to change it into a 2 Gbit aggregate so I'd have 2 Gbit of bandwidth available to/from the server.

So today ..

- I created a port-channel on the Cisco 2960, mode active LACP, for the two Gi ports the file server was connected to (only VLAN 200, which is internal):

Code:
port-channel load-balance src-dst-ip
!
interface Port-channel2
 switchport access vlan 200
!
interface GigabitEthernet1/0/36
 switchport access vlan 200
 channel-group 2 mode active
!
interface GigabitEthernet1/0/38
 switchport access vlan 200
 channel-group 2 mode active
!
Code:
lw-r2-core#sh etherchannel summary
2      Po2(SU)         LACP      Gi1/0/36(P) Gi1/0/38(P)
- On the file server I added a link aggregation with dladm:
Code:
root@alcor:~# dladm show-aggr
LINK            POLICY   ADDRPOLICY           LACPACTIVITY  LACPTIMER   FLAGS
aggr0           L3,L4    auto                 active        short       -----
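
For reference, creating that aggregation with dladm looks roughly like this (the e1000g0/e1000g1 interface names and the netmask are just examples, check yours with dladm show-link):
Code:
# create the aggregation over the two e1000 ports, matching the
# policy / LACP mode / timer shown above (interface names assumed)
dladm create-aggr -P L3,L4 -L active -T short -l e1000g0 -l e1000g1 aggr0

# plumb it and give it the file server's IP (netmask is an assumption)
ifconfig aggr0 plumb 192.168.5.181 netmask 255.255.255.0 up
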
So all looks fine .. web servers can connect over the aggregate etc., all looks good and well.

Until Nagios started sending SMS messages .. high load on 15-20 web servers .. hmm .. let's have a look ..

Code:
[183613.720649] nfs: server 192.168.5.181 OK
[183613.721477] nfs: server 192.168.5.181 OK
[183673.596026] nfs: server 192.168.5.181 not responding, still trying
[183677.996026] nfs: server 192.168.5.181 not responding, still trying
[183677.996033] nfs: server 192.168.5.181 not responding, still trying
[183677.996659] nfs: server 192.168.5.181 OK
[183677.997555] nfs: server 192.168.5.181 OK
[183677.997563] nfs: server 192.168.5.181 OK
[183687.588027] nfs: server 192.168.5.181 not responding, still trying
[183687.590185] nfs: server 192.168.5.181 OK
Aw crap .. now of the ~50 servers, about 40% of them got a high load (high wait state) ..
the other servers, using the exact same NFS mount and sharing racks with the affected ones, were fine. So where to look ..

tcpdump showed me A LOT of UDP packets being resent from the file server to the web servers .. while that didn't happen for the web servers that were working perfectly. The MTU is 1500 in my whole network, and the packet length was 1514 .. OK, fragmentation that's causing output socket buffers to fill up, or something?
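
(If you want to see this yourself: only the first fragment of a UDP datagram carries the port header, so you have to match the fragments explicitly. Something like this, the interface name is just an example:)
Code:
# NFS traffic to/from the file server, plus any IP fragments
# (fragment test: MF flag set or fragment offset non-zero)
tcpdump -ni eth0 'host 192.168.5.181 and (port 2049 or (ip[6:2] & 0x3fff) != 0)'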

The thing is .. when I remounted the NFS share over TCP, the load went away and everything ran smoothly. BUT that's just a workaround; it should work with UDP just as well as it does with TCP.
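
For the curious, the remount on a web server was roughly this (export/mount paths and NFS version are just examples, yours will differ):
Code:
# switch the mount from UDP to TCP (paths are examples)
umount /mnt/data
mount -t nfs -o proto=tcp,vers=3,hard,intr 192.168.5.181:/export/data /mnt/data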

So 40% of the servers have high load .. hmm, "Could it be .." yes .. it seems the web servers with the high load were all assigned to Gi1/0/38 on the switch .. the others to Gi1/0/36 .. disable port Gi1/0/38 .. et voila!
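
In case anyone wants to check this on their own etherchannel, IOS can tell you which member link a given flow hashes to (the web server IP below is just an example):
Code:
! which physical member does this src/dst pair hash to?
lw-r2-core#test etherchannel load-balance interface port-channel 2 ip 192.168.5.181 192.168.5.23

! take the suspect member out of service
lw-r2-core#conf t
lw-r2-core(config)#interface GigabitEthernet1/0/38
lw-r2-core(config-if)#shutdown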

The UDP-mounted servers with high load started to run smoothly again, with low load.

So I'm kinda stuck on where to look now .. and what could be causing this behavior.

I'm thinking about the UDP fragmentation .. When I lowered the MTU on the file server, there were fewer fragmented packets being sent, but still, it didn't really help.
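
The reason fragmentation hurts so much here: with the usual large rsize/wsize (often 32K), every NFS reply over UDP is split into a train of IP fragments, and if even one fragment is dropped the whole datagram has to be resent. One thing still on my list is forcing the transfer size under the MTU so nothing fragments at all (values and paths below are just examples, and throughput will suffer):
Code:
# keep each UDP datagram under the 1500-byte MTU so nothing fragments
# (paths and sizes are examples; sizes this small cost throughput)
mount -t nfs -o proto=udp,rsize=1024,wsize=1024 192.168.5.181:/export/data /mnt/data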

ALL switches in the network have src-dst-ip load balancing enabled.
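
(You can check that per switch with:)
Code:
lw-r2-core#show etherchannel load-balance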

So in short:

- File server: single uplink
-- NFS over UDP: working servers 100%
-- NFS over TCP: working servers 100%

- File server: LACP link
-- NFS over UDP: working servers 50%
-- NFS over TCP: working servers 100%

- File server: LACP link (pulled one cable)
-- NFS over UDP: working servers 100%
-- NFS over TCP: working servers 100%

Tomorrow I'm going to disable Gi1/0/36 and enable Gi1/0/38, to see if it's the physical switch port.

Anyone any pointers?
__________________
Real men don't backup... Real men cry!

http://www.e-rave.nl/
#2
14th December 2011, 09:57
Mark_NL (Senior Member)

Thinking about this cartoon:

[cartoon image]

Here's the "solution" ..
What we eventually did was remount everything over TCP .. at some point the amount of data being sent needs to be flow-controlled/checked, and UDP becomes too unstable at those volumes. So in short: use NFS over TCP.
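
For reference, the permanent version on the web servers is just an fstab entry along these lines (export/mount paths and NFS version are examples):
Code:
# /etc/fstab on a web server: NFS over TCP (paths are examples)
192.168.5.181:/export/data  /mnt/data  nfs  proto=tcp,vers=3,hard,intr  0  0
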
__________________
Real men don't backup... Real men cry!

http://www.e-rave.nl/