#5
Old 14th September 2010
rfranzke

OK, it has been some time, but I might have figured out what is going on here. I moved to FreeBSD 8.1-RELEASE on a Dell blade server with internal switches, and LAGG is working better but not fully. I am not sure if this is a FreeBSD issue or a problem with our switches, but it could crop up in other environments as it does in mine. I have two switches, by the way. One LAGG NIC goes to one switch and the other LAGG NIC goes to a second switch, so the host can recover from a switch failure as well as a NIC/link failure. These switches are then uplinked into our core switches.

It seems that when operating in failover mode, LAGG does not have any mechanism to tell the network that the MAC address of the LAGG interface has moved. When a failover occurs, it needs to be able to repopulate the CAM tables in the switch it is now connected through. This is required for the rest of the network to see that the layer 2 address (MAC) has moved and can now be reached on a different port/location on the network. Without this mechanism, the switches directly connected to the LAGG host no longer see its MAC address at all, because the default behavior (for Cisco switches anyway) is to flush the CAM table entries associated with any port that goes down. The upstream switches, meanwhile, keep sending traffic toward the last recorded entry in their own CAM tables, which still points at the failed path where the downstream entry is now missing. This results in the LAGG host being unreachable from the rest of the network.
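
To make the sequence above concrete, here is a toy Python model of it. This is only a sketch under simplifying assumptions: the `Switch` class, the port names (`host`, `to_sw1`, `to_sw2`) and the MAC address are all made up for illustration, and aging timers and flooding of unknown unicast are not modeled.

```python
class Switch:
    """Toy layer 2 switch: a CAM table mapping MAC address -> port name."""

    def __init__(self):
        self.cam = {}

    def learn(self, mac, port):
        # A switch learns (or relearns) a MAC from the source of a received frame.
        self.cam[mac] = port

    def port_down(self, port):
        # Assumed behavior: entries learned on a port are flushed when it goes down.
        self.cam = {m: p for m, p in self.cam.items() if p != port}

    def lookup(self, mac):
        return self.cam.get(mac)


def simulate_failover():
    mac = "02:00:00:00:00:01"  # hypothetical LAGG MAC address
    sw1, sw2, core = Switch(), Switch(), Switch()

    # Normal operation: the host transmits on its primary NIC through sw1.
    sw1.learn(mac, "host")
    core.learn(mac, "to_sw1")

    # The primary link fails. sw1 flushes its entry, but nothing tells core.
    sw1.port_down("host")
    after_failure = (sw1.lookup(mac), sw2.lookup(mac), core.lookup(mac))

    # The host sends any frame out of the backup NIC: sw2 and core relearn.
    sw2.learn(mac, "host")
    core.learn(mac, "to_sw2")
    after_transmit = (sw1.lookup(mac), sw2.lookup(mac), core.lookup(mac))

    return after_failure, after_transmit
```

In this model, after the failure the core switch still points at `to_sw1` while neither access switch knows the MAC. Only a frame transmitted by the host on the backup NIC moves the core entry to `to_sw2`.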

What I see is that once I issue a ping from the FreeBSD server out to the network, the CAM table entries are repopulated in the switches and the host becomes reachable from the rest of the network again. On Cisco switches, the default CAM table aging timeout is 5 minutes, which means a LAGG-enabled host that sends no traffic of its own can stay inaccessible for a full 5 minutes, until the stale MAC table entry for it times out. This obviously is not a very attractive scenario.
I think this is a flaw in the way LAGG operates. I believe some of the Windows drivers for NIC teaming issue a gratuitous ARP to repopulate the CAM entries when the primary NIC fails. I think LAGG should do something similar.
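
For reference, a gratuitous ARP is just a broadcast ARP request in which the sender and target IP addresses are the same, so every switch on the path relearns the source MAC. The Python sketch below only builds such a frame. The function name, MAC and IP address are made up for illustration, and actually putting the frame on the wire needs raw link-layer access (on FreeBSD that would mean going through BPF), which is not shown here.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    addr = socket.inet_aton(ip)  # 4-byte IPv4 address

    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr          # sender MAC and sender IP
    arp += b"\x00" * 6 + addr  # target MAC unset, target IP equals sender IP

    return eth + arp


# Hypothetical values for illustration only.
frame = build_gratuitous_arp("02:00:00:00:00:01", "192.0.2.10")
```

Sending a frame like this out of the newly active port right after a failover is what would let the upstream switches update their tables without waiting for the aging timer.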

Perhaps someone has a way to fix this issue (aside from running a ping from cron or something equally silly to get this to work) or has some experience setting this up differently. I would prefer not to have to lower the CAM timers on my switches for this to work. Any help is appreciated.