I'm sure I won't be providing nearly enough information, but hopefully I can fill in the blanks where needed. We had an issue in our ESXi 5.1 environment a couple of days ago where our Cisco Nexus switches observed a network loop and went into some state where traffic wasn't being properly forwarded (presumably STP kicked in and stopped forwarding on certain key uplinks). The solution ended up being a reboot of one of the Nexus switches. We aren't the network team, but we're trying to wrap our heads around what may or may not have happened.
We have a Dell Blade Center with dual Force10 switches in it. Each blade has one connection to each of these switches, and each switch has two uplinks to separate Nexus 5K switches. We run numerous VLANs over these uplinks, carrying both storage (NFS) and production traffic. Ideally we'd have physical separation between storage and front-end traffic, but these are 10GbE links and things were architected to send everything over the two links with VLAN segmentation only. We're nowhere close to saturating the links.
Within ESXi there are two dvSwitches using NIC teaming, with traffic balanced via "Route based on physical NIC load" (load-based teaming). Thinking about it now, I'm not sure that's the right setting for accessing NFS datastores on our NetApp filers, but it's how things were set up originally and it has been working fine for a long time.
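In case it helps, this is roughly how I've been sanity-checking the teaming policy on the port groups from vCenter. It's just a quick pyVmomi sketch (the vCenter hostname and credentials are placeholders, and I'm assuming the VMware-flavored distributed port group types):

```python
# Quick pyVmomi sketch to dump the uplink teaming policy on each dvPortgroup.
# Hostname/credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; don't skip cert checks in prod
si = SmartConnect(host="vcenter.example.local", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    for pg in view.view:
        cfg = pg.config.defaultPortConfig
        # On a VMware dvSwitch the teaming config lives under uplinkTeamingPolicy;
        # "loadbalance_loadbased" corresponds to "Route based on physical NIC load".
        team = getattr(cfg, "uplinkTeamingPolicy", None)
        if team:
            print(pg.name,
                  team.policy.value,
                  "active:", team.uplinkPortOrder.activeUplinkPort,
                  "standby:", team.uplinkPortOrder.standbyUplinkPort)
    view.DestroyView()
finally:
    Disconnect(si)
```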
Our network admins tell us that at some point the MAC address associated with the Storage vmkernel NIC showed up in two different places on the network. The Nexus switches interpreted this as a loop, and apparently that is what triggered all of our issues.
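The way I picture the switch side of this (just my toy model, not how NX-OS actually implements it) is that the switch learns each source MAC against a port, and if the same MAC keeps being re-learned on a different port it looks like a MAC flap, which past some threshold is treated as a possible loop:

```python
# Toy model of switch MAC learning, only to illustrate why one MAC seen on two
# ports looks like a loop to the switch. Not NX-OS's actual behavior or thresholds.
from collections import defaultdict

class MacTable:
    def __init__(self, move_threshold=3):
        self.table = {}                 # mac -> port it was last learned on
        self.moves = defaultdict(int)   # mac -> number of times it changed ports
        self.move_threshold = move_threshold

    def learn(self, mac, port):
        prev = self.table.get(mac)
        self.table[mac] = port
        if prev is not None and prev != port:
            self.moves[mac] += 1
            if self.moves[mac] >= self.move_threshold:
                # A MAC rapidly flapping between ports is indistinguishable
                # from a forwarding loop, so the switch reacts defensively.
                print(f"{mac} flapping between {prev} and {port} -- possible loop")

table = MacTable()
# Same vmkernel MAC arriving alternately on the two uplinks facing the Force10s:
for port in ["Eth1/1", "Eth1/2", "Eth1/1", "Eth1/2"]:
    table.learn("00:50:56:aa:bb:cc", port)
```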
I'm trying to wrap my head around how that MAC could have shown up in two places at once. Could our use of physical NIC load balancing be the cause? You'd think it would expose the MAC out only one uplink and wouldn't flip it between the two ports unless there was sufficient load (it's highly unlikely we'd ever see 75% utilization on a 10GbE link in our environment), as in the sketch below.
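My (possibly wrong) mental model of load-based teaming is something like this: the vmkernel port stays pinned to one uplink and only gets migrated when that uplink's average utilization exceeds roughly 75% over the evaluation window (about 30 seconds). This is purely illustrative, not VMware's actual algorithm, but it's why I'd be surprised if LBT alone moved the storage vmkernel's MAC in our environment:

```python
# Rough sketch of my understanding of load-based teaming (LBT). Illustrative only.
LBT_THRESHOLD = 0.75  # fraction of link capacity

def should_migrate(samples, link_capacity_gbps=10.0):
    """samples: per-second throughput readings (Gbps) for the currently pinned uplink."""
    mean_utilization = sum(samples) / (len(samples) * link_capacity_gbps)
    return mean_utilization > LBT_THRESHOLD

# Our typical storage load is nowhere near 7.5 Gbps sustained, so under this
# model the storage vmkernel's MAC should never move to the other uplink:
quiet_window = [1.2] * 30           # ~1.2 Gbps sustained for 30 seconds
busy_window = [8.5] * 30            # would have to sustain >7.5 Gbps to trigger a move
print(should_migrate(quiet_window))  # False
print(should_migrate(busy_window))   # True
```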
Too many gaps in the above tale? I'm just looking for ideas on where to start troubleshooting. We're working with VMware Support, Dell, and Cisco, but none of them are seeing anything obviously wrong in "their" layer of the stack.