Fighting ESXi 6.0 Network Disconnects–might be KB2124669?

October 7, 2015

Recently I’ve been having issues in my home lab where hosts seem to stop responding. I’ve been busy, so I had a tough time finding the time to verify whether the hosts crashed, locked up, or just had a networking issue. So, unfortunately, I’ve been rebooting them to bring them back and leaving the investigation on my To Do list.

Until it started happening at work. We have a standalone, non-clustered box with internal disks that functions as a NAS. So it’s “very important” but “not critical”, or it would be HA, etc. The same symptoms occurred there, and this time we of course HAD to troubleshoot it.

We tried the following:

* Disconnecting and reconnecting the NICs

* Shut/no-shut on the switch ports

* Access via IPMI/iDRAC: this worked, and we could reach the DCUI to verify the host was up. We could also ping VMs on the host, so those were still working.

* Removing or adding other NICs via the DCUI on the vSwitch that carries the management network. No go. It didn’t matter whether we used the two onboard NICs, the two add-in NICs, a split across two controllers, one NIC only, etc.

So we’re pretty sure we have a networking issue, but only on the one host at this time. The Dell R610 and R620 have no similar issues, but this one is an in-house built Supermicro we use for Dev. So the issue appears related to the host, and with only one of them, it’s tough to troubleshoot.

The NICs involved are not the same: the host has an onboard 82574L and a 4-port 82571EB. The C6100 I have uses the 82576, and I haven’t confirmed it is having the same issue. In any case, it doesn’t seem to be tied to a certain chipset or model of NIC.

So we’ve started looking for others having similar issues and found:

https://communities.vmware.com/thread/517399

After upgrade of ESXi 5.5 to 6.0 server loses every few days the network connection

Reach out to VMware support on this issue to see if it is related to a known bug in ESXi 6.  The event logged in my experience with this same situation is "netdev_watchdog:3678: NETDEV WATCHDOG".  I did not see this in your logs but the failure scenario I have experienced is the same as you described.

Okay, not great, but it sounds like the same sort of issue. Someone then suggested that the bug is a known one. Okay. So we checked the release notes and found:

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-esxi-60u1-release-notes.html

New Network connectivity issues after upgrade from ESXi 5.x to ESXi 6.0
After you upgrade from ESXi 5.x to ESXi 6.0, you might encounter the following issues
The ESXi 6.0 host might randomly lose network connectivity
The ESXi 6.0 host becomes non-responsive and unmanageable until reboot
After reboot, the issue is temporarily resolved for a period of time, but occurs again after a random interval
Transmit timeouts are often logged by the NETDEV WATCHDOG service in the ESXi host. You may see entries similar to the following in the /var/log/vmkernel.log file:
cpu0:33245)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic0: transmit timed out
cpu0:33245)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3707/netdev_watchdog() (inside vmklinux)
The issue can impact multiple network adapter types across multiple hardware vendors. The exact logging that occurs during a transmit timeout may vary from card to card.
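
Since we’re digging through logs anyway, here is a minimal sketch of how one might count these watchdog timeouts per NIC in a copy of vmkernel.log. The pattern is based only on the sample line quoted above; as the release notes say, the exact logging may vary from card to card, so treat the regular expression and the function name `count_watchdog_timeouts` as assumptions for illustration, not something VMware provides.

```python
import re
from collections import Counter

# Pattern derived from the sample log line quoted in the release notes.
# Assumption: other drivers may word their timeouts differently, so a
# count of zero does not prove the host is unaffected.
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (vmnic\d+): transmit timed out")


def count_watchdog_timeouts(lines):
    """Return a Counter mapping NIC name to number of transmit timeouts."""
    hits = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Small demo using the two lines quoted in the release notes.
sample_lines = [
    "cpu0:33245)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic0: transmit timed out",
    "cpu0:33245)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3707/netdev_watchdog() (inside vmklinux)",
]
for nic, count in sorted(count_watchdog_timeouts(sample_lines).items()):
    print(f"{nic}: {count} transmit timeout(s)")
```

To run it against a real log, pull a copy of /var/log/vmkernel.log off the host and pass the open file object to `count_watchdog_timeouts` in place of `sample_lines`.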

That sounds similar.  We’re still digging through logs and also waiting for it to occur again to catch it in the act.
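
To catch it in the act, something like the following could run on another machine and note when the management interface drops and comes back. This is only a rough sketch under assumptions: it probes TCP port 443 on the management address (the VMs kept answering pings during our outage, so pinging guests would tell us nothing), and the function names and the 30 second interval are made up for illustration.

```python
import socket
import time


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time.

    Assumption: the ESXi management interface answers on TCP 443.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def state_changes(samples):
    """Keep only the (timestamp, reachable) pairs where the state flips.

    The first sample is always kept so the initial state is recorded.
    """
    changes = []
    previous = None
    for stamp, reachable in samples:
        if reachable != previous:
            changes.append((stamp, reachable))
            previous = reachable
    return changes


def watch(host, interval=30):
    """Poll forever and print a line each time reachability changes."""
    previous = None
    while True:
        reachable = is_reachable(host)
        if reachable != previous:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, host, "UP" if reachable else "DOWN")
            previous = reachable
        time.sleep(interval)
```

Calling `watch()` with the host's management address prints one timestamped line per transition, which gives a time window to line up against vmkernel.log after the next disconnect.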

The next hit we found was:

https://communities.vmware.com/message/2525461#2525461

If you’re seeing this in your vmkernel.log at the time of the disconnect it could be related to an issue that will one day be described at the below link (it is not live at this time). We see this after a random amount of time and nothing VMware technical support could do except reboot the host helped.

http://kb.vmware.com/kb/2124669

KB2124669 definitely covers the observed symptoms.  Note the date:

  • Updated: Oct 6, 2015

That’s yesterday, so maybe we’re on the right track! Now also look at the RESOLUTION:

This issue is resolved in ESXi 6.0 Update 1a, available at VMware Downloads. For more information, see the VMware ESXi 6.0 Update 1a Release Notes.

There’s a 1*A* now?

https://my.vmware.com/group/vmware/details?downloadGroup=ESXI600U1&productId=491&rPId=8764#errorCheckDiv

There is, in fact, a 6.0U1a out. Who knew! Released _yesterday_.

Now, watch the top of your screen, because VMware DOES attempt to make sure you know there are some issues (as there always are; no new release is perfect):

http://kb.vmware.com/kb/2131738

And in that list, includes:

Also, if you haven’t bumped into it before, there are great links here for Upgrade Best Practices, Update Sequences, etc. DO give this a read.

Understandably, we won’t get this fix in TODAY. We do have a maintenance cycle coming up soon, where we’ll get vCenter Server 6.0U1 in and can then go to ESXi 6.0U1a. Hopefully this will fix the network issue. If not, back to troubleshooting I guess. Fingers crossed.

I’ll post an update next week if we get the upgrade in and see a resolution to this. 
