Archive for the ‘ESXi’ Category

Fighting ESXi 6.0 Network Disconnects–might be KB2124669?

October 7, 2015 2 comments

Recently I’ve been having some issues in my home lab where hosts seem to stop responding.  I’ve been busy, so had a tough time finding the time to verify if the hosts crashed, locked up, or if it was just a networking issue.  So quite unfortunately, I’ve been rebooting them to bring them back and keeping it on my To Do list.

Until it started happening at work.  We have a standalone box that is non clustered that has internal disks and functions as a NAS.  So it’s “very important”, but “not critical” – or it would be HA, etc.  And the same symptoms occurred there – where we of course HAD to troubleshoot it. 

Tried the following:

* Disconnect the NIC’s and reconnect

* Shut/no-shut the ports on the switch

* Access via IPMI/IDRAC – works, could access DCUI to verify host was up.  Could ping VM’s on the host, so verified those were working.

* Tried removing or adding other NIC’s via the DCUI to the vSwitch that has the management network.  No go.  Didn’t matter if the two onboard NIC’s, two add in, split across two controllers, one NIC only, etc. 

Understandably we’re pretty sure we have a networking issue – but only on the one host at this time.  The Dell R610 and R620 have no similar issues, but this is an in-house built Supermicro we use for Dev.  So the issue appears related to the host, and having only one, it’s tough to troubleshoot.

ESXi NIC’s are not the same – the host has 82574L onboard and 82571EB 4 port.  The C6100 I have is 82576 – and I haven’t confirmed it is having the same issue.  But in any case, it doesn’t seem like an issue with a certain chipset or model of NIC. 

So we’ve started looking for others having similar issues and found:

After upgrade of ESXi 5.5 to 6.0 server loses every few days the network connection

Reach out to VMware support on this issue to see if it is related to a known bug in ESXi 6.  The event logged in my experience with this same situation is "netdev_watchdog:3678: NETDEV WATCHDOG".  I did not see this in your logs but the failure scenario I have experienced is the same as you described.

Okay, not great, but sounds like the same sort of issue.  Someone then suggested that the bug is known.  Okay.   Check the release notes and:

New Network connectivity issues after upgrade from ESXi 5.x to ESXi 6.0
After you upgrade from ESXi 5.x to ESXi 6.0, you might encounter the following issues
The ESXi 6.0 host might randomly lose network connectivity
The ESXi 6.0 host becomes non-responsive and unmanageable until reboot
After reboot, the issue is temporarily resolved for a period of time, but occurs again after a random interval
Transmit timeouts are often logged by the NETDEV WATCHDOG service in the ESXi host. You may see entries similar to the following in the /var/log/vmkernel.log file:
cpu0:33245)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic0: transmit timed out
cpu0:33245)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3707/netdev_watchdog() (inside vmklinux)
The issue can impact multiple network adapter types across multiple hardware vendors. The exact logging that occurs during a transmit timeout may vary from card to card.

That sounds similar.  We’re still digging through logs and also waiting for it to occur again to catch it in the act.
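While waiting for the next occurrence, a quick way to check a saved log for the signature from the release notes is to count the transmit timeouts per NIC.  This is only a minimal sketch: the function name watchdog_timeouts is made up for illustration, and it assumes the log lines look like the sample quoted above and that grep, sed, sort, uniq and awk are available wherever you run it.

```shell
#!/bin/sh
# Print "vmnicN count" for every NIC that logged a NETDEV WATCHDOG transmit timeout.
# Reads the log file(s) given as arguments, or stdin when none are given.
# Assumption: lines follow the format quoted from the release notes.
watchdog_timeouts() {
    grep 'NETDEV WATCHDOG' "$@" \
        | sed -n 's/.*NETDEV WATCHDOG: \(vmnic[0-9][0-9]*\): transmit timed out.*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Run it as `watchdog_timeouts /var/log/vmkernel.log`, or pipe a copied log into it.  No output simply means the pattern was not found in that file.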

The next hit we found was:

If you’re seeing this in your vmkernel.log at the time of the disconnect it could be related to an issue that will one day be described at the below link (it is not live at this time). We see this after a random amount of time and nothing VMware technical support could do except reboot the host helped.

KB2124669 definitely covers the observed symptoms.  Note the date:

  • Updated: Oct 6, 2015

That’s yesterday, maybe we’re on the right track!  Now also look at the RESOLUTION:

This issue is resolved in ESXi 6.0 Update 1a, available at VMware Downloads. For more information, see the VMware ESXi 6.0 Update 1a Release Notes.

There’s a 1*A* now?


There is in fact, a 6.0U1a out.  Who knew!  Released _yesterday_.

Now, watch the top of your screen, because VMware DOES attempt to make sure you know there’s some issues (as there always are, no new release is perfect):


And in that list, includes:

Also if you haven’t bumped into it before, there’s great links here for Upgrade Best Practices, Update Sequences, etc.  DO give this a read.

Understandably, we won’t get this fix in TODAY.  We do have a maintenance cycle coming up soon, where we’ll get vCenter Server 6.0U1 in and can go to ESXi v6.0U1a – and hopefully this will fix the network issue.  If not, back to troubleshooting I guess.  Fingers crossed. 

I’ll post an update next week if we get the upgrade in and see a resolution to this. 

C6100 IPMI Issues with vSphere 6

July 15, 2015 Leave a comment

So I’m not 100% certain if the issues I’m having on my C6100 server are vSphere 6 related or not.  But I have seen similar issues before in my lab, so it may be one of a few things.

After a recent upgrade, I noted that some of my VM’s seemed “slow” – which is hard to quantify.  Then this morning I wake up to having internet but no DNS, so I know my DC is down.  Hosts are up though.  So I give them a hard boot, connect to the IPMI KVM, and watch the startup, only to see “loading ipmi_si_drv…” and it just sitting there.

In the past, this seemed to be related to a failing SATA disk, and the solution was to pop it out – which helped temporarily until I replaced the disk outright.  But these are new drives.  Trying the same here did not work, though I only tried the spinning disks and not the SSD’s.  Rather than mess around, I thought I’d find a way to see if I could disable IPMI at least to troubleshoot.

Turns out, I wasn’t alone – though just not specific to vSphere 6:

That last one is the option I took:

  • Press SHIFT+O during the Hypervisor startup
  • Append “noipmiEnabled” to the boot args

Which got my hosts up and running. 

I haven’t done any deeper troubleshooting, nor have I permanently disabled the IPMI with the options of:

Manually turn off or remove the module by turning the option “VMkernel.Boot.ipmiEnabled” off in vSphere or using the commands below:

# Do a dry run first:
esxcli software vib remove --dry-run --vibname ipmi-ipmi-si-drv
# Remove the module:
esxcli software vib remove --vibname ipmi-ipmi-si-drv

We’ll see what comes when I get more time…

Categories: C6100, ESXi, Home Lab, vSphere

General Availability of VMware ESXi 5.5.2 Patch & vSphere v5.5 U2

September 15, 2014 Leave a comment

I figured I’d spread the word, that VMware ESXi 5.5.2 patches are now out.  One can check out the original announcement right from VMware at:

Also, vSphere v5.5 Update 2 was recently announced as well.  One of the best posts I found on this was over at Vladan Seget’s blog: 

The best highlights of the release, in my mind?

  • vSphere Client (C#) will now allow editing of VMware Hardware v10 VM’s.  Note that it will only allow editing of features exposed in v8 hardware.  So you won’t be able to edit any vSphere Flash Read Cache or AHCI disks, but otherwise, you’ll largely be okay.  On the one hand, this is good as it buys us some time to stay with the C# client – on the other, it’s a bitter reminder that soon we’ll have only the Web Client…
  • vCenter Server now supports running on SQL Server 2012 SP1 and 2014
  • vCenter Server is now at v5.5 U2 – which is far better than trying to figure out if it’s on v5.5 U1a/b/c/driver-rollup.  They really need to standardize this – when is it a x.y.0(a/b/c), when is it an x.y Update, etc? 
  • Just about every product got an update to coincide with this one, including many v5.8 versions announced at VMworld 2014.
  • Horizon View v6.0.1 and Horizon Workspace v2.1 are also refreshed.

Release notes can be found at:

Downloads can be found at:

Remember to check your 3rd party software for compatibility before jumping ahead.  Software such as Trend Micro DSA or Veeam Backup & Replication may require updates before they’re 100% compatible.

As always, it’s best to clone your vCenter into an isolated environment (with a DC and DB server) to test the upgrade process with safety before you jump in all the way.

Lab Learning Lessons–01

July 27, 2014 Leave a comment

So I figured I’d start a new theme, which the title represents.  This is “Lab Learning Lessons” or things you learn in the lab, that are better learned here than in Production somewhere.  Hopefully this will help you out in the future, or if nothing else will reinforce for me that I’ve done this before.  So with that in mind – this week’s lessons!


1) I can’t find the stupid ISCSI Target!

Ever have one of those days?  Set up a new SAN, configure the NIC’s, configure ISCSI, make some LUN’s, configure your Initiator Groups, and… nothing?  Add the ISCSI target to Dynamic Discovery in the ESXi Software ISCSI Initiator and…. it never populates any Static Discovery targets like it should?  So you try to do a “vmkping <target_IP>” and sure enough, THAT works.  Worse yet, you do the SAME thing on the secondary NetApp (in this case) controller in that chassis, and IT works.  So you’re doing the right thing.  So you check against the OLD controllers – and your settings are similar as they should be.

So you change the IP addresses on the Targets and… boom.  It works.

Lesson Learned:  IP Address conflicts don’t tell you if the thing that is responding to your test pings is the device you WANT it to be.
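One way to catch this sooner is to compare the MAC address answering for the target IP against the MAC you expect from the new controller.  A rough sketch follows, assuming you have captured an ARP/neighbor table as text (for example from `esxcli network ip neighbor list`) with the IP in the first column and the MAC in the second.  The function names and that column layout are my assumptions, so check them against your own output.

```shell
#!/bin/sh
# Print the MAC address recorded for IP $1 in a neighbor table read from stdin.
# Assumption: whitespace-separated rows, IP in column 1, MAC in column 2.
mac_for_ip() {
    awk -v ip="$1" '$1 == ip { print tolower($2); exit }'
}

# Succeed only if the MAC answering for IP $1 matches the expected MAC $2.
# Reads the same neighbor table from stdin.
is_expected_device() {
    seen=$(mac_for_ip "$1")
    want=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ -n "$seen" ] && [ "$seen" = "$want" ]
}
```

If `is_expected_device` fails for the target IP right after a successful vmkping, something other than your new controller is answering on that address.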


2) Can’t vMotion VM’s.  Or create a new one.  Or create objects on a new datastore.

Sounds strange right?  The error includes “pbm.fault.PBMFault.summary” for everything you do.  VM’s are otherwise working and doing what they should be.  You can start, restart, reboot, etc.  You just can’t move them around.  A little Google-fu will suggest that you restart the vCenter Inventory and/or Profile Driven Storage service(s).  Sounds reasonable.  Except those take forever to restart.  So you reboot the vCenter server, hoping that’ll help.  No go.

Then you open Explorer…. and realize your vCenter is out of space.  Except all the services are quite happily started.  No “Automatic” services are unstarted or failing to start.  Nothing is tripping an error.  It’s just “not working”.

Lesson Learned:  Maybe if you’re not going to use a 3rd party monitoring solution (eg: Nagios, Zenoss, PRTG, SolarWinds, etc), then you should configure basic Windows Scheduled Tasks to send e-mails when a drive gets to a certain used amount.  Might save some stress.
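The check itself is tiny.  Here is the idea sketched in POSIX shell rather than as the Windows Scheduled Task mentioned above, so treat it as an illustration of the logic only.  It assumes `df -P` style input and mount points without spaces; the function name over_threshold is made up, and the e-mail step is left out.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "mountpoint used%" for every filesystem
# whose usage is at or above the percentage given as $1.
# Assumption: POSIX df -P columns (Capacity in column 5, mount point in column 6).
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

Something like `df -P | over_threshold 90` run from a scheduler, with any output piped to your mailer of choice, would have flagged the full vCenter drive well before the services quietly stopped working.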


3) IP Address Planning.

I’m big on having “predictable” IP Address standards.  If you can, have “Primary” addresses be a x.y.z.1# and “Secondary” be x.y.z.2#, or some other system that works for you.  Also if you have 4 NIC’s maybe the #’s in the previous examples should be the NIC #.  So on a NetApp, e0c and e0d would be 3 and 4, so your IP would be x.y.z.13, x.y.z.14, x.y.z.23, x.y.z.24, or something else.

The downside is you really need to be able to look at the final configuration, and work backwards.  Are you going to do one IP per NIC?  One per LACP/PortChannel?  (Not so much for ISCSI, but for NFS/CIFS).  If you do one for a virtual interface like an LACP vif – what # is it?  It’s none of NIC 1-4 (e0a/e0b/e0c/e0d).  Would you make it .10 and .20?  Maybe.  Or maybe .19 and .29, as it’s ‘odd’.

What if you have a second unit in the same place?   Is your solution scalable?  Then if your first pair of controllers is NW-SAN1 and NW-SAN2, and is .1x and .2x, then NW-SAN3 and NW-SAN4 could easily be .3x and .4x – but are you chewing up a lot of IP’s?  Maybe.  But in my opinion, it’s so worth it.  Reading logs and troubleshooting becomes amazingly simpler, as you can now logically map one device to another by IP, hostname, port and NIC.

Lesson Learned:  Plan out as much as you can in advance.   But if you can’t, try it in a simulator and work through your options.  This is why they exist, and why we have labs.
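The numbering scheme described in this lesson is simple enough to write down as a rule: last octet = controller number x 10 + NIC number.  A small sketch of that rule, where the function names and the 10.0.0 prefix used in the usage note are purely illustrative:

```shell
#!/bin/sh
# Map a NetApp-style port name to its NIC number (e0a=1 ... e0d=4).
nic_number() {
    case "$1" in
        e0a) echo 1 ;;
        e0b) echo 2 ;;
        e0c) echo 3 ;;
        e0d) echo 4 ;;
        *) return 1 ;;
    esac
}

# Build an address from: $1 = network prefix (x.y.z), $2 = controller number, $3 = port name.
plan_ip() {
    nic=$(nic_number "$3") || return 1
    echo "$1.$(( $2 * 10 + nic ))"
}
```

So `plan_ip 10.0.0 1 e0c` gives 10.0.0.13 and `plan_ip 10.0.0 2 e0d` gives 10.0.0.24, matching the examples above.  The virtual interface question (.10 vs .19) is deliberately not handled; that is still a decision you have to make.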

PernixData FVP v1.5 GA on vSphere v5.5 First Look

March 15, 2014 1 comment

So one of my most recent posts was about fixing my UUID issue on my Dell C6100 series server.  Of course, what prompted that initially and identified the problem, was PernixData’s FVP product – way back in the 0.9 Beta if I recall.  Now that I’ve gotten this solved, of course, I wanted to give FVP a try again. 

So out go some e-mails to PernixData with a request for download (go request a trial!  You’ll like it…).  A quick chat with Chris Floyd (@phloider) and Peter Chang (@virtualbacon) gets me set up with the trial again.  However, a quick look says “.. vSphere v5.0 and v5.1…”  Well that’s no good, I’m on v5.5.0 U1 (of course, why not be an early adopter).  So that looks like it’s out of the question.  Then they tell me the new version is supposed to GA on Monday March 17.  Well I can wait that long I figure.  That lasted until about 7PM on Friday, at which point I went to download the beta anyway.


I hadn’t been keeping track of the current version number (what with the UUID issue, why disappoint myself further that my hardware doesn’t like their software), so I go ahead and download the ‘beta’ figuring I’ll give it a try.  Not 10 minutes later I get an e-mail from Chris with a subject line of “New plans for the weekend…” the body of which stated: “You were the first person to download 1.5 GA. Let me know what you think.”

Well dammit.  I’m not waiting till Monday now.

First, nothing in this post should supersede what’s in the documentation – which is actually really good.  This is my notes version, and cheat sheet.  If you follow my notes and didn’t read their documentation at all – that’s on you.  With that said… let’s begin!


1) Install and configure the Management Server


I’ve chosen to install this in my lab on my vCenter server using the same svcVMware AD account.  Run PernixData FVP Management Server – 1.5.03869.0.exe and start the installation.


This really is the first screen that isn’t “Next, Next, Finish-y”. 


I’ve opted to use the same SQL_EXPRESS instance used by my vCenter Server – probably not the best way to go if in Production, but works well enough here.


Next we tell the FVP Management Server how it should be found on the network.


And then click INSTALL.


A JRE?  Yeah, go ahead and install that too if it’s needed.


2) Configure FVP


Next, you’d normally install the plug in.  The vSphere Client Plug-in for FVP v1.5 is only for vSphere v5.0 or v5.1.  For v5.5 the plug in is installed in the vSphere Web Client – and there’s nothing to do, as the installer added it to vCenter Server. 


So log in to the vSphere web client and click on vCenter.  You’ll see a PernixData FVP section at the bottom.  Click on FLASH CLUSTERS.




Name your cluster and select the cluster you want to attach it to.  Click OK.


Next you’ll see the Getting Started tab.  Click on the MANAGE tab.


It will show FLASH DEVICES.  Click ADD DEVICE.  You’ll quickly get prompted that you’re a fool and haven’t installed the software on the hosts.  Duly noted. 


3) (should have been 2) Add the FVP Extensions to the host(s)


Installation is either by uploading to the host and installing via SSH, or via VUM – which is “Experimental” at this stage.  However, I would like to see the VUM method work as it is more automated, so let’s give that a try.


In the vSphere Client, browse to HOME –> SOLUTIONS –> UPDATE MANAGER.  Click on the PATCH REPOSITORY tab.  Click IMPORT PATCHES.


Browse to where you’ve unpacked your FVP v1.5 software, and select the ESXi v5.5 update.  Click NEXT.  You may get prompted to install/accept/ignore a certificate – do so.




I’d never seen the patches not show up right away, but apparently my vCenter was busy.  Watch the RECENT TASKS pane to confirm the patches are imported.


Then confirm by entering PERNIX in the search box.

Click on the BASELINES AND GROUPS tab, and click CREATE on the BASELINE side.


Name your baseline and select HOST EXTENSION.  Click NEXT.


Search for Pernix, click the down arrow to add it to the lower window, and click NEXT.


On the READY TO COMPLETE screen, click FINISH.


If you have a Baseline Group you may want to add the Extension to your Baseline Group.  Click COMPLIANCE VIEW in the upper right to return to your hosts and clusters view.  Select your cluster and click SCAN to check for updates required.


Click REMEDIATE.  Then select only the EXTENSIONS BASELINE and select the PERNIXDATA FVP v1.5 GA baseline.  Check all applicable hosts and click NEXT.


Click NEXT, NEXT, then set your remediation options.  I like to disable removable media and set my retries for every 1 minute and 33 retries –largely because it’s easy to type/change with one hand.  Click NEXT.


Choose whatever remediation options make you happy and click NEXT and FINISH.  Then wait for the magic to happen.


4) NOW configure FVP 🙂


Now that you’ve added the extensions, let’s go back to adding devices:


Only 2 of my 4 hosts are showing up right now – that’s fine.  I’m going to choose to add my Kingston V300 120GB SSD’s (here’s hoping they work and are on the HCL), and click OK.


Now that the devices show up, click on DATASTORES/VM’s


Next we’ll click ADD DATASTORE.


Only one of my datastores is ISCSI, and FVP only accelerates block devices – FCP, FCoE, or ISCSI – so obviously no local DAS datastores either.  So select the appropriate ISCSI (in my case in the lab) datastore and caching method (Write Through or Write Back) and click OK.  As I want maximum performance I’m going to choose Write Back.


Except when I try that, it tells me all my hosts need to be ready.  So I’ll finish my FVP Extension installations and then retry.  Okay, and there we go.


Now we can not only select Write Back, but also select the Write Redundancy.  In order for Write Back to be safe, we need to select a mirror/parity for that cache on another host in case of the host with the primary cache failing.  For my lab, HOST+1 is more than enough.


Understandably, it will take a little bit of time for VM’s to start caching, and then for that cache to populate on the additional nodes.  Here you can see some VM’s are CONFIGURED for Write Back, but have a current status of Write Through. 


If we go click on MONITOR and PERFORMANCE we can start to see some stats on what’s happening.  Note that my lab isn’t very busy, so we shouldn’t expect to see much.


We can see the IOPS as well. 

So let’s go log into a VM on the datastore and run a benchmark.  I’ll use Atto Bench32 which is what I use for quick and dirty throughput tests.  Note that this is not a good IOPS test, but it does give a decently quick indication as to performance and health.


Here you can see some pretty amazing numbers.   At 4.0KB, we’re seeing 2.5x write and 2x read numbers.  By the 16.0KB block size, it’s not even fair any more.  That’s not bad for a couple of $70 SSD’s.


But let’s look at what the FVP console gives us.  First we get a wealth of metrics that the vSphere performance monitor alone doesn’t give us.  You can clearly see that the VM was able to observe almost 9000 IOPS – which is nothing to bitch about. 


So based on this, I’m pretty happy.  I do have to do more testing, get some tweaks in, and better understand the settings.  But clearly I’m going to be able to push the lab a little harder. 


Observations and Conclusion:


For my needs, in my lab, speed is critical.  While I’m by no means business centric, “time is money” and the faster the equipment is, the more things I can do, which means the more I can test and the more I can learn.  I already know how to watch progress bars – so anything I can do to reduce that, will maximize my time.

Secondly, this is pretty amazing for the cost of 4x $70 120GB SSD’s.  Would you use this class of consumer grade MLC in Production with FVP?  Probably (hopefully) not.  But you could make an argument to do so, and just treat them like printer toner cartridges and replace them periodically – as long as that period didn’t fail at the worst time or require a large amount of time swapping SSD’s.

Clearly, I’ve solved the C6100 duplicate UUID/Service Tag problem.

I’ll be doing additional testing in a bit.  But after hearing I was the first to download the GA code, I wanted to be the first to get something up about it.  Hopefully this will help someone else get started up quickly and easily. 

It’s late – time for bed.  But this post was a long time coming – damned C6100 UUIDs…

Categories: C6100, ESXi, PernixData, SSD, vSphere

HOWTO: Correct vSphere ESXi "the ramdisk ‘tmp’ is full"

March 14, 2014 Leave a comment

Tonight we came across an odd error on one of the hosts. We use Nagios for monitoring, and only one host was exhibiting an error checking networking on the host:

Can’t call method "network" on an undefined value at ./ line 865

Checking the host EVENTS tab, did show the following errors:


The following VMware KB article gives details on the issue:


Following the suggestion from the KB, we see that sure enough, the ramdisk for tmp IS full.


Appears to be the /tmp/scratch/downloads folder in question.


Don’t suppose at 1:25PM the logs were exported by any chance? That’s the only supposition I have at the moment.


I can confirm at the Nagios console that it is having issues with just that host. Note that the Perl method Nagios uses for checking is horrible, as it logs into the host as root to get its stats. I’d fix it to use SOAP or vSphere API’s, but “Nagios is going away” so it seems like I’d just be wasting effort to invest in Nagios.


After removing the offending file in /tmp/scratch/downloads, Nagios is able to run its checks.
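Before deleting anything next time, it would be worth recording what is actually eating the space.  A minimal sketch, assuming find, du, sort and head behave on the host the way they do on a regular Linux box (the function name largest_files is made up):

```shell
#!/bin/sh
# List the $2 largest regular files under directory $1, biggest first,
# one per line as "size_in_KB<TAB>path".
largest_files() {
    find "$1" -type f -exec du -k {} \; | sort -rn | head -n "$2"
}
```

For this case, `largest_files /tmp 10` would be the starting point.  It shows what filled the ramdisk and when the files were last touched can then be checked with ls -l, but it still won’t tell you which process put them there.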

Now – does anyone know how to find out WHY it was full? 

Categories: ESXi, nagios, vSphere

HOWTO: Dell C6100 FRU / UUID Update–FINALLY!

March 13, 2014 2 comments

So this post has been a LONG time coming, and I’m pretty sure I’m good to go now.

As you know, the Dell C6100 is a great 4 node in 2U chassis, which works really well for a compact home lab (if you can stand the noise).  vSphere likes it, Hyper-V likes it, what’s to complain about?

Then I tried the beta of PernixData FVP.  It worked as advertised, was a simple installation, did what it was supposed to – kind of.  I noticed that it seemed like only the very last node I rebooted was the one with FVP running on it.  I did some tests, did some more installations, and watched as the next host I rebooted became the only one with the software running. 

So, given it was beta, I reached out to support – and support from PernixData was great.  Given all the troubleshooting I’d done, I gave them all the information I could find: screenshots, logs, processes, steps and sequences.  I’ll be damned if they didn’t come back pretty quickly with a suggestion – I must have duplicate UUID’s on the hosts.  Bollocks I say, ESXi has been happy, no complaints, no worries, whatever do you mean.

Support says “browse to https://<host>/mob/?moid=ha-host&doPath=hardware%2esystemInfo, and confirm the UUID string is different on each host”.  No problem:





Well I’ll be damned –

uuid string "4c4c4544-0038-5410-8030-b4c04f4d4c31"
On all 4 nodes.  Okay so that IS my problem. 
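With four nodes you can eyeball this, but with more hosts it helps to let the shell do the comparison.  A small sketch, assuming you have copied the values from each host’s MOB page into “hostname uuid” lines by hand; the function name and the input format are my own invention:

```shell
#!/bin/sh
# Read "hostname uuid" pairs on stdin and print every UUID that is
# reported by more than one host (case-insensitive comparison).
duplicate_uuids() {
    awk 'NF >= 2 { print tolower($2) }' | sort | uniq -d
}
```

Any output at all means at least two hosts share a UUID, which is exactly the situation PernixData support spotted here.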

VMware even has a KB on it –  Not that this is a “Whitebox”, but it certainly is an OEM custom, by definition.  So we’ll go with that. 

See, on a C6100 you have a typical Dell Service Tag – eg: ABC123A for the chassis.  But each ‘sled’ has a .# after it.  So you’ll have ABC123A.1, ABC123A.2, ABC123A.3, and ABC123A.4.  Turns out this makes ESXi assign the same UUID.  Some Googling tells me that this is also apparently an issue for SCVMM and SCSM.  As DCS never really intended these systems to end up in “Enterprise” or “Home Lab” hands, but very large cloud providers, there’s no reason to care.  And fairly enough, it didn’t have any impact on my normal vSphere lab. 

Now.  How the heck do you update it?  The BIOS doesn’t give you an option.  Some posts on the internet suggest you could upload a new BIOS and specify it then, but that didn’t work out.  Dell was no help – and I don’t fault them one bit.  The system is used, off warranty, and used by someone it wasn’t intended to be supported by.  That’s fully on me, I have no complaints.  But I still wanted it fixed.

I spend a lot of time at and this is a good place for a wealth of C6100 information.  A thread caught my attention where it noted these issues.  One particular post by TehSuk caught my attention –  Apparently you can just run the Windows version of IPMIUTIL.exe with the following options:

ipmiutil.exe fru -s %newassettag%

Reboot, and you’re good to go.  No such luck.  See, the user in question notes that he’s a Windows shop.  No such luck with ESXi.  So I tried making a DTK bootable ISO from Dell using some information they had, but that wasn’t working.  Various issues from the methods being written a while back and not supported on Windows 8 (which took me a bit to figure out that was my issue) to the tools having issues with creating a 32bit ISO on a 64bit system due to environment variables, DLL’s not found, etc.  Nothing the end of the world, but I didn’t like that path. 

Then I remembered that you can use IPMIUTIL.exe across a network.  I had no luck when I tried months ago, so why would it work now?   Other than I’ve now spent more time playing with the utility. 



ipmiutil.exe fru -N <hostname/IP> -U <user> -P <password>

Was able to get me a listing which included “Product Serial Num”.  So could I use the same “fru -s %SERNUM%” suggested by TehSuk? 

ipmiutil.exe fru -s AAAAAA3 -N <hostname/IP> -U <user> -P <password>


Sure enough, it will change “Product Serial Number” to AAAAAA3.  So let’s reboot and find out what it says.
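If you have several sleds to fix, it can help to generate the command lines first and review them before running anything.  This sketch only prints the commands, using the same fru -s, -N, -U and -P options shown above.  The function name, the BMC addresses and the serials in the usage note are all hypothetical, and the password is deliberately left as a placeholder.

```shell
#!/bin/sh
# Print (do not run) one ipmiutil command per node.
# $1 = IPMI user; stdin = "bmc_address new_serial" lines.
# The password stays a literal <password> placeholder to be filled in by hand.
fru_serial_commands() {
    while read -r bmc serial; do
        [ -n "$bmc" ] && [ -n "$serial" ] || continue
        printf 'ipmiutil.exe fru -s %s -N %s -U %s -P <password>\n' "$serial" "$bmc" "$1"
    done
}
```

For example, feeding it the lines “10.0.0.101 AAAAAA1” and “10.0.0.102 AAAAAA2” with user root prints two ready-to-review commands, one per sled.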

After updating the first 3 nodes, and checking the MOB link, looks like we have success:









No need to change the fourth node – leave it with the original Service Tag, as it no longer conflicts. 


So in the end, all you’re going to need is:

And run the above IPMIUTIL.exe FRU commands, and you should be good to go.  I haven’t checked if PernixData FVP now works better for me yet as it’s late – but here’s hoping it does.  If nothing else, the UUID’s are now different, as they should be!

BTW, please don’t read any of this as though I was disappointed with PernixData FVP – heck, if anything they helped me find this issue, pointed me in the right direction, and I wanted their software to work because my testing showed it made an AMAZING difference.   I’m looking forward to retrying the software across all 4 nodes.