HOWTO: Troubleshoot high CIFS IOPS on NetApp

In my office, we’ve been keeping an eye on some high usage on our EDM based NetApp, particularly due to backup times with NDMP and NetBackup to tape.  In doing so, from time to time, I see other issues.  One that comes up is an unusually high CIFS (Windows File Shares) usage.  In order to troubleshoot this, and get to the root of the issue – much like one might with SFLOW or a sniffer on the network – these are the steps you can take.

1) Obtain the system stats on the NetApp from the console by running “sysstat -c 12 -x 10”.  -c 12 requests 12 iterations or “counts”, -x selects the extended output format, and the trailing 10 is the interval, i.e. one sample every 10 seconds.

[Screenshot: sysstat output with the highlighted columns described below]

I’m sure that at first, these results look a little confusing – so I’ll break them down:

· GREEN shows the NFS OPS per Second.  Because VM backups are occurring at night (during this example), it is expected that there will be high NFS load on the system.

· ORANGE shows the NET KB/Sec OUT: the traffic generated by VMs on NFS performing READS FROM the SAN, hence NET OUT traffic.

· PURPLE shows the DISK KB/Sec READ.  We expect this to be high because of the NFS OPS, but also because of…

· GRAY shows the TAPE KB/sec WRITE, which is NDMP traffic OUT, WRITING to the tapes.

So all of this is largely expected and accounted for.  But why is CIFS, shown in RED, so high?  CIFS is used by Windows users, not by NFS clients; it is not part of internal SAN activity (eg: “aggr scrub”, “vol move”, “reallocate -p”, SnapMirror/Vault, etc), and it is not part of NDMP or the backups.  So, what is the activity?

(Also worth noting: Cache Age is 0s, indicating we are massively overrunning the available system cache.  Otherwise it would show how many seconds or minutes of cache exist; 0s means none.)
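To put a number on “so high”, the per-interval figures can be averaged over the whole run. The following is only a minimal sketch: the sample values are made up for illustration and would be typed in by hand from the NFS and CIFS columns, and the helper names are hypothetical.

```python
# Hypothetical samples, one dict per sysstat interval. The numbers are
# illustrative only and are NOT taken from the screenshot.
samples = [
    {"nfs": 4200, "cifs": 1800},
    {"nfs": 3900, "cifs": 2100},
    {"nfs": 4100, "cifs": 1950},
]


def sampling_window(count, interval_s):
    """Total wall-clock time covered by `count` samples of `interval_s` seconds."""
    return count * interval_s


def protocol_share(rows, protocol):
    """Fraction of all listed protocol ops that belong to `protocol`."""
    total = sum(sum(row.values()) for row in rows)
    if total == 0:
        return 0.0
    return sum(row[protocol] for row in rows) / total


if __name__ == "__main__":
    print("window (s):", sampling_window(12, 10))
    print("CIFS share: {:.1%}".format(protocol_share(samples, "cifs")))
```

With “-c 12” and a 10 second interval the run covers 120 seconds, so a CIFS share that stays high across the whole window is sustained load rather than a single spike.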

2) From a Windows server where you are logged in with an Administrator account (to make pass-thru authentication to the SAN work better), open Computer Management:

[Screenshot: Computer Management]

Right click on COMPUTER MANAGEMENT in the MMC and choose “CONNECT TO ANOTHER COMPUTER”.  Enter the name of the NetApp controller, eg: NETAPP1.  Because NetApp has licensed APIs and functionality from Microsoft for their Domain participation and CIFS sharing, much of the CIFS portion of the controller can be managed as if it were a remote Windows Server.

3) Expand SYSTEM TOOLS -> SHARED FOLDERS -> SESSIONS:

[Screenshot: Sessions view]

Click on the # OPEN FILES header to sort by the highest number.

As you can see, we now know the USERNAME and COMPUTERNAME that has the highest number of files open.  The immediate questions that come to mind here are:

· Why so many users with more than one connection?

· Why so many open files?

· What activity might be causing the high CIFS IO?  An open file doesn’t cause IO by itself; it simply holds a lock.
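If the Sessions list is long, the first two questions can be answered offline. This is a rough sketch, assuming the rows have been copied out by hand as (user, computer, open files) tuples; the names and counts below are invented, and `summarise_sessions` is a hypothetical helper, not part of any NetApp or Windows tool.

```python
from collections import defaultdict

# Hypothetical rows copied by hand from the Sessions view:
# (user, computer, open_files). All values are made up.
sessions = [
    ("user_a", "PC-01", 412),
    ("user_b", "PC-02", 3),
    ("user_a", "PC-07", 95),
    ("user_c", "PC-03", 40),
]


def summarise_sessions(rows):
    """Return (user, connection count, total open files), busiest user first."""
    per_user = defaultdict(lambda: [0, 0])
    for user, _computer, open_files in rows:
        per_user[user][0] += 1           # one more connection for this user
        per_user[user][1] += open_files  # running total of open files
    return sorted(
        ((user, conns, files) for user, (conns, files) in per_user.items()),
        key=lambda item: item[2],
        reverse=True,
    )


if __name__ == "__main__":
    for user, conns, files in summarise_sessions(sessions):
        print(f"{user}: {conns} connection(s), {files} open file(s)")
```

Grouping by user rather than by session makes users with several connections stand out, which sorting on the # OPEN FILES column alone does not show.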

4) Expand SYSTEM TOOLS -> SHARED FOLDERS -> OPEN FILES:

[Screenshot: Open Files view]

If we sort by ACCESSED BY, and look for the highest users by # of open files in the previous step, we know to look for username “saxxx.xxxxx”.  Looking at the files opened, we can reasonably assume that there is something going on with an ArcGIS GDB (GeoDatabase), by the number of GDB files open (had to be obfuscated, sorry).  Likely this user is either a) active or b) running some long-running task overnight.

User “pnxxxx” however, just appears to have open files.

The difficulty here is that none of this tells you WHAT the users are doing.  But it gives you a reasonable place to go and look and investigate.

HOWEVER, something to keep in mind: if a user is, say, copying a large number of files (eg: robocopy backup, zip archive, etc), the above method MAY NOT find it.  That activity will look like the user opening one file after the other, closing each before opening the next, and may never appear as the user having more than one file open at any given time.  These methods above are intended to be a guide, not a solution.

The downside to all of this?  Users who hammer on the system over the weekend affect the performance of backups.  We try not to do backups during the week so that the user experience is good, but the reverse is not always true when we need backups to get the highest priority and performance.  All of that said, the system is there for the users to perform work for the company, so it is an understandably necessary evil.

5) There IS an option that one can set to get more information though:

NETAPP1> options cifs.per_client_stats.enable
cifs.per_client_stats.enable off

However, NetApp recommends leaving “per_client_stats” disabled unless needed, as tracking these stats puts more load on the system and can thus slow it down in a situation where you are already troubleshooting poor performance.  It is worth knowing it exists, in case NetApp Support asks you to enable it for troubleshooting.

To enable the option, simply run:

NETAPP1> cifs top
The cifs.per_client_stats.enable option must be on to use “cifs top”
NETAPP1> options cifs.per_client_stats.enable on

As you can see, “cifs top” will not provide any useful information until “per_client_stats” are enabled.  You can safely disable them when you’re done troubleshooting.

NETAPP1> cifs top
ops/s  reads(n, KB/s) writes(n, KB/s) suspect/s   IP              Name
   553 |      0     0 |       0     0 |        0 | 172.21.250.45     DOMAIN\svcspaceobserver
    19 |      0     0 |       0     0 |        0 | 172.23.0.67       DOMAIN\jasxxxxx
    10 |      0     0 |       0     0 |        0 | 172.23.0.67       DOMAIN\jasxxxxx
     2 |      1   145 |       0     0 |        0 | 172.21.250.30     DOMAIN\cmxxxx
     0 |      0     0 |       0     0 |        0 | 172.21.1.48       DOMAIN\lawxxxx
     0 |      0     0 |       0     0 |        0 | 172.22.0.65       DOMAIN\saxxxxx
     0 |      0     0 |       0     0 |        0 | 172.22.17.76      DOMAIN\derxxxx
     0 |      0     0 |       0     0 |        0 | 172.21.61.133     DOMAIN\nexexxxxxx
     0 |      0     0 |       0     0 |        0 | 192.168.52.66     DOMAIN\aroxxxx
     0 |      0     0 |       0     0 |        0 | 172.22.17.56      DOMAIN\armaxxxx

When you run “cifs top” you may need to give the system some time to collect those “per_client_stats”, perhaps 60-120 seconds.  But then what you see is shown above.  Clearly, “DOMAIN\svcspaceobserver” is the biggest culprit here, at 553 OPS/second.  You can see it is NOT doing a lot of KB/sec read or write, but simply crawling the file system results in a lot of “operations”.  This would be one of those situations that would not show up as a high number of open files in Computer Management, as it is massively sequential access, one operation at a time.
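If you capture the “cifs top” output to a text file, ranking clients can be scripted. The sketch below assumes the column layout is exactly as shown above (ops/s, read count and KB/s, write count and KB/s, suspect/s, IP, name); the regular expression and the function names are my own, so treat it as a starting point only.

```python
import re
from collections import defaultdict

# A few rows pasted from the output above (raw string keeps the backslashes).
SAMPLE = r"""
ops/s  reads(n, KB/s) writes(n, KB/s) suspect/s   IP              Name
   553 |      0     0 |       0     0 |        0 | 172.21.250.45     DOMAIN\svcspaceobserver
    19 |      0     0 |       0     0 |        0 | 172.23.0.67       DOMAIN\jasxxxxx
    10 |      0     0 |       0     0 |        0 | 172.23.0.67       DOMAIN\jasxxxxx
     2 |      1   145 |       0     0 |        0 | 172.21.250.30     DOMAIN\cmxxxx
"""

# ops | reads n, KB/s | writes n, KB/s | suspect | IP  name
ROW = re.compile(
    r"^\s*(\d+)\s*\|\s*(\d+)\s+(\d+)\s*\|\s*(\d+)\s+(\d+)\s*\|"
    r"\s*(\d+)\s*\|\s*(\S+)\s+(\S+)\s*$"
)


def parse_cifs_top(text):
    """Parse data rows into dicts; the header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        ops, _rn, read_kb, _wn, write_kb, _suspect = (
            int(g) for g in match.groups()[:6]
        )
        rows.append({
            "ops": ops,
            "read_kb": read_kb,
            "write_kb": write_kb,
            "ip": match.group(7),
            "name": match.group(8),
        })
    return rows


def ops_by_user(rows):
    """Sum ops/s per user name, highest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["name"]] += row["ops"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, ops in ops_by_user(parse_cifs_top(SAMPLE)):
        print(f"{ops:6d}  {name}")
```

Summing per user matters because one user can appear on several rows, as DOMAIN\jasxxxxx does here with two entries from the same IP.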

Don’t forget to DISABLE the “per_client_stats” once you’re done troubleshooting, as there is no point collecting this information if it will not be used.

NETAPP1> options cifs.per_client_stats.enable off
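If you script this, it is worth making the “off” step impossible to forget. Here is a minimal sketch using a Python context manager; `run` is a hypothetical callable that sends one command line to the controller console (for example over SSH) and is NOT a NetApp API. The demo uses a fake that only records the commands.

```python
from contextlib import contextmanager

OPTION = "cifs.per_client_stats.enable"


@contextmanager
def per_client_stats(run):
    """Turn the option on, and always turn it off again on exit.

    `run` is an assumed helper: it takes one console command as a string
    and returns the command output.
    """
    run(f"options {OPTION} on")
    try:
        yield
    finally:
        # Runs even if the body raises, so the stats are never left enabled.
        run(f"options {OPTION} off")


# Demo with a fake runner that just records what would have been sent.
sent = []


def fake_run(command):
    sent.append(command)
    return ""


with per_client_stats(fake_run):
    fake_run("cifs top")

print(sent)
```

In real use you would wait the 60-120 seconds mentioned above before running “cifs top” inside the block.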

So the short moral of the story?  Don’t run Space Observer on a share on a NetApp during backups – or it will just compound poor backup performance.   Hopefully this information might help you troubleshoot high CIFS activity in the future.
