Archive for the ‘Storage’ Category

HOWTO: Migrate RAID types on an Equallogic array

October 17, 2014

I’ve run into a situation where I need to change RAID types on an Equallogic PS4100 in order to free up some much-needed space.  Equallogic supports on-the-fly migration as long as you follow a supported migration path:


  • RAID 10 can be changed to RAID50 or RAID6
  • RAID 50 can be changed to RAID6
  • RAID 6 cannot be converted.

By changing from RAID50 to RAID6 on a 12x600GB SAS unit, we can go from 4.1TB to 4.7TB, which frees up some space and buys this environment some extra life.

1) Log in to the array, click on the MEMBER, then MODIFY RAID CONFIGURATION:


Note that the current RAID configuration is shown as “RAID 50” and STATUS=OK.

2) Select the new RAID Policy of RAID6:


Note the change in space – from 4.18TB to 4.69TB, and a net change of 524.38GB, or about 12% extra space.  Click OK.
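A quick sanity check on those figures – this is my own arithmetic, not something the array reports:

```python
# Space gained going from RAID 50 to RAID 6 on this member, using the
# before/after figures shown in the GUI.
old_tb, new_tb = 4.18, 4.69
gain_gb = (new_tb - old_tb) * 1024            # ~522GB (the GUI says 524.38GB)
gain_pct = (new_tb - old_tb) / old_tb * 100   # ~12%
print(f"Gain: ~{gain_gb:.0f}GB ({gain_pct:.1f}%)")
```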

3) During the conversion, the new space is not available – which should be expected:


After the conversion, the space will be available.  Until then, the array status will show as “expanding”.  Click OK.

4) You can watch the status and see that the RAID Status does indeed show “expanding” and a PROGRESS of 0%:


After about 7 hours, we’re at 32% complete.  Obviously this will depend on the amount of data, the size of the disks, the load on the array, etc.  But we can safely assume this will take at least 24 hours to complete.

5) When the process completes, you will see that the RAID Status is OK and the MEMBER SPACE area shows free space:


Understandably, you now want to use this space.  It won’t be automatically applied to your existing Volumes/LUN’s, so you’re left with two obvious choices – grow an existing volume or create a net-new one.  Since creating a net-new volume is straightforward, I’ll demonstrate how to grow an existing one.

6) On the bottom left of the interface, select VOLUMES:


Then in the upper left, expand the volumes:


Select the volume you wish to grow.  I’ll choose EQVMFS1.


Click MODIFY SETTINGS and then the SPACE tab.  Change the volume size accordingly; the dialog indicates the maximum size available (1.34TB).  I would highly recommend reserving at least a small portion of space – if you ever completely fill a volume, you may need to grow it slightly just to be able to mount it.  Even if small, always leave an escape route.

Click OK.


You are warned to create a snapshot first.  As these volumes are empty, we won’t need to do this.  Click NO.


Note the volume size now reports as 1.3TB.

7) Next, we go to vSphere to grow the volume. 

Right click on the CLUSTER and choose RESCAN FOR DATASTORES:


Next, once that completes (watch the Recent Tasks panel), select a host with the volume mounted and go to the CONFIGURATION -> STORAGE tab.  Right click on the volume and choose PROPERTIES.
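If you would rather script that cluster-wide rescan than click through it, a rough pyVmomi sketch along these lines should do it – the vCenter address, credentials, and cluster name below are placeholders, and this is my own untested outline rather than anything from the original walkthrough:

```python
# Rescan HBAs and VMFS volumes on every host in a cluster - roughly what
# "Rescan for Datastores" on the cluster does. Placeholders throughout.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only: skip certificate checks
si = SmartConnect(host="vcenter.lab.local", user="administrator",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        if cluster.name != "Cluster01":          # placeholder cluster name
            continue
        for host in cluster.host:
            storage = host.configManager.storageSystem
            storage.RescanAllHba()               # pick up the resized LUN
            storage.RescanVmfs()                 # refresh VMFS volumes
            print("Rescanned", host.name)
finally:
    Disconnect(si)
```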


8) Click INCREASE on the next window:


Then select the LUN in question:


NOTE that in this example, I’m growing a VMFS3 volume.  It will ultimately be blown away and recreated as VMFS5.  But if you are doing this, you will see warnings, as the wizard indicates, if you try to grow it beyond 2TB.  Click NEXT.


Here we can see the existing 840GB VMFS as well as the new Free Space of 491GB.  Click NEXT.


Choose the block size, if prompted – again, this is something you won’t see on a VMFS5 datastore.  Click NEXT and then FINISH.
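The increase step itself can also be driven through the vSphere API.  The snippet below is a hedged sketch rather than anything from the original post: it assumes the new free space sits on the datastore’s existing extent, and it reuses a host object obtained the same way as in the rescan sketch above.  The datastore name matches the EQVMFS1 volume used in this example.

```python
# Expand a VMFS datastore into free space on its existing extent - roughly
# what the "Increase..." wizard does. Assumes 'host' is a vim.HostSystem
# obtained as in the rescan sketch above.
def expand_datastore(host, ds_name):
    ds_system = host.configManager.datastoreSystem
    ds = next(d for d in host.datastore if d.name == ds_name)
    options = ds_system.QueryVmfsDatastoreExpandOptions(datastore=ds)
    if not options:
        print(ds_name, "has no free space to expand into")
        return
    ds_system.ExpandVmfsDatastore(datastore=ds, spec=options[0].spec)
    print(ds_name, "expanded")

# expand_datastore(host, "EQVMFS1")
```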

9) As this is a clustered volume, once complete, it will automatically trigger a rescan on all the remaining cluster hosts to pick up the change:


You don’t have to do anything for this to happen. 

And that’s really about it.  You have now expanded the RAID group on the Equallogic, and added the space to an existing volume.  Some caveats of course to mention at this point:

  • Changing RAID types will likely alter your data protection and performance expectations.  Be sure you have planned for this.
  • As noted before, once you go RAID6 you can’t go anywhere from there without an offload and complete rebuild of the array.
  • If you’d hit the wall and got back ~10%, this is your breathing room.  You should be evaluating space reclamation tactics, new arrays, etc.  This only gets you out of today’s jam.

Categories: Dell, Equallogic, ISCSI, Storage, vSphere

Got 10GbE working in the lab – first good results

October 2, 2014

I’ve done a couple of posts recently on some IBM RackSwitch G8124 10GbE switches I’ve picked up.  While I have a few more posts to come covering the settings I finally got working and how I figured them out, I’ve had some requests from a few people as to how well it’s all working.  So here’s a very quick summary of where I’m at and some results…

What is configured:

  • 4x ESXi hosts running ESXi v5.5 U2 on a Dell C6100 4 node
  • Each node uses the Dell X53DF dual 10GbE Mezzanine cards (with mounting dremeled in, thanks to a DCS case)
  • 2x IBM RackSwitch G8124 10GbE switches
  • 1x Dell R510 running Windows 2012 R2 and StarWind SAN v8, with both an SSD+HDD VOL and a 20GB RAMDisk-based VOL, using a BCM57810 2pt 10GbE NIC

Results:

IOMeter was run against the RAMDisk VOL, configured with 4 workers, 64 threads each, 4K 50% Read/50% Write, 100% Random.

The StarWind side shows about 32,000 IOPS.
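For context – this is my own arithmetic, not something from the original screenshots – 32,000 IOPS at a 4K block size is only on the order of 1Gbit/s of payload, so a small-block random test like this is IOPS-bound rather than bandwidth-bound:

```python
# Payload bandwidth implied by a 4K random result of ~32,000 IOPS.
iops, block_bytes = 32_000, 4096
mb_per_s = iops * block_bytes / 1e6          # ~131 MB/s
gbit_per_s = iops * block_bytes * 8 / 1e9    # ~1.05 Gbit/s
print(f"{mb_per_s:.0f} MB/s, {gbit_per_s:.2f} Gbit/s of payload")
```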

I also did an Atto Bench32 run; those numbers seem a little high.

I’ll post more details once I’ve had some sleep – I had to get something out while I was excited :)

Soon to come are some details on the switches – an ISCSI configuration with no LACP except for inter-switch traffic over the ISL/VLAG ports – as well as a “First time, Quick and Dirty Setup for StarWind v8”.  I needed something in the lab that could actually DO 10GbE, and had to use SSD and/or RAM to give it enough ‘go’ to see whether the 10GbE was working at all.

I wonder what these will look like with some PernixData FVP as well…

UPDATED – 6/10/2015 – I’ve been asked for photos of the work needed to Dremel in the 10GbE Mezz cards on the C6100 server – and have done so!  https://vnetwise.wordpress.com/2015/06/11/modifying-the-dell-c6100-for-10gbe-mezz-cards/

Upgrading NetApp SnapDrive for Windows

September 11, 2014

I was recently asked to perform SnapDrive upgrades against systems in preparation for a NetApp Data ONTAP upgrade.  As I’ve been through this a time or two before, I knew it wasn’t quite as simple as “upgrade” SnapDrive.  I figured I’d share some information that might help someone have a smooth transition when they do it as well. 

The first thing to do is check out the NetApp Interoperability Matrix Tool (IMT).  This will help you determine all of the supported and required versions. 

The next thing you’ll need to know is which versions are supported.  As of Sep 2014, you’ll also want to look at:

  • SnapDrive for Windows – v7.0.3
  • DSM MPIO – v4.1
  • ISCSI Host Utilities – v6.0.2

These are the versions you’ll need to support Windows 2012 R2 and ONTAP v8.2.x 7-Mode.

Next, you’ll need to perform an inventory against your systems to determine which versions of the above tools are installed – or whether they’re installed at all.  It’s not uncommon to find situations where SnapDrive is installed but the other tools may or may not be.  Likewise, the versions may not be what you’re expecting, or standardized.  Some of this will depend on who installed what, and when.  Also, you’re likely to run into some older systems, such as Windows 2003 or 2003 R2, that are running older versions and have no upgrade path, so you may need to come up with a migration or lifecycle strategy for some of your systems.
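Since there’s no official tool for this, here’s the sort of minimal per-host inventory sketch I’d run – it’s my own illustration, not a NetApp utility.  The product-name substrings are assumptions, and it also reads the disk TimeOutValue registry setting on the assumption that this is the ISCSI timeout referred to further down (190 seconds in my environment):

```python
# Minimal per-host inventory sketch: report NetApp tool versions from the
# Windows uninstall registry keys, plus the disk timeout value.
import winreg

UNINSTALL_PATHS = [
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall",
    r"SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Uninstall",
]
PRODUCTS = ["SnapDrive", "Data ONTAP DSM", "Host Utilities"]  # assumed names

def installed_products():
    found = {}
    for path in UNINSTALL_PATHS:
        try:
            root = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path)
        except OSError:
            continue
        for i in range(winreg.QueryInfoKey(root)[0]):
            try:
                sub = winreg.OpenKey(root, winreg.EnumKey(root, i))
                name, _ = winreg.QueryValueEx(sub, "DisplayName")
                version, _ = winreg.QueryValueEx(sub, "DisplayVersion")
            except OSError:
                continue
            if any(p.lower() in name.lower() for p in PRODUCTS):
                found[name] = version
    return found

def disk_timeout():
    # HKLM\SYSTEM\CurrentControlSet\Services\Disk -> TimeOutValue (seconds)
    try:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                             r"SYSTEM\CurrentControlSet\Services\Disk")
        return winreg.QueryValueEx(key, "TimeOutValue")[0]
    except OSError:
        return None

if __name__ == "__main__":
    for name, version in sorted(installed_products().items()):
        print(f"{name}: {version}")
    print(f"Disk TimeOutValue: {disk_timeout()} seconds")
```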

One of the most frustrating things I’ve found from these upgrades in the past is that NetApp doesn’t keep any record of what is installed.  That’s understandable for systems that don’t talk to a controller – perhaps they don’t any longer, they’re orphaned, they’ve been P2V’d, etc.  But these tools interact with a NetApp controller, so it would be GREAT if NetApp could be so kind as to log connections from systems and what version of the software is involved.  Something akin to WSUS, but at the controller level – even if it were just logged to the MESSAGES file.  Either way, that’s just dreaming :)

The above software is going to have some prerequisites.  These include:

  • .NET 3.5 is required – if you’re upgrading, it’s likely already installed; if you’re installing clean, it may not be present.
  • Various Windows HotFixes, as identified by the installation guides
  • Knowing what your ISCSI timeout settings need to be – my environment has been tested to prefer 190 seconds.

I’ve created silent installer batch files for each of the three application installations, plus a script to verify the installation.  They likely can’t be distributed via SCCM or similar tools at this point, but they’re pretty close.  Expect to see these posted shortly:

HOWTO- Silent Installation for NetApp Data ONTAP DSM for Windows MPIO v4.1

HOWTO- Silent Installation for NetApp Windows Host Utilities 6.0.2

HOWTO- Silent Installation for NetApp SnapDrive for Windows v7.0.3

HOWTO- Verify NetApp SnapDrive for Windows 7.0.3 Installations were successful

I’ve tested these against Windows 2008 R2 SP1 and 2012 thus far, with no issues.

With luck, I can help someone else’s upgrades go smoothly. 

For those that know me – this would be a GREAT place for an obligatory comment like “Ditch the LUN’s, VMDK’s don’t have this issue, and it’s time to stop living in a world with physical constructs, lack of portability, and vendor lock-in.”  I like NetApp products, but I like them better when they have VMware on top and the Windows systems don’t have a clue!

Categories: ISCSI, NetApp, SnapDrive, Storage

Does ALL of your data need to be on a SAN? Using DAS for uptime.

April 4, 2014

Are you still using SAN’s for everything? Should you? This is something I’ve talked about with a few of my peers while brainstorming ways to do things more efficiently. It doesn’t mean it’s right, best-practice, or good in all situations. But it’s worth throwing up for discussion and to make people think about design differently.

See, I like virtualization. I think the magic that comes from vMotion and svMotion and similar technology is pretty great. But it also comes with costs. Outside of licensing, the next biggest cost is usually storage. You never have enough; even with more money, it’s not always easy to add more, you may not be able to get the right storage, or it might not get you what you need. Many companies will do Tiers of storage – Gold/Silver/Bronze, etc. But it’s usually all kept on the SAN – maybe tiers on the same SAN/NAS (I’m just going to say SAN for the purposes of this discussion) or maybe the lower tier is on an older unit or BrandX or what-have-you. If the data NEEDS to be on the SAN, that’s fine. But the question to ask is DOES it need to be on the SAN at all?

I’ve been in a few environments and seen different options. In one, all hosts had local disks. If you were doing SAN maintenance, you offloaded all the data from the SAN to local disk with svMotion, and then did your SAN maintenance. This, to me, was mind-boggling – you pay for enterprise features and 5-9’s uptime and online live upgrades and then… run scared every time? This wasn’t for me. In my current environment, we perform quarterly outages where two of us get trusted to basically take apart or do whatever is required (pre-approved and peer-reviewed, of course) from 8PM to 8AM – provided we don’t break anything and, when we hand it back, it’s running. This doesn’t always result in full outages, but it allows for the possibility. While things like SAN or fabric maintenance won’t necessarily take anything down, this is also a good time to take care of VMware Tools, VMware hardware versions, Windows Updates, and anything else that couldn’t otherwise be done transparently outside of the quarterly window. This works very well, and I like the process. But of course, often the question will come up from somewhere: “well, is it a FULL outage?” or “Can SOME of the data stay up?”, etc.

The issue here is one of application resiliency. It’s not really a SAN, fabric, or vSphere HA/vMotion/DRS issue. The Windows systems may or may not have been architected in a way that makes failover transparent. It could be a lack of NLB of some sort, no clustering, or older applications – we all have them. But for those things that DO have application-based high availability – how can you leverage that? Exchange DAG’s, SQL Always On AG’s, Domain Controllers, Web Sites, File Shares that aren’t on NAS, etc. The ways I’ve come up with or heard include:

· Build two separate clusters, put one half of the application in each. This handles issues with the cluster itself, and anything internal to the application. But it’s quite likely these clusters are in the same rack – on the same power, network/storage networks, and SAN. So… you likely haven’t bought much if your SAN upgrade results in a panic.

· Use redundant network/storage networking and SAN’s. HAH. Yeah, that’s not in the budget. But it certainly would give you two silos, so you’d only do maintenance on one at a time, and if you knocked something over, you’d be okay.

· Don’t use a SAN, use DAS. Now your HA can be in a much smaller unit – perhaps on a 2U box with internal disks, on separate power. THIS is what I want to take a look at.

The obvious caveats that will be pointed out:

· “One box can’t possibly run my entire data center” – correct. But am I trying to keep “everything” up, or just “core services”? I probably don’t care if the mail archiving solution is up, if Exchange is still up to Outlook/OWA/ActiveSync users.

· “You’ll never get enough storage for everything” – correct. Again though, not trying to keep a duplicate of everything on an Enterprise Tier 1 SAN that we’ve consolidated onto.

· “You’ll never get the performance out of it – the IOPS will kill you” – probably. But like that mini-spare in your trunk, it’s just trying to get you to the service station, not take you coast to coast on a road trip. Same applies here – you just need to ensure that services ARE up. Also, we have options such as PCIe SSD and vFlash or other 3rd party solutions to leverage SSD caching/acceleration. If it’s good enough for modern/hybrid SAN’s…

· “But your network can still take you down” – yes, it could. An option is to connect this host to different switches and/or ports on your core/distribution/firewall, and have it function segregated. Design what will work for YOUR environment.

· “But you’ll lose the ‘SAN Magic’ like snapshots, deduplication, compression, etc” – you will. And I’m going to suggest you don’t care. You’d still have your “Primary Node” on SAN. THAT system gets all these benefits. Why pay twice for it, to be stored in the same location, and be in the same amount of risk of SAN failure, or Admin operator error, etc?

· “Without shared disk, you get no LUN’s, so you can’t do clustering” – correct, if what you want to do is clustering that requires shared disks. There are many modern applications that do not need this – I’ll talk about them later.

So why am I bringing this up? Well, from numbers I’ve crunched and seen from peers, “Enterprise SAN Storage” costs “about $6/GB”. You get benefits like thin provisioning, deduplication, compression, snapshots, and replication, so the effective cost likely comes down, but let’s assume that number to be true. Let’s also assume that your typical vSphere rack server for your environment is around $10K without disks, booting from SD, and that a similar box outfitted with 12x 3.5” 3TB NL-SAS 7200RPM disks is around $20K. You might do RAID6 or RAID50, with a hot spare or two, so we’ll assume you get 8 usable disks’ worth of capacity – roughly 23TB. Now we have some numbers to play with.

(You could also use a 25×2.5” chassis with 1.2TB 10K SAS if you needed speed over capacity, with an understandable cost jump – you’d still see about 24TB with RAID50 and 3×8 disk RAID5 stripes and a single hot spare – but 3100+ IOPS vs 650+.  Also, external disk shelves could be used.)

First, the difference between the $10K for the base host and the $20K for the host with disks is all you should factor in. We can reasonably say that without the disks, this would just be another cluster host running some of your cluster workload – including these “application HA” systems – so either way, we were buying the node. If we go with internal disks, some of the workload could be DAS-based, and some could even still be cluster-based, connected to the SAN. We take $10K and divide it by 23TB to get a cost per GB – about $0.43/GB. That same 23TB at $6/GB for SAN space would be about $141,312. I bet I have your attention now. Now that we know what it costs, we can start talking about ways you could use it…
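Here’s that back-of-the-envelope math as a quick sketch, using the same assumptions as above (8 usable 3TB disks called ~23TB, a $10K delta for the disks, and $6/GB for SAN capacity):

```python
# Rough cost comparison: internal DAS delta vs. the same capacity on a SAN.
usable_gb = 23 * 1024        # the ~23TB figure above, in GB
das_delta = 10_000           # $20K host with disks minus $10K host without
san_per_gb = 6.0             # the "about $6/GB" enterprise SAN figure

das_per_gb = das_delta / usable_gb   # ~$0.42-0.43/GB depending on rounding
san_total = usable_gb * san_per_gb   # ~$141,312

print(f"DAS cost per GB: ${das_per_gb:.2f}")
print(f"SAN cost for the same ~23TB: ${san_total:,.0f}")
```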

Some ideas on how this system could be used – either for survivability or just for reduced cost during normal operations:

· Secondary Domain Controller

· Exchange DAG Mailbox role holder

· Exchange HUB/CAS role holder

· “Mail Archive” – eg: Enterprise Vault, Exchange Enterprise Archiving. (this is one of the ‘non-outage’ examples where you would use this as “Tier X-1”)

· Print Server – maybe you just svMotion this over during maintenance, it’s likely quite small

· Windows 2012+ based DHCP Failover with load balancing

· SQL Servers that utilize SQL Server 2012 Enterprise AlwaysOn Availability Groups

· SQL Servers pre SQL Server 2012 that use Active/Passive mirroring with log shipping that you would manually sync and cutover to

· Windows File Shares that could be a DFS-R/DFS-N replica of what’s on the NAS (could be another example of a ‘non-outage’ use)

· Secondary RADIUS, PKI, NAC, NAP, WSUS, etc, servers

· IIS servers that sit behind some sort of load balancer

· Network Monitoring Services?

· Phone/Unified Communications/Lync/etc secondary node

As you can see, that’s a pretty decent list of services that you could likely say “These WILL be up during maintenance”. You could do their patching/updates/maintenance the weekend before or after the normal quarterly outage, and no one would know.

The critical piece here though, is ensuring that your applications are resilient. Too many times I’ve seen things like:

· Clusters that have non-cluster aware services/apps pointing at them, that don’t tolerate a cluster failover (which negates the benefit)

· NLB’s that _balance_ two or more hosts, but aren’t built for failure – only normal operation. So if a service stops serving the correct web page, and only shows a “404 not found” web page – well, it replied with a web page, so the service must be healthy! (not so good for the ‘user experience’)

· NLB’s or clusters where, for some reason, something was installed that only runs on one node. It’s not cluster or NLB aware, and there isn’t even a second copy of it to fire up manually. Suddenly, your ability to do rolling maintenance is gone.

Virtualization and modern datacenters bring a lot of ‘magic’ to the table. But it’s largely for “infrastructure”. If your applications and/or OS isn’t built to allow your clients seamless connectivity, all the redundant controllers, power, hosts, etc, in the world, won’t help you at all when an Admin reboots the wrong machine or a BSOD takes it out. You NEED to build for that.

Another use case for this “DAS Silo” is what people SAY is archiving, but isn’t. What I mean is where they take things from some share and move them into a folder on the same share called “ARCHIVE” – but it’s still in the same path. Or maybe it’s some sort of link to another path, but it’s on the same NAS. Or on another NAS, but it’s just as expensive as the first one! I’ve seen this with general user documents, and with Exchange mail archiving solutions. I’ve been in meetings where I’ve asked questions like “So can the archive go offline at all?” (no) “Can the archive be slower?” (no) “Can the archive be in another location?” (no) “Can the archive have the path name change at all?” (no) – which only lets me draw the conclusion that most users just want to append the word “-archive” to a folder/file name and pat themselves on the back. But that doesn’t help us, the data center administrators who need to manage it, and it’s why it ends up staying on the same Tier. Usually because there just isn’t another tier available, and adding one would be too complex and/or costly.

But what if you could actually MOVE that Archive data? Take your Exchange Archive or Enterprise Vault and put it on DAS? It’s still getting backed up. It’s still available. It’s just got maybe slightly less availability. But if someone’s really trying to tell you that you can’t take down historical mail, deleted user data, and archived data from 2-5+ years ago, for 2 hours in the middle of the night on a weekend… we need to recheck the users/business expectations and what they’re willing to pay and contribute to this magical world they believe in.

I know this solution won’t work in all cases. But it’s a different way to think about the problem and how to solve it. The above solution has the potential to be about $130K cheaper than “doing what we have always done”. Using the $10K hosts, and assuming you have as many as 5-7 in your cluster today (that already have licensing), you may just save enough on storage to replace every compute node and half your back-end networking, for ‘free’. If that’s not reasonable, another way to look at it is “for a cost of about $280/month over 3 years, I can ensure that during maintenance, we have x # of core systems still reachable at all times.” If you can’t sell $300/month so that the C-level can be sure their phones still get e-mail at 3AM, which makes them happy – you shouldn’t be selling. That’s almost in budget for my home lab… (if I had some reason to require said uptime)

You know that old adage “if you always do what you’ve always done, you’ll always get what you’ve always gotten”? Maybe it’s time to do something different…

For what it’s worth, I do this as much as possible in my home lab. A 4-node Dell C6100 is the cluster, and a single-node C2100 with 12x3TB runs everything that can or wants to be “off cluster”. Between this and DPM with DRS (which powers off most of the cluster when there’s no load), I can shut down everything but the firewall, move the C2100 onto the LAN/Trusted port, and still be 100% “Up” even if I rip apart the rest of the rack for a weekend – which tends to make upgrades a lot less stressful!

Ditch the LUN!

October 7, 2013

http://blogs.computerworld.com/cloud-storage/22895/designing-cloud-storage-ditch-lun-cfbdcw

I couldn’t be any happier to see this post. ESPECIALLY given who’s posting it. John Martin is a Principal Technologist for…. NetApp ANZ (Aus/New Zealand).

So many quotes:

“Unfortunately, the efficiencies brought by virtualization at the server layer weren’t matched by traditional storage approaches.”

“Now SANs are great, but they were never designed for virtualization”

“As a result, they’re dedicated, expensive, often over-engineered, and rarely flexible.”

In the end, he’s suggesting NFS – which is a good idea where you can use it. There are no LUN’s, and you can extend OR SHRINK it as needed, without extents. It is thin-provisioned by default (though design and monitor for that!). The only real downsides I typically find are that a) it’s not MPIO-capable, so if you have 2+ links in your SAN, plan for that and/or design for LACP end to end or something, and b) it often seems to be the red-headed stepchild even for VMware – usually VAAI and other features come to block before they come to NFS.

But basically – don’t use LUN’s.  Put your VM’s inside of a VM container.  Make them portable and storage agnostic.  Use the storage to replicate or snapshot your VM volume, but not individual disks on your VM’s, and your life will be a LOT easier in the long run!  It’s just too hard to be nimble and migrate your VM’s to Private Cloud, your cloud, Public cloud, DR to another site with different hardware or a different generation, merge with another company, etc, if you’re hung up on what type of block object you’re going to wrap your disks in – usually because your DBA thinks he’ll get better performance in a silo rather than by aggregating and using the total of the resources.  It’s not the 20th century…

Looking forward to his next post, but I can imagine what it reads like.

Categories: NetApp, Storage, VMware, vSphere