PowerPath VE Versus Round Robin on VMAX – Round 1


This past year, I did an exhaustive analysis of potential candidates to replace an aging HP EVA storage infrastructure.  After narrowing down the choices based on several factors, the one with the best VMware integration, along with mainframe support, was the EMC Symmetrix VMAX.

One of the best things about choosing the VMAX, in my mind, was PowerPath.  Whether PowerPath provides real benefits can be argued, but most people I have talked to in the real world swear it is brilliant.  And let’s face it, it HAS to be brilliant to justify the cost per socket.  Before tallying up all my sockets and asking someone to write a check, I needed to do my own due diligence.  There aren’t many comprehensive PowerPath VE vs. Round Robin comparisons out there, so I needed to create my own.

My assumption was that I’d see a slight performance edge with PowerPath VE, but not enough to justify the cost.  Part of this prejudice comes from hearing other storage guys say there’s no need for vendor-specific SATPs/PSPs since VMware’s NMP is so good these days.  Here’s hoping there’s no massive check to write!  By the way, if you prefer to skip the beautiful full-color screenshots, go ahead and scroll down to the scorecard for the results.
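
For reference, if you want to see which SATP/PSP combination is claiming your VMAX devices, or make Round Robin the default policy for the Symmetrix SATP before running your own tests, something along these lines should do it on ESXi 5 (the naa ID below is just a placeholder, not one of my devices):

    # Show the SATP and PSP currently claiming a given device
    esxcli storage nmp device list -d naa.60000970000192601234533030303533

    # Make Round Robin the default PSP for everything claimed by the Symmetrix SATP
    esxcli storage nmp satp set --satp VMW_SATP_SYMM --default-psp VMW_PSP_RR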


Tale of the Tape

My test setup was as follows:

Test Setup for PowerPath vs. Round Robin
2 – HP DL380 G6 dual-socket servers
2 – HP-branded QLogic 4 Gbps HBAs in each server
2 – FC connections to a Cisco 9148, then direct to the VMAX
VMware ESXi 5 loaded on both servers
All tests were run on 15K FC disks, with no other activity on the array or hosts

Let’s Get It On!

(I’m sure there’s a royalty I will have to pay for saying that)

Host 1 has PowerPath VE 5.7 b173, and host 2 has Round Robin with the defaults.  Each HBA has paths to 2 directors on 2 engines.  I used IOmeter from a Windows 2008 VM with fairly standard test configurations.  Results are from ESXTOP captures at 2-second intervals.
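
For anyone who wants to grab the same kind of data, esxtop batch mode will dump samples to a CSV you can dig through later; something like this should be close to what I did (the sample count and output path are just examples):

    # Batch mode: 2-second samples, 900 iterations (roughly 30 minutes), everything to CSV
    esxtop -b -d 2 -n 900 > /tmp/pp_vs_rr_esxtop.csv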

The first test I ran was 4K, 100% read, 0% random.  All of these tests use 32 outstanding I/Os unless otherwise specified.

Here is Round Robin

And PowerPath VE

The first thing I noticed was that Round Robin looks exactly like I imagined it would.  Not that that means anything.  I do realize this test could have been faster on RR with the IOPS setting at 1, and maybe I’ll do that in Round 2.  As for Round 1, with more than twice the IOPS, PowerPath is earning its license fee here for sure.
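
If you want to try the IOPS=1 tweak yourself before I get around to Round 2, it’s a per-device Round Robin setting; something like this should work (the device ID is a placeholder for one of your own LUNs):

    # Switch paths after every I/O instead of the default 1000 I/Os
    esxcli storage nmp psp roundrobin deviceconfig set --device=naa.60000970000192601234533030303533 --type=iops --iops=1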

How about writes?  Here’s 4k 100% write 0% random.

Round Robin

PowerPath

Once again, PowerPath VE shows nearly 2x the IOPS and data transfer rates.  I’m starting to see a pattern emerge. 😉

How about larger blocks? 32K 100% read 0% random.

Round Robin

PowerPath

PowerPath is really pulling ahead here with over 2x the IOPS yet again.

32K 100% write 0% random

Round Robin

PowerPath

Wow!  PowerPath is killing it on writes!  Maybe PP has some super-secret password to unlock some extra oomph from VMAX’s cache.  😉

Nevertheless, it’s obvious that PP is beating up on the default Round Robin here, so let’s throw something tougher at them.

Here’s 4K, 50% read, 25% random with 4 outstanding I/Os.

PowerPath

The gap between the contenders closes a bit with this workload, at only a 24% improvement for PP.  But as we all know, IOPS don’t tell the entire story.  What about latency?

4k 100% write 0% random

Round Robin

PowerPath

Write latency is 138% higher with Round Robin!  That’s a pretty big gap.  Is it meaningful?  That depends on your workload, I guess.

Scorecard after Round 1


So far, PowerPath looks like a necessity for folks running EMC arrays.  I’m not sure how it would fare on other arrays, but it really shines on the VMAX.  In some of my tests the IOPS with PowerPath were three times greater than with the standard Round Robin configuration!  I do believe the gap will shrink if I drop the IOPS setting to 1, but I doubt it will shrink to anywhere near even.  We will see.
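
If I do re-run with IOPS=1 in Round 2, it will need to be set on every Symmetrix device on the host, not just one.  Something like the loop below should take care of that; consider it an untested sketch and double-check that it only touches the devices you intend:

    # Set the Round Robin IOPS limit to 1 on every device claimed by the Symmetrix SATP
    for dev in $(esxcli storage nmp device list | grep '^naa.'); do
       if esxcli storage nmp device list -d $dev | grep -q 'VMW_SATP_SYMM'; then
          esxcli storage nmp psp roundrobin deviceconfig set --device=$dev --type=iops --iops=1
       fi
    done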

In addition to the throughput and latency testing, I also did some failover tests.  I’m going to save that for a later round.  I don’t want this post to get too long.

Comments
  1. Great post. I’ve been digging for this kind of test result for a while, but eventually gave up. We’re looking at a VMAXe, but without the hardware, it’s impossible for us to get the numbers ourselves.

    What’s the config of your VMAX?

    1. Thanks Tom.  This particular VMAX is a 2-engine with a mix of EFD, 15K FC, and SATA.

      The tests were run on a thin device, pre-allocated with IOmeter test data and bound to a pool of around 100 15K FC spindles with around 400 TDATs in RAID 7+1.

  2. Nice post, but I am curious as to why the results taper off once you start to add randomness to the workload. Can we see what a more real-world VMware workload would look like, i.e. at least 70% random? 

    1. Well, for one, my CPU usage spikes to near peak on the random test I ran, so I would imagine the results are constrained more by that than by multipathing. I could probably add some cores and try a heavier load. Maybe in the next round. Thanks for the idea!

    2. Probably disk I/O bound. Less random = easier for the cache to handle; once it gets more random, it comes down to raw disk throughput (EFD / 1xK / SAS / SATA).

  3. These numbers look very poor to me, only 42 MB/s for sequential write I/O that goes straight to cache? I got better performance (120 MB/s) from an oldish midrange NetApp 3160 with just 20 disks!

    1. This box was just powered on a week ago.  Nothing has been done as far as performance tuning.  I threw some devices in a pool and started these tests so I could give an answer on a PP PO.

      Haven’t really tried to max out throughput or anything yet.  So you pushed a 20-spindle NetApp to 30,000 IOPS?  Impressive.

      1. Well, yes, because they were sequential writes and NetApp is optimized for writes.

        But even with little caching on the array, you should get 100 disks × 175 IOPS = 17,500 IOPS from the worst random workload; that’s why I said there is something fishy with the results.

  4. At the risk of apples-to-oranges-to-pears comparisons… Looking back at some recent data whilst troubleshooting for a customer, an AIX box using native MPIO writing to an XP 24000 (HDS OEM), I’m seeing the attached.  Average write times of less than a second.  What I don’t understand is why it would take the VMAX 4.75 ms to ack a write with round robin.  That’s a long code path, and that write time is approaching the write time of writing to raw disk, I would think.  Finally, why is the PP write time so much longer than native MPIO in AIX?  Heh…

    1. It looks like we’re using PerfMon here for measurement, so the timer starts when the I/O enters the host SCSI queue (I assume) rather than when it enters the array.  Given that the queue depth has been configured as 32 in some tests, the queue time on the host would be a large chunk of that response time.  If you got the stats from the array, I’m sure the response times would be a lot lower.  Perhaps the scenario you’re talking about with AIX is measuring the response time from a different point in the stack (..?)

      1. Using filemon… it’s at the file level, and of course the image/screenshot didn’t upload.  You’d think, all things being equal, the write times for both would be equivalent.  Yet the non-PowerPath tests take quite a bit longer.  The VMAX has large caches on the back end, so what’s the deal?  Is Enginuity tuned for PowerPath?  Does it allow 256 or 1,024 pending writes for PowerPath before it forces destaging, while non-PowerPath only allows 2, 4, or 8 pending writes before destaging?  I’d run the tests and jack the pending I/O up and down to see where the tipping point is.  There are a number of ways to run tests to reverse engineer what is going on, or at least arrive at better conjecture  B^)

    2. Good questions.  I do know the latency is higher because I have 32 outstanding requests on most of the tests.  If I take it down to 1, latency is well below 1 ms.  Maybe a veteran EMC’er can chime in here.
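
      To put some rough numbers on it, this is basically just Little’s Law: average response time at the host is roughly outstanding I/Os divided by IOPS.  Taking the 4.75 ms you mentioned as an example,

          32 outstanding I/Os ÷ 0.00475 s ≈ 6,700 IOPS

      so a device showing 4.75 ms at a queue depth of 32 is still pushing thousands of I/Os per second, and most of that response time is queue time on the host rather than the VMAX actually taking that long to ack a write.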


  5. Brandon, If you are planning another set of tests, I’d like to see you use an environment which contains more than 1 LUN, and more than 1 VM per LUN. This would be a more real-life scenario.

  6. Really, I mean the closest workload you tested to an “actual” environment was the one test with 25% random activity.  That was only 24% better than native RR.  To me, that is not worth the cost.  In fact, most of my VMs will have 50% random reads; linearly, that means only 12% better than native RR.

  7. The reason it’s 42 MB/s is probably due to the fact that a single VM can’t push any more than that.  In order to test the true speed of the VMAX, you would need multiple VMs and IOmeter with multiple workers running.  I don’t think the test was meant to prove throughput; it was simply to compare NMP vs. PP/VE.
    @ b. riley GREAT blog post!!!

  8. Brandon,
    Nice write-up and great feedback from a lot of people.  Benchmarks are always tricky, as there are a million different scenarios one could test.  Thanks for posting!
    Scott
