ESXTOP Replay and VM-Support Output, AKA ‘Pain Train’

@vTrooper from the field here:

In an effort to help one of my customers troubleshoot some performance issues within his ESX farm, I asked for a run of vm-support with some special parameters. I expected I could pull a few minutes of esxtop data and replay it locally. Why didn’t I just capture the same thing with esxtop directly? Well, I wanted to get some of the specifics of the farm at the same time. I thought I could capture both the configuration and the performance data in a simple request. Seems reasonable, right?

Well, the delivery of the vm-support file yielded a legacy of discovery. Read on if you wish to avoid the ‘Pain Train’.

Roadblock 1 – OMG the files are HUGE!

The command was simple and can be reviewed from the ‘man’ page of the ESX system.

vm-support -s -i 10 -d 600

This runs the usual vm-support collection and also captures performance statistics at 10-second intervals for 10 minutes (600 seconds). The support file takes longer to generate, but you get the sample of data you need. Just SCP the resulting archive off the host and you are good to go to the next step.

Copying 305MB……. WTF?!! It was supposed to be 60 samples. How did that happen? How did I get a 300MB file compressed?? Checking the ‘man’ page again, I realized I forgot to exclude the core dumps and other log files. I should have used:

vm-support -n -s -i 10 -d 600

to exclude the core dumps from the vm-support payload. Across the 4 ESX systems the outputs were 300MB, 65MB, 154MB, and 91MB. I’m still not sure why, but let’s dive into the output and see what is there. I unzip the payload on my local machine to see what I pulled into the boat…. -- OUT OF SPACE -- Wait, what? The 300MB monster expanded to 1.3GB on disk. Well, poo. I’ll go get coffee while I move some files around. OK, there. Got the thing expanded. Now I should be able to find my ESXTOP output. I browse through the file structure of the payload and find that the ESXTOP data is captured in VSI output files. The only way to see inside those files is to run ESXTOP in replay mode. My Windows workstation isn’t ESX, so I have to try it another way. Sheesh.
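Had I checked first, the 300MB-to-1.3GB surprise would not have been one. Here is a minimal sketch of that check in Python, meant for the workstation side rather than the ESX console. It assumes the vm-support bundle is a tar archive (gzip-compressed or not); the function names and the 10% headroom are my own choices for illustration, not anything vm-support provides.

```python
import os
import shutil
import tarfile


def extracted_size_bytes(archive_path):
    """Sum the uncompressed sizes of the regular files in a tar archive."""
    # "r:*" lets tarfile detect the compression (gzip, bzip2, none) on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        return sum(m.size for m in archive.getmembers() if m.isfile())


def fits_on_disk(archive_path, target_dir, headroom=1.1):
    """Return True if the extracted archive, plus headroom, fits in target_dir.

    headroom is a hypothetical safety factor (10% by default) to allow for
    filesystem overhead; tune it to taste.
    """
    needed = extracted_size_bytes(archive_path) * headroom
    free = shutil.disk_usage(target_dir).free
    return needed <= free
```

Calling fits_on_disk with the bundle path and the directory you plan to extract into tells you whether to go make room before you start, instead of halfway through.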

Roadblock 2 –  Wait!  Where the Hell Am I?!

Fine. I’ll scp the files over to the ESX 4.0 system running locally in my VMware Workstation 7.1 instance. This should be a piece of cake…. -- OUT OF SPACE -- Really? Again? Oh yeah; I created a small instance locally so I could jump into the CLI/console and run a few commands if necessary. I don’t really run ongoing VMs here. No horsepower on the laptop, and no space, apparently. All my space was in the VMFS volume, not in the /root directory of the ESX console. Guess I’ll just load the payload on the VMFS area. Better make the VMFS datastore larger while I’m at it. OK, there. Whoo hoo! Fully expanded vm-support file. NOW I’m ready! Let’s run the replay command of ESXTOP:

esxtop -R /vmfs/volumes/<datastore>/<vmsupportextractdir>

INCORRECT VERSION OF ESXTOP

Are you kidding me? ESXTOP can’t replay output captured by a different version of itself? Unreal. I better go check what version the customer has on their ESX system; I have the configs in the vm-support output, after all. Let’s see, /etc/vmware-release should tell me….

VMware ESX 4.0 (Kandinsky)

Well, is that the original version of ESX 4, Update 1, or Update 2? And where was I with my local version? Let’s check another place: /proc/version

Linux version 2.6.18-164.ESX (mts@pa-lin-bld530.eng.vmware.com) (gcc version 4.1.2) #1 Thu Mar 11 07:09:06 PST 2010 [ESX Service Console build 240614]

Wait.  That doesn’t match what is on the console screen when I first login to the ESX node.

Let’s try another place: /etc/vmware/ft-vmk-version:

product-version = 4.0.0  ft-version = 208167

OK. I’m on Update 1 (208167) and the customer is on Update 2 (261974). More coffee while I get that downloaded….. -- OUT OF SPACE -- ? See Roadblock 1.
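If you have to compare versions across several hosts, squinting at files gets old. The sketch below parses the ft-version number out of the text shown above and compares two hosts. It is only an illustration: the two build numbers in the table are the ones from this post, the function names are made up, and treating ft-version as the ESX build number is an assumption based on what I saw here.

```python
import re

# Build numbers taken from this post only; any other build is reported as unknown.
KNOWN_ESX40_BUILDS = {
    208167: "ESX 4.0 Update 1",
    261974: "ESX 4.0 Update 2",
}


def parse_ft_version(text):
    """Extract the number from 'ft-version = NNNNNN', or return None."""
    match = re.search(r"ft-version\s*=\s*(\d+)", text)
    return int(match.group(1)) if match else None


def describe_build(build):
    """Map a build number to a release name, if it is one we know about."""
    return KNOWN_ESX40_BUILDS.get(build, "unknown build %s" % build)


def same_build(local_text, customer_text):
    """True only when both texts contain the same, parseable build number."""
    local = parse_ft_version(local_text)
    return local is not None and local == parse_ft_version(customer_text)
```

Feed same_build the contents of /etc/vmware/ft-vmk-version from your replay host and from the extracted vm-support tree; if it comes back False, update before you bother running the replay.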

I removed the big vm-support file now that I had moved it to the ESX host, and got the local host updated to the matching version. Let’s try the replay again.

esxtop -R /vmfs/volumes/<datastore>/<vmsupportextractdir>

SUCCESS!!  Now I’m cooking with Gas.   I’m able to get through the samples of the ESXTOP replay and see all the elements.

Roadblock 3 – The Phat Phinger!

Now I want to show the Customer the output and findings. All I should have to do is run the replay mode command and port that to batch mode and generate the .csv format of the values I care about. Right?

esxtop -R <vmsupportextractdir> | esxtop -b > foo.csv

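After running this command I watched the foo.csv file grow and grow… and GROW. It didn’t stop. I thought something was up and checked the file. I had the esxtop of my local ESX instance plugging data into the foo.csv file, not my customer’s vm-support supplied data. The pipe doesn’t feed the replay into the second esxtop; esxtop -b just samples the live local host and, with no iteration limit, keeps going until you stop it. Yup, I messed up.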

The correct command is this:

esxtop -R <vmsupportextractdir> -b > foo.csv

And if you want to parse out more specifics, you can run ESXTOP in interactive mode, set the parameters you care about, save the .esxtop4rc file, and then run it as such:

esxtop -R <vmsupportextractdir> -b -c .esxtop4rc > foo.csv

It seemed like a marathon but I still had to finish the job.  I SCP’d the file off the ESX host back to my workstation for the last part…The EPIC Line chart:

Roadblock 4 – DDoS of Data <<ChoKe!>>

Now I had a .csv of the captured data for each of my customer’s 4 ESX hosts. All I needed to do was get those .csv samples into esxplot to get some view of the data.

If you haven’t seen the esxplot tool, take a look at the VMware Labs site and pull it down.

You won’t regret the ease of use and the ability to jump through the data quickly. Excel limits the column count, and to my knowledge esxplot handles the large datasets better than Excel does.

How large are the datasets, you say? Well. Massive. Let me explain why:

Each of the 4 ESX hosts was told to take 60 samples of the default esxtop values. I didn’t tell the customer to limit those on capture, so they all went into the VSI cache files as the vm-support file was created. Each .csv file included 17,000 metrics over those 60 samples.

Yup. 4 million data points would be parsed into my graph if I didn’t pull the unnecessary stuff out. 17,000+ * 60 * 4 = 4,080,000+. Not bad for 10 minutes of work.

Let’s cut to the chase. I re-ran my extract from the vm-support files and cut the data down to the HBAs I was reviewing by limiting the output in the .esxtop4rc file. With those .csv files reduced to under 5000 metrics, it was much easier to parse through the data, find the active HBAs, and get my much desired graph of vmhba data.
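If you are already stuck with a giant .csv and don’t feel like re-running the replay, you can also trim it after the fact. This is not what I did (I limited the output with .esxtop4rc), just a rough Python sketch of the same idea: keep the first column, which I’m assuming is the timestamp, plus any column whose header mentions a keyword such as vmhba. The function name and the keyword default are mine.

```python
import csv


def filter_columns(src_path, dst_path, keyword="vmhba"):
    """Copy a CSV, keeping column 0 plus every column whose header contains keyword.

    Returns the number of columns written. Column 0 is assumed to be the
    timestamp column of the esxtop batch output.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to copy
            return 0
        keep = [0] + [i for i, name in enumerate(header) if i > 0 and keyword in name]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            # Guard against short rows, e.g. a capture that was cut off mid-line.
            writer.writerow([row[i] for i in keep if i < len(row)])
        return len(keep)
```

Something like filter_columns("foo.csv", "foo_vmhba.csv") gives you a file small enough to open comfortably; the return value tells you how many columns survived the cut.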

Hard Knocks? – You Bet.  Now You Know This:

What else can I say, folks? Don’t try this at home. It probably took me a week to get through all these roadblocks for a simple set of data that would probably have been easier to get from the Virtual Center reporting tool. Having someone run vm-support and then tackling the output by hand is a severe waste of your time.

Tell them what they’ve learned, Bob:

  • You can run the vm-support command and capture performance output
  • You can run esxtop in replay mode (-R) and dump that data in batch mode (-b) out to a .csv
  • You can limit that batch output to a smaller set of metrics by pointing -c at a saved .esxtop4rc file
  • You can make pretty graphs of the 4+ Kazillion metrics that may even spit out some interesting data

And Finally:

I’m going to put this train back on some tracks and go find another adventure in the field.

@vTrooper out
