Analyzing HTTP traffic with tcpdump and Percona’s pt-tcp-model

I recently ran into an issue where our request throughput was showing very erratic and spikey behavior despite very smooth response times from the application servers. Using Splunk, we analyzed every log that we had: nginx, haproxy, apache, and the application logs themselves and we were seeing similarly spikey throughput. Because those tools all log upon request completion, there was no way to determine from the logs themselves whether it was one tier of the stack in particular that was delaying request arrival, or if it the spikes were endemic to the traffic we were receiving.

So, we decided to perform some analysis of the raw tcp data on the edge server using a couple of tools. First, was tcpdump, which is a tool that should be in every SysAdmin’s arsenal.

First, grab all the traffic on the interface and write it to a pcap formatted file:

# tcpdump -c 200000 -w output.pcap -i any

This command will capture 200k packets from any interface and write them to output.pcap, which can be later analyzed with a variety of tools, including tcpdump and wireshark.

All we care about is the actual packet count and only for “real” packets (no SYN/ACKs) on port 80. Extract this data from the capture we just made:

# tcpdump -r output.pcap -s 384 -i any -nnq -tttt \
      'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)< <2))
     - ((tcp[12]&0xf0)>>2)) != 0)' > port80.txt

(I stole this command from the pt-tcp-model documentation and honestly have not dived into the details of how the part after ‘tcp port 80′ actually works).

Producing some data that looks like this (IPs hidden to protect the innocent):

...
2011-10-10 12:49:02.662951 IP 175.x.x.x.ppppp > 10.x.x.x.x: tcp 37
2011-10-10 12:49:02.662958 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 1380
2011-10-10 12:49:02.662963 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 80
2011-10-10 12:49:02.662965 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 463
2011-10-10 12:49:02.662968 IP 206.x.x.x.ppppp > 10.x.x.x.x: tcp 516
...

With a little bash, we can aggregate this data per second (I’m sure that there is a much more concise way of doing this, but it gets the job done):

# cut -c 12-21 port80.txt  |awk '{print $1}' |  sort | uniq -c  | awk '{print $2 " " $1}' > packets_per_sec.txt

The output looks like this…packets per second grouped by second:

12:49:02 912
12:49:03 2617
12:49:04 2277
12:49:05 1994
12:49:06 2120
12:49:07 2192

Next, it’s some simple gnuplot magic to chart it all out. Here’s the plot file:

set title "TCP Port 80 packets/sec"
set terminal aqua enhanced title "TCP Port 80 - packets/sec"
set xdata time
set xlabel "Time (EST)"
set timefmt "%H:%M:%S"
set format x "%H:%M:%S"
set ylabel "per second"
set datafile separator " "

set style line 1 linecolor rgb "#000000" lw 1

plot 'packets.txt' using 1:2 title "packets" with line ls 1

set terminal png font "/Library/Fonts/Arial.ttf"
set output "packets_count.png"

replot

Run that with

# gnuplot file.plot

If you are on a Mac, AquaTerm will probably pop up and show you the graph. If not, you can open the packets_count.png file. What I got, looked like this:

Packets/Sec

Packets/Sec

Ugly, eh? I’ve pointed out some problem areas with arrows. The rate of packet arrivals is incredibly variable – much more so than I would expect when the website is receiving 10s of thousands of requests per minute. At such rough granularity, I would expect to see a much smoother line.

That’s great and all…clearly there is a problem, but packets != requests. I want to know how this affects the end user.

Enter pt-tcp-model from the Percona Toolkit (formerly Maatkit by Baron Schwartz, Chief Performance Architect of Percona and author of High Performance MySQL).

In short, this tool will take data from tcpdump and convert the data in the packet headers into time sliced buckets with the number of request arrivals, completions, and other summary data. That can then be charted with gnuplot (or Excel, if you are so inclined) to get some pretty interesting results. To better understand the tool, I recommend you read the documentation and watch Baron’s presentation about Measuring Scalability and Performance with TCP.

First, following the directions verbatim, extract the data into requests and their response times, and slice that into 1 second intervals. One thing to note is that if your source data (port80.txt) contains more than about 300k lines, the tool starts to bog down a bit, so I’d recommend trying to work with smaller samples.

# pt-tcp-model port80.txt > requests.txt
# sort -n -k1,1 requests.txt > sorted.txt
# pt-tcp-model --type=requests --run-time=11 sorted.txt > sliced.txt

Now, you have sliced.txt which looks something like this:

1318265342 18.49   578.542   195   171 0.337054 6.232731 9.455878 0.190278 0.443786 0.337054
1318265343 27.51   527.000   527   526 1.000000 27.511997 27.842869 0.348546 0.583430 1.000000
1318265344 20.75   504.000   504   509 1.000000 20.748874 26.378252 1.166846 0.766312 1.000000
1318265345 23.96   461.000   461   462 1.000000 23.963005 32.181679 3.929070 0.679943 1.000000
1318265346 23.60   438.000   438   423 1.000000 23.595860 26.154968 0.421166 0.939690 1.000000

The columns are in the documentation, but in this case, I’m mostly interested in graphing time vs the number of complete requests arriving (columns 1 and 4).

Here’s some gnuplot for that:

set title "TCP Port 80 - arrivals/sec"
set terminal aqua enhanced title "TCP Port 80 - arrivals/sec"
set xdata time
set xlabel "Time (UTC)"
set timefmt "%s"
set format x "%H:%M:%S"
set ylabel "per second"
set datafile separator " "
set style line 1 linecolor rgb "#000000"

plot 'sliced.txt' using 1:4 title "arrivals" with line ls 1

set terminal png font "/Library/Fonts/Arial.ttf"
set output "lblockups.png"

replot

And here is the chart that I ended up with:

Arrivals/sec

Arrivals/sec

Now…that…is…ugly. When I dig into the raw data, the jagged packet arrival rate is frequently causing 1-2 second delays in the arrival rate of individual requests to our edge server. That is before we even get to nginx, so our app server had no hope. What this means, from a performance standpoint, is that the application stack has to be able to accommodate the huge influx of traffic after the lull. This presents a scalability nightmare, especially for an e-commerce website heading into the holiday season.

Time to call the data center…

Share

RAID 10 your EBS data

When I spoke at Percona Live (video here) on running an E-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes. At first, this seems counter intuitive. Amazon has a robust infrastructure, EBS volumes run on RAIDed hardware, and are mirrored in multiple availability zones. So, why bother? Today, I was reminded of just how important it is. Please note that all my performance statistics are based on direct experience running a MySQL database on a m2.4xlarge instance and not on some random bonnie or orion benchmark. I have those graphs floating around on my hard drive in glorious 3D and, while interesting, they do not necessarily reflect real-life performance.

Why? Part 1. Performance

Let’s get to the point. EBS is cool and very very flexible, but nominal performance is poor and highly variable with average latencies (svctime in iostat) in the 2-10ms range . At its heart, EBS is Network Attached Storage and shares bandwidth with your instance NIC. At best, I see 1.5ms svctime and 10ms await, and at worst…well, at worst you don’t need ms precision to measure it. On top of that, a single EBS volume seems to peak out at around 100-150 iops, which is about what one would expect from a single SATA drive. That’s fine if you’re running a low-traffic website with very little disk activity, but once the requests start to come in, things get a little squirrelly. Add in multi-tenancy and a noisy neighbor can really beat your disk into submission.

So, what’s a lowly Systems Engineer to do when the iowait time starts to pile up? Well, it turns out that those IOPs are initially bound by the disk on the backend and not local NIC traffic, so you can use Linux Software RAID to significantly improve the I/O capacity of your disk (but not the latency or variability…more on this later). For a performance boost, there is a lot of bad advice on the Internet saying you should RAID 0 your disk (because “it’s redundant on the back end”), but to the the discriminating SysEng, that should scream bad idea.

Why? Part 2. Redundancy

Right, so EBS is RAIDed and mirrored in multiple availability zones on the back end, so why do I need to worry about redundancy? That’s great and all, but with the EBS cool factor comes additional complexity and new and unexpected failure modes. The first and most obvious was #ec2pocalypse, otherwise known as the Great Reddit Fail of 2011. If you’re not aware of what happened (and the details are somewhat irrelevant), but a couple months back someone pressed the wrong button at Amazon and a significant percentage of EBS volumes became “stuck” showing 100% utilization and no iops. This failure lasted several days and took out a large number of websites that based their infrastructure on EBS. Most of the data itself was recovered, but a small percentage of people were SOL. So much for redundancy.

Enter RAID10. Yes, it’s slower than RAID0 because you have to write twice. Yes, you are bound by the worst performing disk in the array. But, you can get nearly 1:1 increase in IOPs (up to a point) and gain the ability to recover your data when Amazon drops the ball.

You need proof? “Give me an example,” you say? Let’s talk about what happened to me today. Everything was just peachy all day – performance was within parameters and then at 3:15PM, all of a sudden the database started having random query pile ups. Being in EC2, this was not unexpected, but it kept happening. Traffic was on a decline, but we were expecting big traffic in an hour or so. So, I started looking at the disk. We have a 10-drive RAID10 array on our master DB and 1 of those disks was showing svctime in the 30-100ms range, vs 2-10ms on all the others. BINGO!

I didn’t save the actual iostat output, but sar showed this:

03:15:01 PM DEV       tps avgqu-sz  await svctm %util
03:35:01 PM dev8-133 7.78     0.11  13.49  2.28  1.77
03:35:01 PM dev8-130 6.54     0.09  14.14  2.27  1.48
03:35:01 PM dev8-149 8.34     0.11  12.62  2.08  1.74
03:35:01 PM dev8-132 7.67     0.10  13.29  1.98  1.52
03:35:01 PM dev8-131 8.66     0.11  12.27  1.91  1.65
03:35:01 PM dev8-147 7.13     0.10  13.77  2.13  1.52
03:35:01 PM dev8-129 7.58     0.08  10.56  1.73  1.31
03:35:01 PM dev8-148 8.47     4.30 506.96 54.77 46.36
03:35:01 PM dev8-146 8.17     0.08   9.28  1.38  1.13
03:35:01 PM dev8-145 6.70     0.26  39.36  6.87  4.60

dev8-148 sure looks fishy, eh? (Oh, side note…to align this data all pretty-like, I used the aptly named align, a great tool from the Aspersa Toolkit)

Had this been a single volume EBS or RAID0 volume, we would have been forced to perform a database failover to a secondary master and redirect the application, which would have interrupted sales briefly during an active time. Instead, thanks to RAID10, we have options. Instead of a failover during a period of relatively high traffic, we simply failed out the problem drive. Now we were running on 9 drives and with reduced redundancy, but performance immediately recovered and the stalls stopped. We can replace the drive later and resync the array when traffic is low.

How?

First, you need to create and attach “a bunch” of volumes to your instance. How many? I’ve seen diminishing returns after 8-10 disks, but your mileage (and instance size) may vary. Typical RAID10 rules apply here…you need 2x the total capacity and each disk has to equal 2*(capacity)/(num disks), so if you need 1TB usable and want to use 8 disks, you will need each disk to be 256GB.

Here’s some code to do that. It creates 8x256GB volumes in the us-east-1a zone and then attaches them to instance i-1a2b3c4d

for x in {1..8); do \
  ec2-create-volume --size 256 --zone us-east-1a; \
done > /tmp/vols.txt

(i=0; \
for vol in $(awk '{print $2}' /tmp/vols.txt); do \
  i=$(( i + 1 )); \
  ec2-attach-volume $vol -i i-1a2b3c4d -d /dev/sdh${i}; \
done)

Then, you need to install Linux Software RAID. On Debian or Ubuntu:
apt-get install mdadm

Then, create a new RAID 10 (-l10) volume from 8 disks (-n8):
mdadm --create -l10 -n8 /dev/md0 /dev/sdh*

With any luck, you’ll get a message saying that the array was started. You can verify this by looking at /proc/mdstat and you should see something like this (the numbers in this example are probably off. I pulled them together from some random machines)

cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdh6[5] sdh5[4] sdh4[3] sdh3[2] sdh2[1] sdh1[0]
      1048575872 blocks 64K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  resync = 13.3% (431292736/3221225280) finish=7721.9min speed=6021K/sec

Your disk will spend a lot of time and IOPs resyncing, but you can format /dev/md0 and mount it right away.

This wasn’t meant as a complete guide to Linux Software RAID – if you want to know more, check out The Software-RAID HOWTO.

The Bad

Ok, so the observant among you will realize that by having 8 or 10 disks in the array, all with the potential to have severe performance degradation like this, I have drastically increased the variability of latency. Well, you would be right, but…

  1. I can’t get IOPs any other way in EC2
  2. It is easy to recover from the most common failure mode with this setup
  3. If you care about your data at all, RAID0 (or no RAID) is doing it wrong

Remember, kids…Friends don’t let friends RAID0.

Share

Percona Live NYC 2011

A couple weeks back, I had the fortune of co-speaking at Percona Live NYC 2011 with Mark Uhrmacher, CTO of ideeli on the subject of running an E-commerce site with MySQL in the cloud. Interestingly, and a sign of the times, this was also the first time that I had ever met Mark, despite having worked for him for close to a year since I telecommute from home.

Here are the slides from that talk.

What I wanted attendees to get out of our talk was that you have to expect and plan for all sorts of failure situations when your database is in the cloud. Relative to conventional hosting or datacenters, things in the cloud break more frequently and in ways that are out of your control. AWS gives you the tools to plan and recover from these failures much more easily than having to put redundant physical servers in multiple geographic locations, but they also fail more often.

So, here are a few take-aways, mentioned in the slides

  • RAID 1 or RAID 10 (1+0) your EBS volumes

    Yes, EBS volumes are redundant on the back end, in a data center controlled by Amazon. However, the great EBS outage of 2011 (#ec2pocalypse) proves that you cannot entrust your data to a single technology that is out of your control. Had we RAID0′d our data set, we would have been in much worse shape, because we would have to completely rebuild many of our data sets from backup. So, no, you should not RAID0 (which should rightfully be called AID0, since the R is a fallacy). Yes, you take a performance hit, and you have to deal with lowest-common-denominator performance of the EBS volumes, but the ability to remove a failed or poorly performing EBS volume without losing your data more than makes up for that compromise. With 10 EBS volumes in a RAID 10 configuration, we max out at around 1200-1500 iops. Poor performance relative to physical hardware, but it is manageable.

    If you care about your data, never ever use RAID0. You might as well just point it at /dev/null, which as we know is webscale. Friends don’t let friends RAID0

  • Make sure your important data lives in multiple availability zones and multiple regions.

    During #ec2pocalypse, several instances were able to be recovered by simply pointing the application at data that already existed in another zone.

  • Don’t cross availability zones and regions between your ultimate master and your disaster recovery node.

    If so, (and we were bit by this), you may end up with out of date disaster recovery nodes if your distribution slave is in an affected availability zone. Keep replication chains short and all in one zone/region, except for the DR node, which should be somewhere outside of the master’s zone/region.

  • AWS snapshot backups are awesome. But they don’t help if the API is down. Make sure your data lives in multiple places where you can get at it in an emergency.

Also, I’d just like to say that Percona Live was a great conference. There were some incredibly informative talks. My favorite, by far, was Baron Schwartz’s discussion on using tcpdump to analyze server performance and predict scalability. I was honored to speak in front of a crowd where the average person in the room knows far more about MySQL than I do.

Share

Get Hulu working on Boxee (again)

<rant>In a move that defies any reason, the short-sighted bonehead executives at Hulu (or perhaps NBC, but really…who cares?) decided that they don’t want advertising dollars from the thousands of Boxee and Boxee Box users, and instead, would prefer that people simply pirate their media instead since it is higher quality, easier to get, and has no advertisements. Hey, guys at Hulu…wake up. It’s not 2000 anymore.</rant>

Anyhow, a very smart fellow over at the Boxee Forums figured out how to work around the issue with a little bit of javascript…

Disclaimer: This might make your computer explode, your network implode, and format your nodes. I’m not responsible, nor is jzongker over on the Boxee Forums.

Simply save the following code as hulu.js (download link) and put it in the following location:

Mac /Applications/Boxee.app/Contents/Resources/ Boxee/system/players/flashplayer/hulu.js
Linux [Boxeepath]/system/players/flashplayer/hulu.js
Windows probably [Boxeepath]\system\players\flashplayer\hulu.js in Program Files
Boxee Box Apparently this technique does not work

boxee.browserWidth=1280;
boxee.browserHeight=720;
boxee.earlyTimers = true;
boxee.enableLog(true);

boxee.onInit = function() {
   browser.setConfigChar('general.useragent.override','Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/540.0 (KHTML, like Gecko) Ubuntu/10.10 Chrome/9.1.0.0 Safari/540.0');
}

if (boxee.getVersion() < 5)
   boxee.renderBrowser = true;

boxee.parseBoxeeTags = false;
boxee.autoChoosePlayer = false;

var current    = 0;
var h_width    = 720;
var h_bottom   = 23;
var started    = false;
var active     = false;
var duration   = false;
var is_paused  = false;
var alt_player = false;

boxee.onBack = function()  { boxee.onEnter(); }
boxee.onLeft = function()  { boxee.onEnter(); }
boxee.onRight = function() { boxee.onEnter(); }
boxee.onUp = function()    { boxee.onEnter(); }
boxee.onDown = function()  { boxee.onEnter(); }

wmodeFix = setInterval(function() {
   boxee.getWidgets().forEach(function(widget) {
      zorder_id = widget.getAttribute("id");
      if (zorder_id == 'banner_c')
         browser.execute('document.getElementById("'+zorder_id+'").style.zIndex = 99999;');
   });
}, 500);

boxee.onDocumentLoaded = function() {
   boxee.setMode(1);
   boxee.showNotification("[B]Press Enter to view full screen[/B]", ".", 500);
}

boxee.onEnter = function()
{
   boxee.setMode(0);

   if (boxee.getVersion() < 5)       browser.execute('window.scrollTo(0,50);');    clearInterval(wmodeFix);    boxee.showNotification("[B]Switching to full screen...[/B]", ".", 2);    playerTimer = setInterval(function(){       if (!active) locatePlayer();       else updateProgress();    }, 1000) } function playerReference() {    id = boxee.getActiveWidget().getAttribute('id');    if (id.length > 0)
      return 'document.'+id+'.';

   else if (alt_player != false)
      return alt_player;

   else
   {
      var locateMe = "(function(){objects=document.getElementsByTagName('embed'); for (var i in objects) { if (objects[i].getAttribute('src') == '"+boxee.getActiveWidget().getAttribute('src')+"') return i; }})()";
      locateMe = browser.execute(locateMe);
      if (locateMe > 0)
      {
         alt_player = 'document.getElementsByTagName("embed")['+locateMe+'].';
         return alt_player;
      }
      else
         return 'document.player.';
   }
}

function updateProgress()
{
   if (!duration)
      duration = parseInt(browser.execute(playerReference()+'getDuration()')) / 1000;

   if (duration)
      boxee.setDuration(duration);

   current = parseInt(browser.execute(playerReference()+'getCurrentTime()')) / 1000;
   if (isNaN(current))
      alt_player = false;

   if (current > 0 && !started)
      started = true;

   progress = current / duration * 100;
   alert(progress);
   boxee.notifyCurrentTime(current);
   boxee.notifyCurrentProgress(progress);

   if (started && progress > 99.9)
      boxee.notifyPlaybackEnded();
}

function locatePlayer()
{
   boxee.getWidgets().forEach(function(widget) {
      flashvars = widget.getAttribute("flashvars");
      if (flashvars.indexOf('hulu.com/watch') != -1 && flashvars.indexOf('bitrate=') != -1 && !active) {
         active = true;
         boxee.renderBrowser = false;
         var crop = (widget.width - h_width) / 2;
         widget.setCrop(crop, 0, crop, h_bottom);
         boxee.notifyConfigChange(widget.width-(crop*2),widget.height-h_bottom);
         widget.setActive(true);
      }
   });

   if (active)
   {
      boxee.setCanPause(true);
      boxee.setCanSkip(true);
      boxee.setCanSetVolume(true);
   }

   return active;
}

boxee.onPause = function()
{
   is_paused = true;
   browser.execute(playerReference() + 'pauseVideo()')
}

boxee.onPlay = function()
{
   is_paused = false;
   browser.execute(playerReference() + 'resumeVideo()')
}

boxee.onSkip = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current + 60) : (current + 120);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBigSkip = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current + 180) : (current + 360);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBack = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current - 60) : (current - 120);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBigBack = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current - 180) : (current - 360);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onSetVolume = function(volume)
{
   browser.execute(playerReference() + 'setVolume('+volume/100+')');
}
Share

Don’t reboot your t1.micro [EC2 epic fail]

If you have a t1.micro running an image of Ubuntu 10.04 LTS (Lucid Lynx), don’t reboot it. When I first wrote about t1.micros a few days ago, I forgot to mention that the first instance that I brought up failed, quite catastrophically, upon reboot. I didn’t actually think much of it at the time because I wasn’t that far into configuring the machine. But then, yesterday, Alestic released this note referencing this bug report saying that there is a bug where t1.micro instances running Lucid won’t come back up after a restart and that the bug has been fixed. It’s short, so I’ll let you read it, but basically the cloud-init package was broken and didn’t properly expose the ephemeral0 device causing reboots to fail. Alestic says that all you need to do is do an apt-get update && apt-get upgrade and you’re golden.

Let me tell you first hand…that doesn’t work. This morning, feeling brave, I decided to test the theory out. I was running a t1.micro instance using the old Canonical Ubuntu AMI ami-1634de7f on which I performed an apt-get update and an apt-get upgrade. I saw that the cloud-init package was upgraded, as expected. I initiated a restart and my machine never came back. I initiated a reboot request with ec2-reboot-instances and no dice. Finally, I stopped the instance and then started it with ec2-stop-instances and ec2-start-instances and I still didn’t have any luck. If I were smart, I would have done this with a test instance first, but I was feeling brave and decided I should test my configuration documentation out anyhow. Mostly, I just wanted to make sure that, if my instance was unable to reboot, it did so at a moment when I had the time and ambition to fix it instead of failing at some inopportune time.

Because everything is EBS backed, using an elastic IP, and my documentation is decent, I was able to detach the volumes from the old instance, attach them to the new instance, and get everything running in less than 30 minutes. At some point when I’m feeling very ambitious, I intend to put all the configuration in Puppet to mostly automate the process of migrating to a new instance type, but I’m not quite there yet.

If you have a t1.micro instance running Lucid, my recommendation is to spin up a new instance with the most recent AMI (the most current AMI ID is available at Alestic) and move everything over instead of bothering to perform the apt-get upgrade, which clearly did not work in my case.

Share