RAID 10 your EBS data

When I spoke at Percona Live (video here) on running an e-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes. At first, this seems counterintuitive. Amazon has a robust infrastructure, and EBS volumes run on RAIDed hardware and are replicated within their availability zone. So, why bother? Today, I was reminded of just how important it is. Please note that all my performance statistics are based on direct experience running a MySQL database on an m2.4xlarge instance and not on some random bonnie or orion benchmark. I have those graphs floating around on my hard drive in glorious 3D and, while interesting, they do not necessarily reflect real-life performance.

Why? Part 1. Performance

Let’s get to the point. EBS is cool and very flexible, but nominal performance is poor and highly variable, with average latencies (svctm in iostat) in the 2-10ms range. At its heart, EBS is Network Attached Storage and shares bandwidth with your instance NIC. At best, I see 1.5ms svctm and 10ms await, and at worst…well, at worst you don’t need ms precision to measure it. On top of that, a single EBS volume seems to peak at around 100-150 iops, which is about what one would expect from a single SATA drive. That’s fine if you’re running a low-traffic website with very little disk activity, but once the requests start to come in, things get a little squirrelly. Add in multi-tenancy and a noisy neighbor can really beat your disk into submission.

So, what’s a lowly Systems Engineer to do when the iowait time starts to pile up? Well, it turns out that those IOPs are initially bound by the disk on the backend and not local NIC traffic, so you can use Linux Software RAID to significantly improve the I/O capacity of your disk (but not the latency or variability…more on this later). For a performance boost, there is a lot of bad advice on the Internet saying you should RAID 0 your disk (because “it’s redundant on the back end”), but to the discriminating SysEng, that should scream bad idea.

Why? Part 2. Redundancy

Right, so EBS is RAIDed and replicated within its availability zone on the back end, so why do I need to worry about redundancy? That’s great and all, but with the EBS cool factor comes additional complexity and new and unexpected failure modes. The first and most obvious was #ec2pocalypse, otherwise known as the Great Reddit Fail of 2011. If you’re not aware of what happened (and the details are somewhat irrelevant), a couple months back someone pressed the wrong button at Amazon and a significant percentage of EBS volumes became “stuck,” showing 100% utilization and no iops. This failure lasted several days and took out a large number of websites that based their infrastructure on EBS. Most of the data itself was recovered, but a small percentage of people were SOL. So much for redundancy.

Enter RAID10. Yes, it’s slower than RAID0 because you have to write twice. Yes, you are bound by the worst performing disk in the array. But, you can get nearly 1:1 increase in IOPs (up to a point) and gain the ability to recover your data when Amazon drops the ball.

You need proof? “Give me an example,” you say? Let’s talk about what happened to me today. Everything was just peachy all day – performance was within parameters and then at 3:15PM, all of a sudden, the database started having random query pile-ups. Being in EC2, this was not unexpected, but it kept happening. Traffic was on a decline, but we were expecting big traffic in an hour or so. So, I started looking at the disk. We have a 10-drive RAID10 array on our master DB and 1 of those disks was showing svctm in the 30-100ms range, vs 2-10ms on all the others. BINGO!

I didn’t save the actual iostat output, but sar showed this:

03:15:01 PM DEV       tps avgqu-sz  await svctm %util
03:35:01 PM dev8-133 7.78     0.11  13.49  2.28  1.77
03:35:01 PM dev8-130 6.54     0.09  14.14  2.27  1.48
03:35:01 PM dev8-149 8.34     0.11  12.62  2.08  1.74
03:35:01 PM dev8-132 7.67     0.10  13.29  1.98  1.52
03:35:01 PM dev8-131 8.66     0.11  12.27  1.91  1.65
03:35:01 PM dev8-147 7.13     0.10  13.77  2.13  1.52
03:35:01 PM dev8-129 7.58     0.08  10.56  1.73  1.31
03:35:01 PM dev8-148 8.47     4.30 506.96 54.77 46.36
03:35:01 PM dev8-146 8.17     0.08   9.28  1.38  1.13
03:35:01 PM dev8-145 6.70     0.26  39.36  6.87  4.60

dev8-148 sure looks fishy, eh? (Oh, side note…to align this data all pretty-like, I used the aptly named align, a great tool from the Aspersa Toolkit)
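Spotting the odd disk out by eye works for ten devices, but it is easy to script. Here is a minimal sketch that filters sar-style lines for devices whose svctm is above a limit. It assumes the trimmed column layout shown above (real `sar -d` output has more columns, so the field numbers would need adjusting), and both the function name `slow_devices` and the 20ms default are my own arbitrary choices, not a recommendation.

```shell
# Print "device svctm" for every device whose svctm exceeds a limit (in ms).
# Assumes the trimmed layout above:
#   time AM/PM DEV tps avgqu-sz await svctm %util
# slow_devices and the 20 ms default are hypothetical, for illustration only.
slow_devices() {
  threshold=${1:-20}
  awk -v limit="$threshold" '
    # Skip the header (DEV) and keep only rows whose 7th field is too high.
    $3 ~ /^dev/ && ($7 + 0) > limit { print $3, $7 }
  '
}

# Typical use, after trimming sar output to the columns above:
#   slow_devices 20 < trimmed_sar.txt
```

Run against the table above with a 20ms limit, only dev8-148 should come back.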

Had this been a single EBS volume or a RAID0 array, we would have been forced to perform a database failover to a secondary master and redirect the application, which would have interrupted sales briefly during an active time. Thanks to RAID10, we had options. Instead of a failover during a period of relatively high traffic, we simply failed out the problem drive. We were then running on 9 drives with reduced redundancy, but performance immediately recovered and the stalls stopped. We can replace the drive later and resync the array when traffic is low.
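For reference, failing out a member and later adding a replacement looks roughly like the sketch below. It only prints the mdadm commands rather than running them, and the array and device names (/dev/md0, /dev/sdh3, /dev/sdh9) are placeholders I made up, not the ones from the incident above. The `failout_plan` helper is hypothetical too.

```shell
# Print (do not run) the mdadm commands to pull one member out of an array
# and, later, add a replacement volume. All names passed in are placeholders.
# Usage: failout_plan ARRAY BAD_MEMBER REPLACEMENT
failout_plan() {
  # Mark the slow member as faulty so md stops sending it I/O.
  echo "mdadm $1 --fail $2"
  # Detach the faulty member from the array.
  echo "mdadm $1 --remove $2"
  # Later, when traffic is low: add a fresh volume, which starts a resync.
  echo "mdadm $1 --add $3"
}

# Example:
#   failout_plan /dev/md0 /dev/sdh3 /dev/sdh9
```

Printing first and pasting second is a deliberate choice here: on a production master, it is worth reading the exact device names one more time before anything gets failed.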

How?

First, you need to create and attach “a bunch” of volumes to your instance. How many? I’ve seen diminishing returns after 8-10 disks, but your mileage (and instance size) may vary. Typical RAID10 rules apply here…you need 2x the total capacity and each disk has to equal 2*(capacity)/(num disks), so if you need 1TB usable and want to use 8 disks, you will need each disk to be 256GB.
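If you would rather not do that arithmetic in your head, here is a small sketch of the sizing rule. The function name `raid10_disk_size_gb` is hypothetical, and it assumes the conventional even-disk layout and whole-GB volume sizes, rounding up so the array never comes out smaller than requested.

```shell
# Per-volume size (GB) for a RAID 10 array with the requested usable capacity.
# Usage: raid10_disk_size_gb USABLE_GB NUM_DISKS
# (hypothetical helper; assumes an even number of equally sized disks)
raid10_disk_size_gb() {
  usable_gb=$1
  disks=$2
  if [ "$disks" -lt 2 ] || [ $(( disks % 2 )) -ne 0 ]; then
    echo "expected an even number of disks (2 or more)" >&2
    return 1
  fi
  # Mirroring halves usable space, so raw capacity is 2x; round up per disk.
  echo $(( (2 * usable_gb + disks - 1) / disks ))
}

raid10_disk_size_gb 1024 8   # 1TB usable on 8 disks: prints 256
```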

Here’s some code to do that. It creates 8x256GB volumes in the us-east-1a zone and then attaches them to instance i-1a2b3c4d:

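# Create eight 256GB volumes and save the API output
for x in {1..8}; do \
  ec2-create-volume --size 256 --zone us-east-1a; \
done > /tmp/vols.txt

# Attach each new volume (the ID is the second field) as /dev/sdh1 through /dev/sdh8
(i=0; \
for vol in $(awk '{print $2}' /tmp/vols.txt); do \
  i=$(( i + 1 )); \
  ec2-attach-volume $vol -i i-1a2b3c4d -d /dev/sdh${i}; \
done)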

Then, you need to install Linux Software RAID. On Debian or Ubuntu:
apt-get install mdadm

Then, create a new RAID 10 (-l10) volume from 8 disks (-n8):
mdadm --create -l10 -n8 /dev/md0 /dev/sdh*

With any luck, you’ll get a message saying that the array was started. You can verify this by looking at /proc/mdstat, where you should see something like this (the numbers in this example are probably off; I pulled them together from some random machines):

cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdh6[5] sdh5[4] sdh4[3] sdh3[2] sdh2[1] sdh1[0]
      1048575872 blocks 64K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  resync = 13.3% (431292736/3221225280) finish=7721.9min speed=6021K/sec

Your disk will spend a lot of time and IOPs resyncing, but you can format /dev/md0 and mount it right away.
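Since the resync can take hours, it helps to be able to pull the progress number out of /proc/mdstat for a monitoring check or a quick look. This is a rough sketch that assumes the progress line looks like the sample above ("resync = 13.3%"); the function name `mdstat_resync_pct` is my own invention.

```shell
# Print the resync (or recovery) percentage from /proc/mdstat-style text on stdin.
# Prints nothing when no rebuild is in progress.
# (hypothetical helper; assumes the "resync = NN.N%" format shown above)
mdstat_resync_pct() {
  sed -E -n 's/.*(resync|recovery) *= *([0-9.]+)%.*/\2/p'
}

# Typical use on the RAID host:
#   mdstat_resync_pct < /proc/mdstat
```

The same pattern also matches the "recovery" line that md shows after a replacement disk is added, so one check covers both cases.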

This wasn’t meant as a complete guide to Linux Software RAID – if you want to know more, check out The Software-RAID HOWTO.

The Bad

Ok, so the observant among you will realize that by having 8 or 10 disks in the array, all with the potential to have severe performance degradation like this, I have drastically increased the variability of latency. Well, you would be right, but…

  1. I can’t get IOPs any other way in EC2
  2. It is easy to recover from the most common failure mode with this setup
  3. If you care about your data at all, RAID0 (or no RAID) is doing it wrong

Remember, kids…Friends don’t let friends RAID0.

Percona Live NYC 2011

A couple weeks back, I had the good fortune of co-speaking at Percona Live NYC 2011 with Mark Uhrmacher, CTO of ideeli, on the subject of running an e-commerce site with MySQL in the cloud. Interestingly, and a sign of the times, this was also the first time that I had ever met Mark in person, despite having worked for him for close to a year, since I telecommute.

Here are the slides from that talk.

What I wanted attendees to get out of our talk was that you have to expect and plan for all sorts of failure situations when your database is in the cloud. Relative to conventional hosting or datacenters, things in the cloud break more frequently and in ways that are out of your control. AWS gives you the tools to plan and recover from these failures much more easily than having to put redundant physical servers in multiple geographic locations, but they also fail more often.

So, here are a few take-aways mentioned in the slides:

  • RAID 1 or RAID 10 (1+0) your EBS volumes

    Yes, EBS volumes are redundant on the back end, in a data center controlled by Amazon. However, the great EBS outage of 2011 (#ec2pocalypse) proves that you cannot entrust your data to a single technology that is out of your control. Had we RAID0’d our data set, we would have been in much worse shape, because we would have to completely rebuild many of our data sets from backup. So, no, you should not RAID0 (which should rightfully be called AID0, since the R is a fallacy). Yes, you take a performance hit, and you have to deal with lowest-common-denominator performance of the EBS volumes, but the ability to remove a failed or poorly performing EBS volume without losing your data more than makes up for that compromise. With 10 EBS volumes in a RAID 10 configuration, we max out at around 1200-1500 iops. Poor performance relative to physical hardware, but it is manageable.

    If you care about your data, never ever use RAID0. You might as well just point it at /dev/null, which as we know is webscale. Friends don’t let friends RAID0.

  • Make sure your important data lives in multiple availability zones and multiple regions.

    During #ec2pocalypse, several instances were able to be recovered by simply pointing the application at data that already existed in another zone.

  • Don’t cross availability zones and regions between your ultimate master and your disaster recovery node.

    If you do (and we were bitten by this), you may end up with out-of-date disaster recovery nodes if your distribution slave is in an affected availability zone. Keep replication chains short and all in one zone/region, except for the DR node, which should be somewhere outside of the master’s zone/region.

  • AWS snapshot backups are awesome. But they don’t help if the API is down. Make sure your data lives in multiple places where you can get at it in an emergency.

Also, I’d just like to say that Percona Live was a great conference. There were some incredibly informative talks. My favorite, by far, was Baron Schwartz’s discussion on using tcpdump to analyze server performance and predict scalability. I was honored to speak in front of a crowd where the average person in the room knows far more about MySQL than I do.

Get Hulu working on Boxee (again)

<rant>In a move that defies any reason, the short-sighted bonehead executives at Hulu (or perhaps NBC, but really…who cares?) decided that they don’t want advertising dollars from the thousands of Boxee and Boxee Box users, and would prefer that people simply pirate their media instead, since it is higher quality, easier to get, and has no advertisements. Hey, guys at Hulu…wake up. It’s not 2000 anymore.</rant>

Anyhow, a very smart fellow over at the Boxee Forums figured out how to work around the issue with a little bit of javascript…

Disclaimer: This might make your computer explode, your network implode, and format your nodes. I’m not responsible, nor is jzongker over on the Boxee Forums.

Simply save the following code as hulu.js (download link) and put it in the following location:

Mac        /Applications/Boxee.app/Contents/Resources/Boxee/system/players/flashplayer/hulu.js
Linux      [Boxeepath]/system/players/flashplayer/hulu.js
Windows    probably [Boxeepath]\system\players\flashplayer\hulu.js in Program Files
Boxee Box  Apparently this technique does not work

boxee.browserWidth=1280;
boxee.browserHeight=720;
boxee.earlyTimers = true;
boxee.enableLog(true);

boxee.onInit = function() {
   browser.setConfigChar('general.useragent.override','Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/540.0 (KHTML, like Gecko) Ubuntu/10.10 Chrome/9.1.0.0 Safari/540.0');
}

if (boxee.getVersion() < 5)
   boxee.renderBrowser = true;

boxee.parseBoxeeTags = false;
boxee.autoChoosePlayer = false;

var current    = 0;
var h_width    = 720;
var h_bottom   = 23;
var started    = false;
var active     = false;
var duration   = false;
var is_paused  = false;
var alt_player = false;

boxee.onBack = function()  { boxee.onEnter(); }
boxee.onLeft = function()  { boxee.onEnter(); }
boxee.onRight = function() { boxee.onEnter(); }
boxee.onUp = function()    { boxee.onEnter(); }
boxee.onDown = function()  { boxee.onEnter(); }

wmodeFix = setInterval(function() {
   boxee.getWidgets().forEach(function(widget) {
      zorder_id = widget.getAttribute("id");
      if (zorder_id == 'banner_c')
         browser.execute('document.getElementById("'+zorder_id+'").style.zIndex = 99999;');
   });
}, 500);

boxee.onDocumentLoaded = function() {
   boxee.setMode(1);
   boxee.showNotification("[B]Press Enter to view full screen[/B]", ".", 500);
}

boxee.onEnter = function()
{
   boxee.setMode(0);

   if (boxee.getVersion() < 5)
      browser.execute('window.scrollTo(0,50);');

   clearInterval(wmodeFix);
   boxee.showNotification("[B]Switching to full screen...[/B]", ".", 2);
   playerTimer = setInterval(function(){
      if (!active) locatePlayer();
      else updateProgress();
   }, 1000)
}

function playerReference()
{
   id = boxee.getActiveWidget().getAttribute('id');
   if (id.length > 0)
      return 'document.'+id+'.';

   else if (alt_player != false)
      return alt_player;

   else
   {
      var locateMe = "(function(){objects=document.getElementsByTagName('embed'); for (var i in objects) { if (objects[i].getAttribute('src') == '"+boxee.getActiveWidget().getAttribute('src')+"') return i; }})()";
      locateMe = browser.execute(locateMe);
      if (locateMe > 0)
      {
         alt_player = 'document.getElementsByTagName("embed")['+locateMe+'].';
         return alt_player;
      }
      else
         return 'document.player.';
   }
}

function updateProgress()
{
   if (!duration)
      duration = parseInt(browser.execute(playerReference()+'getDuration()')) / 1000;

   if (duration)
      boxee.setDuration(duration);

   current = parseInt(browser.execute(playerReference()+'getCurrentTime()')) / 1000;
   if (isNaN(current))
      alt_player = false;

   if (current > 0 && !started)
      started = true;

   progress = current / duration * 100;
   alert(progress);
   boxee.notifyCurrentTime(current);
   boxee.notifyCurrentProgress(progress);

   if (started && progress > 99.9)
      boxee.notifyPlaybackEnded();
}

function locatePlayer()
{
   boxee.getWidgets().forEach(function(widget) {
      flashvars = widget.getAttribute("flashvars");
      if (flashvars.indexOf('hulu.com/watch') != -1 && flashvars.indexOf('bitrate=') != -1 && !active) {
         active = true;
         boxee.renderBrowser = false;
         var crop = (widget.width - h_width) / 2;
         widget.setCrop(crop, 0, crop, h_bottom);
         boxee.notifyConfigChange(widget.width-(crop*2),widget.height-h_bottom);
         widget.setActive(true);
      }
   });

   if (active)
   {
      boxee.setCanPause(true);
      boxee.setCanSkip(true);
      boxee.setCanSetVolume(true);
   }

   return active;
}

boxee.onPause = function()
{
   is_paused = true;
   browser.execute(playerReference() + 'pauseVideo()')
}

boxee.onPlay = function()
{
   is_paused = false;
   browser.execute(playerReference() + 'resumeVideo()')
}

boxee.onSkip = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current + 60) : (current + 120);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBigSkip = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current + 180) : (current + 360);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBack = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current - 60) : (current - 120);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBigBack = function ()
{
   if (is_paused) return;
   update = (duration < 3000) ? (current - 180) : (current - 360);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onSetVolume = function(volume)
{
   browser.execute(playerReference() + 'setVolume('+volume/100+')');
}

Kilroy Was Here (Part 2)

There’s not much to say about this shot, except that I think it’s pretty cute. We moved and our new place has a passthrough between the kitchen and living room. My 16-month-old boy has become quite the climber and likes to stand on the back of the couch and watch us as we are preparing meals. I took this shot to complement my previous shot of him in his crib from last December, Kilroy Was Here. This time, it was candid with no additional lighting. I used a Canon 50mm f/1.8 lens at f/2.0 and ISO 800 on aperture priority mode. Because I was inside and he is backlit, I pumped up the exposure compensation to +2/3.

I didn’t manipulate the photo very much aside from some white balance, contrast adjustments, and sharpening. Here is the original shot:

Kilroy II (Unedited)


Amazon EC2 Micro Instances (t1.micro)

Amazon recently announced a new instance type – “micro instances.” They are wicked cheap ($54 + $0.007/hr for a 1-year reserved instance + $0.10/GB per month storage) and finally make Amazon accessible to the non-business user with a few low-traffic websites. For a typical Ubuntu 10.04 LTS (Lucid) installation with a 15GB root partition, that is only $133.32 a year for your very own server in the cloud! I’ve been with Dreamhost for a couple years because they are inexpensive and allow shell access and “unlimited” storage*. However, as a professional Systems Engineer, I’ve been wanting to move to something that allowed me to “own” my server. There are many VPS (Virtual Private Server) providers out there, including Dreamhost and Linode (arguably the king of Linux VPS), but they never excited me very much. I’ll be honest and admit that I didn’t spend any time performing a detailed cost and feature analysis between the leading VPS providers, though. My day job is working with a couple hundred EC2 instances complete with dynamic spinup and spindown for capacity, so EC2 is a comfortable environment for me. I’ve been wanting to move into EC2 for a while, but could never justify the cost of an m1.small. Last week, I dove in and moved all of my hosting over to a t1.micro (t for tiny?) instance.
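For the curious, here is how that $133.32 breaks down, as a small sketch you can rerun with your own numbers. The function name `micro_yearly_cost` is made up, it assumes the instance runs 24x365, and it uses the prices quoted above, which are simply the ones from this post and will not stay current.

```shell
# Yearly cost of a reserved instance plus EBS storage.
# Usage: micro_yearly_cost UPFRONT HOURLY STORAGE_GB PRICE_PER_GB_MONTH
# (hypothetical helper; assumes the instance runs all 8760 hours of the year)
micro_yearly_cost() {
  awk -v upfront="$1" -v hourly="$2" -v gb="$3" -v gb_month="$4" \
    'BEGIN { printf "%.2f\n", upfront + hourly * 24 * 365 + gb * gb_month * 12 }'
}

# $54 up front + $0.007/hr + 15GB at $0.10/GB per month
micro_yearly_cost 54 0.007 15 0.10   # prints 133.32
```

That is $54 up front, $61.32 in hourly charges, and $18 in storage.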

Here is what Amazon has to say about the new Micro Instances (t1.micro):

“Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

  • Micro Instance 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform”

Amazon has a good deal of information in their FAQ and a very detailed view of usage models in their User Guide.

After a few days with this new instance type, I’ve noticed CPU time is very limited. CPU bursts can only be very brief and it appears that you are penalized when you exceed your quota. I run a zenphoto gallery that brought my instance to a crawl when trying to batch resize a bunch of images with ImageMagick. It was so bad that PHP was unable to return simple pages before the 60 second FastCGI timeout on the nginx process. However, with appropriate caching strategies, these machines are more than capable of running a low traffic website. Using Apache Bench, I was able to get 1000 rpm out of the front page of this blog. That’s with the entire application stack residing on a single machine! I will elaborate more on my configuration in a future blog post.

There are a couple catches with this instance type. Storage is only EBS, which means you have to pay $0.10/GB per month above the cost of the instance time.  Also, like all hosting within Amazon, the individual instances are completely unreliable. You need to make sure that you can recreate your nodes from scratch at any point. For me this means documentation, automation, monitoring, backups, and most of all keeping everything important on a separate EBS volume so it can be moved around easily in the event of an instance failure. Even though the root partition of t1.micro instances is EBS, it is a lot easier to move data around if you don’t have to terminate the old instance before bringing up a new one.


* That’s unlimited for web use – not for backups.  They noticed my 300GB of photo backups and very politely asked me to move them to a backup account and even allowed me to keep the data there for a week while I migrated it.
