RAID 10 your EBS data

When I spoke at Percona Live (video here) on running an E-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes. At first, this seems counterintuitive. Amazon has a robust infrastructure, and EBS volumes run on RAIDed hardware and are mirrored in multiple availability zones. So, why bother? Today, I was reminded of just how important it is. Please note that all my performance statistics are based on direct experience running a MySQL database on an m2.4xlarge instance and not on some random bonnie or orion benchmark. I have those graphs floating around on my hard drive in glorious 3D and, while interesting, they do not necessarily reflect real-life performance.

Why? Part 1. Performance

Let’s get to the point. EBS is cool and very, very flexible, but nominal performance is poor and highly variable, with average latencies (svctm in iostat) in the 2-10ms range. At its heart, EBS is network-attached storage and shares bandwidth with your instance NIC. At best, I see 1.5ms svctm and 10ms await, and at worst… well, at worst you don’t need ms precision to measure it. On top of that, a single EBS volume seems to peak at around 100-150 IOPS, which is about what one would expect from a single SATA drive. That’s fine if you’re running a low-traffic website with very little disk activity, but once the requests start to come in, things get a little squirrelly. Add in multi-tenancy, and a noisy neighbor can really beat your disk into submission.

So, what’s a lowly Systems Engineer to do when the iowait time starts to pile up? Well, it turns out that those IOPS are initially bound by the disks on the back end and not by local NIC traffic, so you can use Linux software RAID to significantly improve the I/O capacity of your volume (but not the latency or variability… more on this later). For a performance boost, there is a lot of bad advice on the Internet saying you should RAID 0 your disks (because “it’s redundant on the back end”), but to the discriminating SysEng, that should scream bad idea.

Why? Part 2. Redundancy

Right, so EBS is RAIDed and mirrored in multiple availability zones on the back end, so why do I need to worry about redundancy? That’s great and all, but with the EBS cool factor comes additional complexity and new, unexpected failure modes. The first and most obvious was #ec2pocalypse, otherwise known as the Great Reddit Fail of 2011. If you’re not aware of what happened (the details are somewhat irrelevant): a couple of months back, someone pressed the wrong button at Amazon and a significant percentage of EBS volumes became “stuck,” showing 100% utilization and no IOPS. This failure lasted several days and took out a large number of websites that based their infrastructure on EBS. Most of the data itself was recovered, but a small percentage of people were SOL. So much for redundancy.

Enter RAID10. Yes, it’s slower than RAID0 because you have to write everything twice. Yes, you are bound by the worst-performing disk in the array. But you get a nearly 1:1 increase in IOPS (up to a point) and gain the ability to recover your data when Amazon drops the ball.

You need proof? “Give me an example,” you say? Let’s talk about what happened to me today. Everything was just peachy all day: performance was within parameters. Then, at 3:15 PM, the database suddenly started having random query pile-ups. Being in EC2, this was not unexpected, but it kept happening. Traffic was on the decline, but we were expecting big traffic in an hour or so. So, I started looking at the disks. We have a 10-drive RAID10 array on our master DB, and one of those disks was showing svctm in the 30-100ms range, vs. 2-10ms on all the others. BINGO!

I didn’t save the actual iostat output, but sar showed this:

03:15:01 PM DEV       tps avgqu-sz  await svctm %util
03:35:01 PM dev8-133 7.78     0.11  13.49  2.28  1.77
03:35:01 PM dev8-130 6.54     0.09  14.14  2.27  1.48
03:35:01 PM dev8-149 8.34     0.11  12.62  2.08  1.74
03:35:01 PM dev8-132 7.67     0.10  13.29  1.98  1.52
03:35:01 PM dev8-131 8.66     0.11  12.27  1.91  1.65
03:35:01 PM dev8-147 7.13     0.10  13.77  2.13  1.52
03:35:01 PM dev8-129 7.58     0.08  10.56  1.73  1.31
03:35:01 PM dev8-148 8.47     4.30 506.96 54.77 46.36
03:35:01 PM dev8-146 8.17     0.08   9.28  1.38  1.13
03:35:01 PM dev8-145 6.70     0.26  39.36  6.87  4.60

dev8-148 sure looks fishy, eh? (Oh, side note…to align this data all pretty-like, I used the aptly named align, a great tool from the Aspersa Toolkit)

Had this been a single EBS volume or a RAID0 array, we would have been forced to fail over the database to a secondary master and redirect the application, which would have briefly interrupted sales during an active period. Instead, thanks to RAID10, we had options. Rather than failing over during a period of relatively high traffic, we simply failed out the problem drive. We were then running on 9 drives with reduced redundancy, but performance immediately recovered and the stalls stopped. We can replace the drive and resync the array later, when traffic is low.
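
For reference, failing a member out of (and later back into) the array looks roughly like this; the device name is just an example:

mdadm /dev/md0 --fail /dev/sdh8 --remove /dev/sdh8   # kick out the misbehaving member
# later, with a replacement volume attached (or once the old one behaves again):
mdadm /dev/md0 --add /dev/sdh8                       # triggers a resync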

How?

First, you need to create and attach “a bunch” of volumes to your instance. How many? I’ve seen diminishing returns after 8-10 disks, but your mileage (and instance size) may vary. Typical RAID10 rules apply here… you need 2x the usable capacity in raw volumes, and each disk has to be 2*(usable capacity)/(number of disks). So, if you need 1TB usable and want to use 8 disks, each disk needs to be 256GB.

Here’s some code to do that. It creates 8x256GB volumes in the us-east-1a zone and then attaches them to instance i-1a2b3c4d:

for x in {1..8}; do \
  ec2-create-volume --size 256 --zone us-east-1a; \
done > /tmp/vols.txt
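
The create calls return before the volumes are actually ready, so you may want to wait until they all report as available before attaching them. A rough sketch (the exact ec2-describe-volumes output format may vary):

until [ "$(ec2-describe-volumes $(awk '{print $2}' /tmp/vols.txt) | grep -c available)" -eq 8 ]; do
  sleep 5
done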

(i=0; \
for vol in $(awk '{print $2}' /tmp/vols.txt); do \
  i=$(( i + 1 )); \
  ec2-attach-volume $vol -i i-1a2b3c4d -d /dev/sdh${i}; \
done)

Then, you need to install Linux Software RAID. On Debian or Ubuntu:
apt-get install mdadm

Then, create a new RAID 10 (-l10) volume from 8 disks (-n8):
mdadm --create -l10 -n8 /dev/md0 /dev/sdh*

With any luck, you’ll get a message saying that the array was started. You can verify this by looking at /proc/mdstat; you should see something like this (the numbers in this example are probably off; I pulled them together from some random machines):

cat /proc/mdstat
Personalities : [raid10] 
md0 : active raid10 sdh6[5] sdh5[4] sdh4[3] sdh3[2] sdh2[1] sdh1[0]
      1048575872 blocks 64K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  resync = 13.3% (431292736/3221225280) finish=7721.9min speed=6021K/sec

The array will spend a lot of time and IOPS resyncing, but you can format /dev/md0 and mount it right away.
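
For example, to put XFS on it (my filesystem of choice; see the comments below) and mount it, it’s the usual routine; the mount point here is arbitrary:

apt-get install xfsprogs
mkfs.xfs /dev/md0
mkdir -p /data
mount /dev/md0 /data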

This wasn’t meant as a complete guide to Linux Software RAID – if you want to know more, check out The Software-RAID HOWTO.

The Bad

OK, so the observant among you will realize that with 8 or 10 disks in the array, each with the potential for severe performance degradation like this, I have drastically increased the variability of my latency. Well, you would be right, but…

  1. I can’t get IOPS any other way in EC2
  2. It is easy to recover from the most common failure mode with this setup
  3. If you care about your data at all, RAID0 (or no RAID) is doing it wrong

Remember, kids…Friends don’t let friends RAID0.

29 Comments

  • Kobi biton says:

    What a great post! I am just in the middle of investigating an extra-large instance + EBS on EC2 for a product called Splunk, and I was shocked that all the posts I have been reading recommend RAID 0… I thought I had gone mad… :-)

    Can you share which filesystem you use? XFS/ext4? How do you back up your DB? Do you use EC2 snapshots? If so, how do you maintain consistency?

    Thanks!
    Kobi.

  • Aaron says:

    @Kobi – don’t believe those RAID0 proponents. IMO, that’s nuts and I have seen enough EBS problems to know better.

    I have used XFS pretty much everywhere for RAIDed EBS volumes, as it allows you to freeze the filesystem to do a snapshot. Alestic.com has a tool called ec2-consistent-snapshot that will freeze the filesystem, take a snapshot of all the volumes, and then thaw the filesystem, all in a period of a few seconds. If you can tolerate a daily “outage” of that sort, that is your best option. The process is not complicated, and if you have needs that differ from what ec2-consistent-snapshot provides, it is easy enough to implement your own version.

    Splunk is a great product. I would have to check my class notes, but I *think* that with Splunk, you would need to perform some extra steps and shut down the Splunk server first before the filesystem freeze in order to ensure internal consistency.

    For backing up a MySQL database, you can do one of a couple of things. If you can tolerate an InnoDB recovery period at startup when you restore, you can perform a hot backup by executing a FLUSH TABLES WITH READ LOCK, then an XFS freeze, then the snapshots. Otherwise, your best bet is a graceful shutdown of MySQL followed by an XFS freeze and a snapshot. ec2-consistent-snapshot will do both of those types of backups for you. Other server products would require you to roll your own hot or cold backup routine, but the XFS freeze is the key.
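
    Stripped down to raw commands, the hot-backup sequence looks roughly like this (a sketch of what ec2-consistent-snapshot automates; the mount point and volume IDs are placeholders):

    -- in a mysql session that stays open for the duration
    -- (the lock is released as soon as that connection closes):
    FLUSH TABLES WITH READ LOCK;

    # then, from a shell:
    xfs_freeze -f /data                # freeze the XFS filesystem
    ec2-create-snapshot vol-11111111   # one snapshot per EBS volume in the array
    ec2-create-snapshot vol-22222222
    xfs_freeze -u /data                # thaw

    -- back in the mysql session:
    UNLOCK TABLES;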

    When you restore, you just have to create/attach volumes from all the snapshots and use mdadm --assemble. Normally this means a resync period afterwards, which will cause additional I/O load, but the restore is nearly instantaneous.

  • Kobi biton says:

    Aaron, thanks for the detailed reply. I probably will need to stop the Splunk server for a consistent snapshot. I have a conf. call with their eng. and will ask this question. Worse comes to worst, I will use LVM2, which deals with the consistency issue. Will update.

    Thanks!
    Kobi.

  • Lawrence Pit says:

    Awesome post. Question: as the volumes created are virtualized by Amazon, is there a chance that all 8 or 10 are created on the exact same physical disk, so that when that physical disk loses, say, power, the complete RAID10 array is down? Any tricks to ensure the created volumes are in fact on different physical disks and power supplies? (Or does this go beyond the purpose of RAID?)

    • Aaron says:

      Unfortunately, there is no way to control where your disks are spun up in the Amazon infrastructure, but I have never experienced the issue you are describing. However, Amazon has extensive redundancy within their infrastructure – it’s not as if an EBS volume == a single SATA drive (even though performance is similar). Behind the scenes, there is extensive mirroring performed by Amazon.

    • Actually, that risk is reduced by the way Amazon mirrors EBS volumes. If the physical drive that contains your EBS volume goes down, another physical drive will take over. I haven’t experienced or heard of any noticeable interruptions caused by this method.

      However, if you want to eliminate the risk of your EBS volumes being backed by the same physical drive, you have the possibility of choosing 1 TB volumes instead of smaller ones. It entirely eliminates multi-tenancy, as suggested by Adrian Cockcroft:
      http://perfcap.blogspot.nl/2011/03/understanding-and-using-amazon-ebs.html

      However, setting up a RAID10 stack with ten 1TB drives will set you back about $1,000 per server per month, which might hold you back.

    • Ryan says:

      I know this is old, but I just wanted to add something. Yes, it’s possible (albeit extremely, extremely unlikely) that all active data is running on a single server. However, that doesn’t mean it won’t function if the server fails. There are mirrors to other servers in the same AZ, with the control plane determining the active copy. The fact that all active copies could be on the same disk doesn’t have a meaningful effect compared to all data being on different disks.

  • Thanks for the great post, Aaron. The Percona Live video was very helpful too; I wish I’d been able to make it to that conference.

    Any advice on how to grow an EBS RAID10 array? I’ve seen comments online that it’s both impossible and possible. I assume it’s something along the lines of stopping the md, snapshotting the volumes, creating new larger volumes from the snapshot, attaching those in place of the originals, and then some magic I’m missing. Any ideas?

    I get this error when trying to re-assemble:

    mdadm: no recogniseable superblock on /dev/sdm1
    mdadm: /dev/sdm1 has no superblock - assembly aborted

    Thanks again, and rock on!

  • Aaron says:

    @Joshua – as far as I know, it’s possible, but you have to mess with the superblock. I have seen solutions that simulate RAID10 by creating multiple RAID1 arrays and using LVM to effectively RAID0 the RAID1 arrays, thus creating a hybrid RAID10. I have not tried it myself, but it seems like a good solution. It also allows you to know exactly which disks can be safely failed out, something that is difficult to discern with md.
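
    In rough strokes, that hybrid might look like this (an untested sketch; device names, sizes, and LVM names are placeholders):

    # build RAID1 pairs, then stripe across them with LVM (effectively RAID10)
    mdadm --create /dev/md1 -l1 -n2 /dev/sdh1 /dev/sdh2
    mdadm --create /dev/md2 -l1 -n2 /dev/sdh3 /dev/sdh4
    mdadm --create /dev/md3 -l1 -n2 /dev/sdh5 /dev/sdh6
    mdadm --create /dev/md4 -l1 -n2 /dev/sdh7 /dev/sdh8
    pvcreate /dev/md1 /dev/md2 /dev/md3 /dev/md4
    vgcreate data_vg /dev/md1 /dev/md2 /dev/md3 /dev/md4
    lvcreate -i4 -I64 -l 100%FREE -n data_lv data_vg   # -i4 stripes across the 4 mirrors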

    My next post, when I get some time to finish it up, is going to cover expanding disks. Maybe I’ll test and add this LVM piece in there.

  • Gunners says:

    @Joshua – I think I had a similar experience (not sure, because I don’t remember the exact error message). The thing that fixed it for me was this command:
    mdadm --assemble --scan

    Our configuration includes RAID 10 setup with 4 drives (each of 100 GB giving a total capacity of 200G). We are not sure how much data we will get, but I think a day will come when we will go past 200G. So, I was thinking of putting this RAID10 structure under LVM from the start.

    Then, when the day comes, I will just add 4 new drives (of some xxx size), add the new RAID10 device to my volume group, and extend the logical volume.

    I have little clue how sound this strategy is. Any comments here will help. Thanks.
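
    Roughly what I have in mind (a sketch; device names, sizes, and LVM names are made up):

    # /dev/md0 (the existing 4-volume RAID10) is already the only PV in data_vg
    mdadm --create /dev/md1 -l10 -n4 /dev/sdi1 /dev/sdi2 /dev/sdi3 /dev/sdi4
    pvcreate /dev/md1
    vgextend data_vg /dev/md1
    lvextend -l +100%FREE /dev/data_vg/data_lv
    xfs_growfs /data    # or resize2fs, depending on the filesystem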

  • Henri says:

    EBS volumes are SAN partitions, aren’t they? And Amazon is pretty clear about the fact that each EBS volume has redundancy built in and that adding additional redundancy is not particularly recommended (at that point, other failure modes dominate anyway).

    Did any of you get errors/problems with EBS drives already (apart from the major zone failure we had some months ago)?

    I’d be very interested in LVM of EBS vs. RAID of EBS.

    Cheers

  • Aaron says:

    Henri – as I mentioned in my article, the issue is not about drive failure in the conventional sense. Amazon does protect you against a SATA drive crashing. What Amazon does not protect against is sudden performance degradation or failure of the EBS technology itself.

    Theory is great, but I was in the trenches when EBS failed. When seemingly half the Internet was down for 4 days, we were only out for 45 minutes or so in the middle of the night. This was entirely because of our replication topology and usage of EBS with RAID10.

  • Marc says:

    Anyone know of a way to do this with Windows instances? Thanks.

  • Sabya says:

    Hello Aaron, thanks for sharing this awesome post. Could you explain how to grow the EBS array?

  • Adam says:

    First of all, thanks for the quick and simple guide on how to set all of this up!

    There’s just one small thing I wanted to point out, possibly saving someone else the headache in the future:
    For context: I’m running an Amazon Linux AMI and used 4 EBS volumes to create the RAID 10 array. After completing the steps above and rebooting the instance, the device “/dev/md0” disappears, and mdadm creates “/dev/md127” instead. I’ve seen other people complain about this problem (bug?) with other distros, but the specifics of it are beyond my understanding of how mdadm works.

    Luckily there’s a simple fix! After completing all of the steps above, run the following:
    mkdir /etc/mdadm # The Amazon Linux AMI doesn't have this folder by default
    mdadm -Es > /etc/mdadm/mdadm.conf
    This will make sure that your RAID array is properly re-assembled on reboot.

    I know you did say this isn’t a complete guide to RAID on Linux, but perhaps this would be worth including as the final step…

    Thanks again,
    Adam

  • Aaron says:

    Thanks Adam – I didn’t know about that command!

  • Muhammad says:

    Great writeup Aaron!

    One question: what if the instance on which you have configured the RAID10 array crashes? Is there a workaround for that, apart from snapshotting the logical volume and backing it up to S3?

    • Aaron says:

      Hi Muhammad,

      If the instance crashes, you just spin up a new instance in the same AZ, detach the volumes from the old instance, attach them to the new instance, and reassemble the RAID array with `mdadm --assemble /dev/md0 /dev/sdX*`.

  • Rudi Meyer says:

    Thanks for sharing this useful information. I have gone back and forth on the EBS/RAID question; this took me a lot further!

    I’m using the “force-detach” EBS command to simulate a failed EBS drive, and this seems to hang a Linux system whether you use LVM, software RAID, or a direct mount, which is what led me to this post.
    Maybe it’s not a good testing method? How would you test out a scenario where EBS disappoints you?

    • Aaron says:

      I haven’t had a problem with a force detach causing the system to hang. Perhaps there is a problem with the AMI that you’re using?

      • Rudi Meyer says:

        This is possible; I’m using an official Ubuntu AMI. I ran the same test on a Red Hat distribution without problems.
        What AMI are your systems based upon?

  • Neal says:

    Thanks, this is definitely helpful. I wondered what you thought of the new provisioned IOPS EBS volumes. Would you incorporate both those *and* a RAID10 config like you describe here to get better I/O with the ability to recover from failures? Or another recipe?

  • Malcolm says:

    Aaron, with Ubuntu 12.04 I’ve seen raid sets come up as /dev/md127 even though I had a proper mdadm.conf. I never saw it on 8.04 or 10.04.

    Running ‘sudo update-initramfs -u’ will resolve that; however, you can get into a bad place if you have your RAID set mounted via /etc/fstab and your instance restarts before the initramfs is updated.

    Just wanted to note that I’ve seen what Adam mentioned before.

  • Dan Pritts says:

    Is it possible to set up an AWS VM to boot from a software RAID? My google-fu is failing me on this one…thanks

  • And what would you do if you have set up MySQL with a replica slave? Would you still use RAID 10 or RAID 0? If something goes wrong with the RAID0 on the master database, you can switch to the slave database, take some time to recreate/fix the RAID0 on the master, and then switch back to the master database.

  • elkay14 says:

    Make sure you use a 1.1 superblock. If you use 1.0 or under, the superblock is placed at the END of the volume. Once you snapshot and resize, it isn’t at the end anymore…

    1.1 superblocks are at the beginning of the volume.

    See: https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#Sub-versions_of_the_version-1_superblock
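
    In other words, pass --metadata=1.1 (or -e 1.1) at creation time; for example, for the 8-disk array above:

    mdadm --create /dev/md0 --metadata=1.1 -l10 -n8 /dev/sdh*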

  • BRR says:

    Hello, good info. One important question though: can multiple EC2 instances use this RAID10 EBS storage? Not sure if that is allowed; perhaps only one EC2 instance can use the RAID10 EBS storage thus created.

    • Unfortunately not: an EBS device is block-level storage and cannot be attached to more than one EC2 instance at a time. You could, however, run some kind of network file system (like NFS) on top of your EBS volume, enabling other EC2 instances to write to and read from it.
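
      A minimal NFS export of the array’s mount point might look something like this (the path and network are placeholders, and you’d need an NFS server package such as nfs-kernel-server installed):

      # /etc/exports on the instance that owns the RAID10 array
      /data 10.0.0.0/16(rw,sync,no_subtree_check)
      # then reload the export table:
      exportfs -ra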
