ANALYZE TABLE is replicated. RTFM.

Sometimes, I make mistakes. It’s true. That can be difficult for us Systems Engineering-types to say, but I try to distance myself from my ego and embrace the mistakes because I often learn the most from them. ..Blah, blah, school of hard knocks, blah, blah…. Usually my mistakes aren’t big enough to cause any visible impact, but this one took the site out for 10 minutes during a period of peak traffic due to a confluence of events.

Doh!

Here is how it went down…

We have an issue where MySQL table statistics are occasionally getting out of whack, usually after a batch operation. This causes bad explain plans, which in turn cause impossibly slow queries. An ANALYZE TABLE (or even SHOW CREATE INDEX) resolves the issue immediately, but I prefer not get woken up at 4AM by long running query alerts when my family and I are trying to sleep. As a way to work around the issue, we decided to disable InnoDB automatic statistic calculations by setting innodb_stats_auto_update=0. Then, we would run ANALYZE TABLE daily (via cron) during a low traffic period to force MySQL to update table statistics. This creates more stable and predictable query execution plans and reduces the number of places where we have to add explicit USE/FORCE/IGNORE INDEX clauses in the code to work around the query optimizer.

To accomplish this, I wrote a very simple shell script that runs ANALYZE TABLE against all InnoDB tables. After testing it in a non-production environment, it was pushed out to our passive (unused) master database with puppet. Because it was going to execute in the middle of the night for the first time, I decided to run it by hand once on our passive master database just to make sure everything was kosher. Call me a wimp, but I don’t like getting up in the middle of the night because my script took the site down (see comment about family and sleeping). We run our master/master databases in active/passive mode, so testing this on the passive server was a safe move.

Theoretically.

A little background on ANALYZE TABLE on InnoDB tables: All it really does is force a recalculation of table statistics and flush the table. A read lock is held for the duration of the statement, so you want to avoid running this on a customer-facing server that is taking traffic. Because the table is flushed, the next thread that needs to access the table will have to open it again. On our servers with FusionIO cards, it takes about 5 seconds to run ANALYZE TABLE on over 250 tables. All this was fine in Myopia City, because I was running this on the passive server.

Meanwhile, in another zip code, someone was testing out a SELECT against a production data set…

While I was testing my ANALYZE TABLE script, I receive an ominous, “yt?” message in Skype.

(Sidebar: In the history of Operations, has an engineer ever received a “yt?” message that lead to something awesome? Like, “yt? We’re going to send you a batch of fresh baked cookies every day for the next month.” That never happens.)

So, now I’m in a call. SITE DOWN! OMFGWTFBYOB!!! (No, it wasn’t like that. Really, we’re pretty cool-headed about stuff like this). This outage appeared to be database related. I logged in and checked the process list to see what was running:

mysql> SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE INFO <> 'NULL' ORDER BY TIME;
*************************** 1. row ***************************
           ID: 19210373
         USER: me
         HOST: localhost
           DB: production
      COMMAND: Query
         TIME: 0
        STATE: executing
         INFO: SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE INFO <> 'NULL' ORDER BY TIME
      TIME_MS: 0
*************************** 2. row ***************************
           ID: 19210713
         USER: user
         HOST: 10.x.x.x:59900
           DB: production
      COMMAND: Query
         TIME: 1
        STATE: Waiting for table
         INFO: SELECT * FROM `table` WHERE (`table`.`l_id` IN (3,11,15,7)) AND (`table`.s_id = 1234)
      TIME_MS: 1474
*************************** 3. row ***************************
           ID: 19154978
         USER: user
         HOST: 10.x.x.x:45915
           DB: production
      COMMAND: Query
         TIME: 1
        STATE: Waiting for table
         INFO: SELECT count(*) AS count_all FROM `table` WHERE (`table`.sku_id = 2345)
      TIME_MS: 3737

180 more queries in "Waiting for table" state
*************************** 181. row ***************************
           ID: 19203223
         USER: user
         HOST: 10.x.x.x:34299
           DB: production
      COMMAND: Query
         TIME: 607
        STATE: Waiting for table
         INFO: SELECT * FROM `table` WHERE (`table`.s_id = 4567)
      TIME_MS: 606530
*************************** 182. row ***************************
           ID: 19203223
         USER: user
         HOST: 10.x.x.x:34299
           DB: production
      COMMAND: Query
         TIME: 607
        STATE: Waiting for table
         INFO: SELECT * FROM `table` WHERE (`table`.s_id = 4567)
      TIME_MS: 606530
*************************** 182. row ***************************
           ID: 19198325
         USER: user
         HOST: 10.x.x.x:56399
           DB: production
      COMMAND: Query
         TIME: 712
        STATE: Sending data
         INFO: SELECT RUN_LONG_TIME FROM `table`
      TIME_MS: 711545

view raw gistfile1.sql This Gist brought to you by GitHub.

(queries modified to protect the guilty)

That’s…strange. The RUN_LONG_TIME query seems to be blocking all the other queries on that table. But it’s just a SELECT. I looked at SHOW ENGINE INNODB STATUS and it didn’t have anything interesting in it. There were no row or table locks, no UPDATE/INSERT/DELETE, or SELECT FOR UPDATE queries, and innodb_row_lock_waits was not incrementing. A colleague noted that there were a lot of entries in the MySQL error log, so I looked at that and found (amongst the clutter):

83109 production.table Locked - write High priority write lock
83109 production.table Locked - read Low priority read lock

view raw gistfile1.txt This Gist brought to you by GitHub.

We were in an outage and the most important thing at this point was to resume selling shoes, dresses, and lingerie, so I collected as much data as I could for later review, dumped it into Evernote and killed the RUN_LONG_TIME query. Bam, the queries in “Waiting for table” state finished and the site came back online. Had that not solved the problem, another team member had his finger on the “fail over to the other server” button.

Outage over. Phew.

But, as my toddler likes to say — “What just happened?” The RUN_LONG_TIME query was expensive, but it shouldn’t have been blocking other queries from completing. First step, I went to a reporting server and tried to reproduce it:

session1> SELECT RUN_LONG_TIME FROM table;
session2> SELECT * FROM table WHERE id = 123

view raw gistfile1.sql This Gist brought to you by GitHub.

All copasetic. What’s next, chief?

Time to look at some graphs. Because we have the complete output of SHOW GLOBAL STATUS logging to Graphite every few seconds, it is easy see what the server is doing at any given time. (You should do that, too. It’s incredibly valuable.) I started poking around at the charts on the active server and noticed a few oddities:

There was a lot of InnoDB buffer pool activity – several graphs looked like this:

That made sense, as the RUN_LONG_TIME query was sifting through a lot of data. A lot of data. A lot. 14 quadrillion rows, in my estimate.

After seeing that pattern across a number of other stats, I started poking through the Com_* variables. Com_analyze looked like this:

What fool ran ANALYZE TABLE a bunch of times at peak traffic on the active database!? This is where I contracted a case of the RTFMs. As it turns out, ANALYZE TABLE statements are written to the binary log and thus replicated unless you supply the LOCAL key word (ANALYZE LOCAL TABLE).

I had not supplied that keyword.

As a result of my missing keyword, the ANALYZE TABLE statements replicated to the active server during peak traffic periods while a very long running query was in progress. Intuitively that still shouldn’t have caused this behavior. ANALYZE TABLE takes less than a second on each table. But that isn’t the whole story…

Back to the reporting server to attempt to reproduce the behavior:

session1> SELECT RUN_LONG_TIME FROM table;
session2> ANALYZE TABLE table;
session3> SELECT * FROM table WHERE id=123;

view raw gistfile1.sql This Gist brought to you by GitHub.

The statement in session3 hung and was in “Waiting for table” status. Success (at failure)!

What happened is the ANALYZE TABLE flushed the table, which tells InnoDB to close all references before allowing access again. Because there was a query running while ANALYZE TABLE was executing, MySQL had to wait for the query to complete before allowing access from another thread. Because that query took so long, everything else hung out in “Waiting for table” state. The documentation on this point sort of explains the issue, though it is a little muddy:

The thread got a notification that the underlying structure for a table has changed and it needs to reopen the table to get the new structure. However, to reopen the table, it must wait until all other threads have closed the table in question.

This notification takes place if another thread has used FLUSH TABLES or one of the following statements on the table in question: FLUSH TABLES tbl_name, ALTER TABLE, RENAME TABLE, REPAIR TABLE, ANALYZE TABLE, or OPTIMIZE TABLE.

I explained the sequence of events and root cause to our team and also publicly flogged myself a bit. As it turns out, this issue only happened because of the combination of two different events happening simultaneously. The ANALYZE TABLE alone wouldn’t have been a big deal had there not also been a very long running query going at the same time.

I have a few take-aways from this:

  • If you make a mistake, fess up. That’s a lot better than covering it up and having someone find out about it later. People understand mistakes.
  • Mistakes are the best chances for learning. I can assure you, that I will never, ever forget that ANALYZE TABLE writes to the binary log.
  • Measure everything that you can, always. Without the output of SHOW GLOBAL STATUS being constantly charted in Graphite, I would have been blind to any abnormalities.
  • During an outage, resist the temptation to just “fix it” before grabbing data to analyze later. Pressure is on and getting things running is very high priority, but it is even worse if you fix the problem, don’t know why it occurred, and end up in the same situation again a week later.
  • Try not to perform seemingly innocuous tasks on production servers at peak times.
  • RTFM. Always. Edge cases abound in complex software.
Share

Analyzing HTTP traffic with tcpdump and Percona’s pt-tcp-model

I recently ran into an issue where our request throughput was showing very erratic and spikey behavior despite very smooth response times from the application servers. Using Splunk, we analyzed every log that we had: nginx, haproxy, apache, and the application logs themselves and we were seeing similarly spikey throughput. Because those tools all log upon request completion, there was no way to determine from the logs themselves whether it was one tier of the stack in particular that was delaying request arrival, or if it the spikes were endemic to the traffic we were receiving.

So, we decided to perform some analysis of the raw tcp data on the edge server using a couple of tools. First, was tcpdump, which is a tool that should be in every SysAdmin’s arsenal.

First, grab all the traffic on the interface and write it to a pcap formatted file:

# tcpdump -c 200000 -w output.pcap -i any

This command will capture 200k packets from any interface and write them to output.pcap, which can be later analyzed with a variety of tools, including tcpdump and wireshark.

All we care about is the actual packet count and only for “real” packets (no SYN/ACKs) on port 80. Extract this data from the capture we just made:

# tcpdump -r output.pcap -s 384 -i any -nnq -tttt \
      'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)< <2))
     - ((tcp[12]&0xf0)>>2)) != 0)' > port80.txt

(I stole this command from the pt-tcp-model documentation and honestly have not dived into the details of how the part after ‘tcp port 80′ actually works).

Producing some data that looks like this (IPs hidden to protect the innocent):

...
2011-10-10 12:49:02.662951 IP 175.x.x.x.ppppp > 10.x.x.x.x: tcp 37
2011-10-10 12:49:02.662958 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 1380
2011-10-10 12:49:02.662963 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 80
2011-10-10 12:49:02.662965 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 463
2011-10-10 12:49:02.662968 IP 206.x.x.x.ppppp > 10.x.x.x.x: tcp 516
...

With a little bash, we can aggregate this data per second (I’m sure that there is a much more concise way of doing this, but it gets the job done):

# cut -c 12-21 port80.txt  |awk '{print $1}' |  sort | uniq -c  | awk '{print $2 " " $1}' > packets_per_sec.txt

The output looks like this…packets per second grouped by second:

12:49:02 912
12:49:03 2617
12:49:04 2277
12:49:05 1994
12:49:06 2120
12:49:07 2192

Next, it’s some simple gnuplot magic to chart it all out. Here’s the plot file:

set title "TCP Port 80 packets/sec"
set terminal aqua enhanced title "TCP Port 80 - packets/sec"
set xdata time
set xlabel "Time (EST)"
set timefmt "%H:%M:%S"
set format x "%H:%M:%S"
set ylabel "per second"
set datafile separator " "

set style line 1 linecolor rgb "#000000" lw 1

plot 'packets.txt' using 1:2 title "packets" with line ls 1

set terminal png font "/Library/Fonts/Arial.ttf"
set output "packets_count.png"

replot

Run that with

# gnuplot file.plot

If you are on a Mac, AquaTerm will probably pop up and show you the graph. If not, you can open the packets_count.png file. What I got, looked like this:

Packets/Sec

Packets/Sec

Ugly, eh? I’ve pointed out some problem areas with arrows. The rate of packet arrivals is incredibly variable – much more so than I would expect when the website is receiving 10s of thousands of requests per minute. At such rough granularity, I would expect to see a much smoother line.

That’s great and all…clearly there is a problem, but packets != requests. I want to know how this affects the end user.

Enter pt-tcp-model from the Percona Toolkit (formerly Maatkit by Baron Schwartz, Chief Performance Architect of Percona and author of High Performance MySQL).

In short, this tool will take data from tcpdump and convert the data in the packet headers into time sliced buckets with the number of request arrivals, completions, and other summary data. That can then be charted with gnuplot (or Excel, if you are so inclined) to get some pretty interesting results. To better understand the tool, I recommend you read the documentation and watch Baron’s presentation about Measuring Scalability and Performance with TCP.

First, following the directions verbatim, extract the data into requests and their response times, and slice that into 1 second intervals. One thing to note is that if your source data (port80.txt) contains more than about 300k lines, the tool starts to bog down a bit, so I’d recommend trying to work with smaller samples.

# pt-tcp-model port80.txt > requests.txt
# sort -n -k1,1 requests.txt > sorted.txt
# pt-tcp-model --type=requests --run-time=11 sorted.txt > sliced.txt

Now, you have sliced.txt which looks something like this:

1318265342 18.49   578.542   195   171 0.337054 6.232731 9.455878 0.190278 0.443786 0.337054
1318265343 27.51   527.000   527   526 1.000000 27.511997 27.842869 0.348546 0.583430 1.000000
1318265344 20.75   504.000   504   509 1.000000 20.748874 26.378252 1.166846 0.766312 1.000000
1318265345 23.96   461.000   461   462 1.000000 23.963005 32.181679 3.929070 0.679943 1.000000
1318265346 23.60   438.000   438   423 1.000000 23.595860 26.154968 0.421166 0.939690 1.000000

The columns are in the documentation, but in this case, I’m mostly interested in graphing time vs the number of complete requests arriving (columns 1 and 4).

Here’s some gnuplot for that:

set title "TCP Port 80 - arrivals/sec"
set terminal aqua enhanced title "TCP Port 80 - arrivals/sec"
set xdata time
set xlabel "Time (UTC)"
set timefmt "%s"
set format x "%H:%M:%S"
set ylabel "per second"
set datafile separator " "
set style line 1 linecolor rgb "#000000"

plot 'sliced.txt' using 1:4 title "arrivals" with line ls 1

set terminal png font "/Library/Fonts/Arial.ttf"
set output "lblockups.png"

replot

And here is the chart that I ended up with:

Arrivals/sec

Arrivals/sec

Now…that…is…ugly. When I dig into the raw data, the jagged packet arrival rate is frequently causing 1-2 second delays in the arrival rate of individual requests to our edge server. That is before we even get to nginx, so our app server had no hope. What this means, from a performance standpoint, is that the application stack has to be able to accommodate the huge influx of traffic after the lull. This presents a scalability nightmare, especially for an e-commerce website heading into the holiday season.

Time to call the data center…

Share

RAID 10 your EBS data

When I spoke at Percona Live (video here) on running an E-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes. At first, this seems counter intuitive. Amazon has a robust infrastructure, EBS volumes run on RAIDed hardware, and are mirrored in multiple availability zones. So, why bother? Today, I was reminded of just how important it is. Please note that all my performance statistics are based on direct experience running a MySQL database on a m2.4xlarge instance and not on some random bonnie or orion benchmark. I have those graphs floating around on my hard drive in glorious 3D and, while interesting, they do not necessarily reflect real-life performance.

Why? Part 1. Performance

Let’s get to the point. EBS is cool and very very flexible, but nominal performance is poor and highly variable with average latencies (svctime in iostat) in the 2-10ms range . At its heart, EBS is Network Attached Storage and shares bandwidth with your instance NIC. At best, I see 1.5ms svctime and 10ms await, and at worst…well, at worst you don’t need ms precision to measure it. On top of that, a single EBS volume seems to peak out at around 100-150 iops, which is about what one would expect from a single SATA drive. That’s fine if you’re running a low-traffic website with very little disk activity, but once the requests start to come in, things get a little squirrelly. Add in multi-tenancy and a noisy neighbor can really beat your disk into submission.

So, what’s a lowly Systems Engineer to do when the iowait time starts to pile up? Well, it turns out that those IOPs are initially bound by the disk on the backend and not local NIC traffic, so you can use Linux Software RAID to significantly improve the I/O capacity of your disk (but not the latency or variability…more on this later). For a performance boost, there is a lot of bad advice on the Internet saying you should RAID 0 your disk (because “it’s redundant on the back end”), but to the the discriminating SysEng, that should scream bad idea.

Why? Part 2. Redundancy

Right, so EBS is RAIDed and mirrored in multiple availability zones on the back end, so why do I need to worry about redundancy? That’s great and all, but with the EBS cool factor comes additional complexity and new and unexpected failure modes. The first and most obvious was #ec2pocalypse, otherwise known as the Great Reddit Fail of 2011. If you’re not aware of what happened (and the details are somewhat irrelevant), but a couple months back someone pressed the wrong button at Amazon and a significant percentage of EBS volumes became “stuck” showing 100% utilization and no iops. This failure lasted several days and took out a large number of websites that based their infrastructure on EBS. Most of the data itself was recovered, but a small percentage of people were SOL. So much for redundancy.

Enter RAID10. Yes, it’s slower than RAID0 because you have to write twice. Yes, you are bound by the worst performing disk in the array. But, you can get nearly 1:1 increase in IOPs (up to a point) and gain the ability to recover your data when Amazon drops the ball.

You need proof? “Give me an example,” you say? Let’s talk about what happened to me today. Everything was just peachy all day – performance was within parameters and then at 3:15PM, all of a sudden the database started having random query pile ups. Being in EC2, this was not unexpected, but it kept happening. Traffic was on a decline, but we were expecting big traffic in an hour or so. So, I started looking at the disk. We have a 10-drive RAID10 array on our master DB and 1 of those disks was showing svctime in the 30-100ms range, vs 2-10ms on all the others. BINGO!

I didn’t save the actual iostat output, but sar showed this:

03:15:01 PM DEV       tps avgqu-sz  await svctm %util
03:35:01 PM dev8-133 7.78     0.11  13.49  2.28  1.77
03:35:01 PM dev8-130 6.54     0.09  14.14  2.27  1.48
03:35:01 PM dev8-149 8.34     0.11  12.62  2.08  1.74
03:35:01 PM dev8-132 7.67     0.10  13.29  1.98  1.52
03:35:01 PM dev8-131 8.66     0.11  12.27  1.91  1.65
03:35:01 PM dev8-147 7.13     0.10  13.77  2.13  1.52
03:35:01 PM dev8-129 7.58     0.08  10.56  1.73  1.31
03:35:01 PM dev8-148 8.47     4.30 506.96 54.77 46.36
03:35:01 PM dev8-146 8.17     0.08   9.28  1.38  1.13
03:35:01 PM dev8-145 6.70     0.26  39.36  6.87  4.60

dev8-148 sure looks fishy, eh? (Oh, side note…to align this data all pretty-like, I used the aptly named align, a great tool from the Aspersa Toolkit)

Had this been a single volume EBS or RAID0 volume, we would have been forced to perform a database failover to a secondary master and redirect the application, which would have interrupted sales briefly during an active time. Instead, thanks to RAID10, we have options. Instead of a failover during a period of relatively high traffic, we simply failed out the problem drive. Now we were running on 9 drives and with reduced redundancy, but performance immediately recovered and the stalls stopped. We can replace the drive later and resync the array when traffic is low.

How?

First, you need to create and attach “a bunch” of volumes to your instance. How many? I’ve seen diminishing returns after 8-10 disks, but your mileage (and instance size) may vary. Typical RAID10 rules apply here…you need 2x the total capacity and each disk has to equal 2*(capacity)/(num disks), so if you need 1TB usable and want to use 8 disks, you will need each disk to be 256GB.

Here’s some code to do that. It creates 8x256GB volumes in the us-east-1a zone and then attaches them to instance i-1a2b3c4d

for x in {1..8); do \
  ec2-create-volume --size 256 --zone us-east-1a; \
done > /tmp/vols.txt

(i=0; \
for vol in $(awk '{print $2}' /tmp/vols.txt); do \
  i=$(( i + 1 )); \
  ec2-attach-volume $vol -i i-1a2b3c4d -d /dev/sdh${i}; \
done)

Then, you need to install Linux Software RAID. On Debian or Ubuntu:
apt-get install mdadm

Then, create a new RAID 10 (-l10) volume from 8 disks (-n8):
mdadm --create -l10 -n8 /dev/md0 /dev/sdh*

With any luck, you’ll get a message saying that the array was started. You can verify this by looking at /proc/mdstat and you should see something like this (the numbers in this example are probably off. I pulled them together from some random machines)

cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdh6[5] sdh5[4] sdh4[3] sdh3[2] sdh2[1] sdh1[0]
      1048575872 blocks 64K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  resync = 13.3% (431292736/3221225280) finish=7721.9min speed=6021K/sec

Your disk will spend a lot of time and IOPs resyncing, but you can format /dev/md0 and mount it right away.

This wasn’t meant as a complete guide to Linux Software RAID – if you want to know more, check out The Software-RAID HOWTO.

The Bad

Ok, so the observant among you will realize that by having 8 or 10 disks in the array, all with the potential to have severe performance degradation like this, I have drastically increased the variability of latency. Well, you would be right, but…

  1. I can’t get IOPs any other way in EC2
  2. It is easy to recover from the most common failure mode with this setup
  3. If you care about your data at all, RAID0 (or no RAID) is doing it wrong

Remember, kids…Friends don’t let friends RAID0.

Share

Percona Live NYC 2011

A couple weeks back, I had the fortune of co-speaking at Percona Live NYC 2011 with Mark Uhrmacher, CTO of ideeli on the subject of running an E-commerce site with MySQL in the cloud. Interestingly, and a sign of the times, this was also the first time that I had ever met Mark, despite having worked for him for close to a year since I telecommute from home.

Here are the slides from that talk.

What I wanted attendees to get out of our talk was that you have to expect and plan for all sorts of failure situations when your database is in the cloud. Relative to conventional hosting or datacenters, things in the cloud break more frequently and in ways that are out of your control. AWS gives you the tools to plan and recover from these failures much more easily than having to put redundant physical servers in multiple geographic locations, but they also fail more often.

So, here are a few take-aways, mentioned in the slides

  • RAID 1 or RAID 10 (1+0) your EBS volumes

    Yes, EBS volumes are redundant on the back end, in a data center controlled by Amazon. However, the great EBS outage of 2011 (#ec2pocalypse) proves that you cannot entrust your data to a single technology that is out of your control. Had we RAID0′d our data set, we would have been in much worse shape, because we would have to completely rebuild many of our data sets from backup. So, no, you should not RAID0 (which should rightfully be called AID0, since the R is a fallacy). Yes, you take a performance hit, and you have to deal with lowest-common-denominator performance of the EBS volumes, but the ability to remove a failed or poorly performing EBS volume without losing your data more than makes up for that compromise. With 10 EBS volumes in a RAID 10 configuration, we max out at around 1200-1500 iops. Poor performance relative to physical hardware, but it is manageable.

    If you care about your data, never ever use RAID0. You might as well just point it at /dev/null, which as we know is webscale. Friends don’t let friends RAID0

  • Make sure your important data lives in multiple availability zones and multiple regions.

    During #ec2pocalypse, several instances were able to be recovered by simply pointing the application at data that already existed in another zone.

  • Don’t cross availability zones and regions between your ultimate master and your disaster recovery node.

    If so, (and we were bit by this), you may end up with out of date disaster recovery nodes if your distribution slave is in an affected availability zone. Keep replication chains short and all in one zone/region, except for the DR node, which should be somewhere outside of the master’s zone/region.

  • AWS snapshot backups are awesome. But they don’t help if the API is down. Make sure your data lives in multiple places where you can get at it in an emergency.

Also, I’d just like to say that Percona Live was a great conference. There were some incredibly informative talks. My favorite, by far, was Baron Schwartz’s discussion on using tcpdump to analyze server performance and predict scalability. I was honored to speak in front of a crowd where the average person in the room knows far more about MySQL than I do.

Share