<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The 9 Minute Snooze &#187; Aaron</title>
	<atom:link href="http://blog.9minutesnooze.com/author/aaron/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.9minutesnooze.com</link>
	<description>Photography, Tech, and more</description>
	<lastBuildDate>Sat, 04 Feb 2012 23:03:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>ANALYZE TABLE is replicated.  RTFM.</title>
		<link>http://blog.9minutesnooze.com/analyze-table-replicated-rtfm/</link>
		<comments>http://blog.9minutesnooze.com/analyze-table-replicated-rtfm/#comments</comments>
		<pubDate>Sat, 04 Feb 2012 22:43:28 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ideeli]]></category>
		<category><![CDATA[innodb]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[outage]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=423</guid>
		<description><![CDATA[]]></description>
			<content:encoded><![CDATA[<p>Sometimes, I make mistakes.  It&#8217;s true.  That can be difficult for us Systems Engineering-types to say, but I try to distance myself from my ego and embrace the mistakes because I often learn the most from them.  ..Blah, blah, school of hard knocks, blah, blah&#8230;.  Usually my mistakes aren&#8217;t big enough to cause any visible impact, but this one took the site out for 10 minutes during a period of peak traffic due to a confluence of events.</p>
<p>Doh!</p>
<p>Here is how it went down…</p>
<p>We have an issue where MySQL table statistics are occasionally getting out of whack, usually after a batch operation.  This causes bad explain plans, which in turn cause impossibly slow queries.  An ANALYZE TABLE (or even SHOW CREATE INDEX) resolves the issue immediately, but I prefer not to get woken up at 4AM by long-running query alerts when my family and I are trying to sleep.  As a way to work around the issue, we decided to disable InnoDB automatic statistics calculations by setting <a href="http://www.percona.com/doc/percona-server/5.1/diagnostics/innodb_stats.html?id=percona-server:features:innodb_stats&#038;redirect=1">innodb_stats_auto_update=0</a>.  Then, we would run ANALYZE TABLE daily (via cron) during a low traffic period to force MySQL to update table statistics.  This creates more stable and predictable query execution plans and reduces the number of places where we have to add explicit USE/FORCE/IGNORE INDEX clauses in the code to work around the query optimizer.</p>
<p>To accomplish this, I wrote a very simple shell script that runs ANALYZE TABLE against all InnoDB tables.  After testing it in a non-production environment, it was pushed out to our passive (unused) master database with puppet.  Because it was going to execute in the middle of the night for the first time, I decided to run it by hand once on our passive master database just to make sure everything was kosher.  Call me a wimp, but I don&#8217;t like getting up in the middle of the night because my script took the site down (see comment about family and sleeping).  We run our master/master databases in active/passive mode, so testing this on the passive server was a safe move.</p>
<p>Theoretically.</p>
<p>A little background on ANALYZE TABLE on InnoDB tables: All it really does is force a recalculation of table statistics and flush the table.  A read lock is held for the duration of the statement, so you want to avoid running this on a customer-facing server that is taking traffic.  Because the table is flushed, the next thread that needs to access the table will have to open it again.  On our servers with FusionIO cards, it takes about 5 seconds to run ANALYZE TABLE on over 250 tables.  All this was fine in Myopia City, because I was running this on the passive server.  </p>
<p>Meanwhile, in another zip code, someone was testing out a SELECT against a production data set&#8230;</p>
<p>While I was testing my ANALYZE TABLE script, I received an ominous &#8220;yt?&#8221; message in Skype.</p>
<p>(Sidebar: In the history of Operations, has an engineer ever received a &#8220;yt?&#8221; message that led to something awesome?  Like, &#8220;yt?  We&#8217;re going to send you a batch of fresh baked cookies every day for the next month.&#8221;  That never happens.)</p>
<p>So, now I&#8217;m in a call.  SITE DOWN! OMFGWTFBYOB!!!  (No, it wasn&#8217;t like that.  Really, we&#8217;re pretty cool-headed about stuff like this).  This outage appeared to be database related.  I logged in and checked the process list to see what was running:</p>
<div id="gist-1740562" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="n">mysql</span><span class="o">&gt;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">INFORMATION_SCHEMA</span><span class="p">.</span><span class="n">PROCESSLIST</span> <span class="k">WHERE</span> <span class="n">INFO</span> <span class="o">&lt;&gt;</span> <span class="s1">&#39;NULL&#39;</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">TIME</span><span class="p">;</span></div><div class='line' id='LC2'><span class="o">***************************</span> <span class="mi">1</span><span class="p">.</span> <span class="k">row</span> <span class="o">***************************</span></div><div class='line' id='LC3'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">ID</span><span class="p">:</span> <span class="mi">19210373</span></div><div class='line' id='LC4'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">USER</span><span class="p">:</span> <span class="n">me</span></div><div class='line' id='LC5'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">HOST</span><span class="p">:</span> <span class="n">localhost</span></div><div class='line' id='LC6'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">DB</span><span class="p">:</span> <span class="n">production</span></div><div class='line' id='LC7'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">COMMAND</span><span class="p">:</span> <span class="n">Query</span></div><div class='line' id='LC8'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME</span><span class="p">:</span> <span class="mi">0</span></div><div class='line' id='LC9'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">STATE</span><span class="p">:</span> <span class="n">executing</span></div><div class='line' 
id='LC10'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">INFO</span><span class="p">:</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">INFORMATION_SCHEMA</span><span class="p">.</span><span class="n">PROCESSLIST</span> <span class="k">WHERE</span> <span class="n">INFO</span> <span class="o">&lt;&gt;</span> <span class="s1">&#39;NULL&#39;</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">TIME</span></div><div class='line' id='LC11'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME_MS</span><span class="p">:</span> <span class="mi">0</span></div><div class='line' id='LC12'><span class="o">***************************</span> <span class="mi">2</span><span class="p">.</span> <span class="k">row</span> <span class="o">***************************</span></div><div class='line' id='LC13'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">ID</span><span class="p">:</span> <span class="mi">19210713</span></div><div class='line' id='LC14'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">USER</span><span class="p">:</span> <span class="k">user</span></div><div class='line' id='LC15'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">HOST</span><span class="p">:</span> <span class="mi">10</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">:</span><span class="mi">59900</span></div><div class='line' id='LC16'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">DB</span><span class="p">:</span> <span class="n">production</span></div><div class='line' id='LC17'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">COMMAND</span><span class="p">:</span> <span class="n">Query</span></div><div class='line' 
id='LC18'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME</span><span class="p">:</span> <span class="mi">1</span></div><div class='line' id='LC19'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">STATE</span><span class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="k">table</span></div><div class='line' id='LC20'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">INFO</span><span class="p">:</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="o">`</span><span class="k">table</span><span class="o">`</span> <span class="k">WHERE</span> <span class="p">(</span><span class="o">`</span><span class="k">table</span><span class="o">`</span><span class="p">.</span><span class="o">`</span><span class="n">l_id</span><span class="o">`</span> <span class="k">IN</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">11</span><span class="p">,</span><span class="mi">15</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span> <span class="k">AND</span> <span class="p">(</span><span class="o">`</span><span class="k">table</span><span class="o">`</span><span class="p">.</span><span class="n">s_id</span> <span class="o">=</span> <span class="mi">1234</span><span class="p">)</span></div><div class='line' id='LC21'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME_MS</span><span class="p">:</span> <span class="mi">1474</span></div><div class='line' id='LC22'><span class="o">***************************</span> <span class="mi">3</span><span class="p">.</span> <span class="k">row</span> <span class="o">***************************</span></div><div class='line' id='LC23'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">ID</span><span class="p">:</span> <span class="mi">19154978</span></div><div class='line' 
id='LC24'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">USER</span><span class="p">:</span> <span class="k">user</span></div><div class='line' id='LC25'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">HOST</span><span class="p">:</span> <span class="mi">10</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">:</span><span class="mi">45915</span></div><div class='line' id='LC26'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">DB</span><span class="p">:</span> <span class="n">production</span></div><div class='line' id='LC27'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">COMMAND</span><span class="p">:</span> <span class="n">Query</span></div><div class='line' id='LC28'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME</span><span class="p">:</span> <span class="mi">1</span></div><div class='line' id='LC29'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">STATE</span><span class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="k">table</span></div><div class='line' id='LC30'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">INFO</span><span class="p">:</span> <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">count_all</span> <span class="k">FROM</span> <span class="o">`</span><span class="k">table</span><span class="o">`</span> <span class="k">WHERE</span> <span class="p">(</span><span class="o">`</span><span class="k">table</span><span class="o">`</span><span class="p">.</span><span class="n">sku_id</span> <span class="o">=</span> <span class="mi">2345</span><span class="p">)</span>                                        </div><div class='line' 
id='LC31'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME_MS</span><span class="p">:</span> <span class="mi">3737</span></div><div class='line' id='LC32'><br/></div><div class='line' id='LC33'><span class="err">…</span> <span class="mi">180</span> <span class="k">more</span> <span class="n">queries</span> <span class="k">in</span> <span class="ss">&quot;Waiting for table&quot;</span> <span class="k">state</span> <span class="err">…</span></div><div class='line' id='LC34'><span class="o">***************************</span> <span class="mi">181</span><span class="p">.</span> <span class="k">row</span> <span class="o">***************************</span></div><div class='line' id='LC35'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">ID</span><span class="p">:</span> <span class="mi">19203223</span></div><div class='line' id='LC36'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">USER</span><span class="p">:</span> <span class="k">user</span></div><div class='line' id='LC37'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">HOST</span><span class="p">:</span> <span class="mi">10</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">:</span><span class="mi">34299</span></div><div class='line' id='LC38'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">DB</span><span class="p">:</span> <span class="n">production</span></div><div class='line' id='LC39'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">COMMAND</span><span class="p">:</span> <span class="n">Query</span></div><div class='line' id='LC40'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME</span><span class="p">:</span> <span class="mi">607</span></div><div class='line' id='LC41'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">STATE</span><span 
class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="k">table</span></div><div class='line' id='LC42'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">INFO</span><span class="p">:</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="o">`</span><span class="k">table</span><span class="o">`</span> <span class="k">WHERE</span> <span class="p">(</span><span class="o">`</span><span class="k">table</span><span class="o">`</span><span class="p">.</span><span class="n">s_id</span> <span class="o">=</span> <span class="mi">4567</span><span class="p">)</span></div><div class='line' id='LC43'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME_MS</span><span class="p">:</span> <span class="mi">606530</span></div><div class='line' id='LC54'><span class="o">***************************</span> <span class="mi">182</span><span class="p">.</span> <span class="k">row</span> <span class="o">***************************</span></div><div class='line' id='LC55'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">ID</span><span class="p">:</span> <span class="mi">19198325</span></div><div class='line' id='LC56'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">USER</span><span class="p">:</span> <span class="k">user</span></div><div class='line' 
id='LC57'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">HOST</span><span class="p">:</span> <span class="mi">10</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">x</span><span class="p">:</span><span class="mi">56399</span></div><div class='line' id='LC58'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">DB</span><span class="p">:</span> <span class="n">production</span></div><div class='line' id='LC59'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">COMMAND</span><span class="p">:</span> <span class="n">Query</span></div><div class='line' id='LC60'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME</span><span class="p">:</span> <span class="mi">712</span></div><div class='line' id='LC61'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">STATE</span><span class="p">:</span> <span class="n">Sending</span> <span class="k">data</span></div><div class='line' id='LC62'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">INFO</span><span class="p">:</span> <span class="k">SELECT</span> <span class="n">RUN_LONG_TIME</span> <span class="k">FROM</span> <span class="o">`</span><span class="k">table</span><span class="o">`</span></div><div class='line' id='LC63'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">TIME_MS</span><span class="p">:</span> <span class="mi">711545</span></div><div class='line' id='LC64'><br/></div></pre></div>
          </div>

        </div>
</div>

<p>(queries modified to protect the guilty)</p>
<p>That&#8217;s&#8230;strange.  The RUN_LONG_TIME query seems to be blocking all the other queries on that table.  But it&#8217;s just a SELECT.  I looked at SHOW ENGINE INNODB STATUS and it didn&#8217;t have anything interesting in it.  There were no row or table locks, no UPDATE/INSERT/DELETE, or SELECT FOR UPDATE queries, and innodb_row_lock_waits was not incrementing.  A colleague noted that there were a lot of entries in the MySQL error log, so I looked at that and found (amongst the clutter): </p>
<div id="gist-1740568" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'>83109   production.table Locked - write        High priority write lock</div><div class='line' id='LC2'>83109   production.table Locked - read         Low priority read lock</div><div class='line' id='LC3'><br/></div></pre></div>
          </div>

        </div>
</div>

<p>We were in an outage and the most important thing at this point was to resume selling shoes, dresses, and lingerie, so I  collected as much data as I could for later review, dumped it into Evernote and killed the RUN_LONG_TIME query.  Bam, the queries in &#8220;Waiting for table&#8221; state finished and the site came back online.  Had that not solved the problem, another team member had his finger on the &#8220;fail over to the other server&#8221; button.  </p>
<p>Outage over.  Phew.</p>
<p>But, as my toddler likes to say &#8212; &#8220;What just happened?&#8221;  The RUN_LONG_TIME query was expensive, but it shouldn&#8217;t have been blocking other queries from completing.  First step, I went to a reporting server and tried to reproduce it:</p>
<div id="gist-1740580" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="n">session1</span><span class="o">&gt;</span> <span class="k">SELECT</span> <span class="n">RUN_LONG_TIME</span> <span class="k">FROM</span> <span class="k">table</span><span class="p">;</span></div><div class='line' id='LC2'><span class="n">session2</span><span class="o">&gt;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">table</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">123</span></div><div class='line' id='LC3'><br/></div></pre></div>
          </div>

        </div>
</div>

<p>All copasetic.  What&#8217;s next, chief?</p>
<p>Time to look at some graphs.  Because we have the complete output of SHOW GLOBAL STATUS logging to Graphite every few seconds, it is easy to see what the server is doing at any given time.  (You should do that, too.  It&#8217;s incredibly valuable.)  I started poking around at the charts on the active server and noticed a few oddities:</p>
<p>There was a lot of InnoDB buffer pool activity &#8211; several graphs looked like this:<br />
<a href="http://blog.9minutesnooze.com/analyze-table-replicated-rtfm/innodb_buffer_pool_read_requests/" rel="attachment wp-att-434"><img src="http://s-blog.9minutesnooze.com/wp-content/uploads/2012/02/innodb_buffer_pool_read_requests.jpg" alt="" title="innodb_buffer_pool_read_requests" width="585" height="306" class="aligncenter size-full wp-image-434" /></a></p>
<p>That made sense, as the RUN_LONG_TIME query was sifting through a lot of data.  A lot of data.  A lot. 14 quadrillion rows, in my estimate.</p>
<p>After seeing that pattern across a number of other stats, I started poking through the Com_* variables.  Com_analyze looked like this:<br />
<a href="http://blog.9minutesnooze.com/analyze-table-replicated-rtfm/com_analyze/" rel="attachment wp-att-435"><img src="http://s-blog.9minutesnooze.com/wp-content/uploads/2012/02/com_analyze.jpg" alt="" title="com_analyze" width="586" height="307" class="aligncenter size-full wp-image-435" /></a></p>
<p>What fool ran ANALYZE TABLE a bunch of times at peak traffic on the active database!?  This is where I contracted a case of the RTFMs.  As it turns out, ANALYZE TABLE statements are written to the binary log and thus replicated unless you supply the LOCAL keyword (ANALYZE LOCAL TABLE).</p>
<p>I had not supplied that keyword.</p>
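<p>For illustration, here is a minimal sketch of how the per-table statements could be generated with the LOCAL keyword included.  It is a Python stand-in, not the shell script I actually ran, and the function names and the table names in the example are hypothetical.</p>

```python
def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_analyze_statements(tables, local=True):
    """Return one ANALYZE TABLE statement per (schema, table) pair.

    With local=True the LOCAL keyword is included, so the statement is
    not written to the binary log and therefore does not replicate.
    """
    keyword = "ANALYZE LOCAL TABLE" if local else "ANALYZE TABLE"
    return [
        "{} {}.{};".format(keyword, quote_identifier(schema), quote_identifier(table))
        for schema, table in tables
    ]


if __name__ == "__main__":
    # Hypothetical table list; in practice it would come from
    # information_schema.TABLES filtered on ENGINE = 'InnoDB'.
    example_tables = [("production", "orders"), ("production", "skus")]
    for statement in build_analyze_statements(example_tables):
        print(statement)
```

<p>The generated statements can then be piped to the mysql client on each server separately, which keeps the statistics refresh on the box where it was intended to run.</p>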
<p>As a result of my missing keyword, the ANALYZE TABLE statements replicated to the active server during peak traffic periods while a very long running query was in progress.  Intuitively that still shouldn&#8217;t have caused this behavior.  ANALYZE TABLE takes less than a second on each table.  But that isn&#8217;t the whole story&#8230;</p>
<p>Back to the reporting server to attempt to reproduce the behavior:</p>
<div id="gist-1740615" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="n">session1</span><span class="o">&gt;</span> <span class="k">SELECT</span> <span class="n">RUN_LONG_TIME</span> <span class="k">FROM</span> <span class="k">table</span><span class="p">;</span></div><div class='line' id='LC2'><span class="n">session2</span><span class="o">&gt;</span> <span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="k">table</span><span class="p">;</span></div><div class='line' id='LC3'><span class="n">session3</span><span class="o">&gt;</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="k">table</span> <span class="k">WHERE</span> <span class="n">id</span><span class="o">=</span><span class="mi">123</span><span class="p">;</span> </div><div class='line' id='LC4'><br/></div></pre></div>
          </div>

        </div>
</div>

<p>The statement in session3 hung and was in &#8220;Waiting for table&#8221; status.  Success (at failure)!</p>
<p>What happened is that ANALYZE TABLE flushed the table, which tells MySQL to close all open references to it before allowing access again.  Because a query was already running against the table when ANALYZE TABLE executed, MySQL had to wait for that query to complete before any other thread could reopen the table.  Because that query took so long, everything else hung out in &#8220;Waiting for table&#8221; state.  The <a href="http://dev.mysql.com/doc/refman/5.1/en/general-thread-states.html">documentation</a> on this point sort of explains the issue, though it is a little muddy:</p>
<blockquote><p>The thread got a notification that the underlying structure for a table has changed and it needs to reopen the table to get the new structure. However, to reopen the table, it must wait until all other threads have closed the table in question.</p>
<p>This notification takes place if another thread has used FLUSH TABLES or one of the following statements on the table in question: FLUSH TABLES tbl_name, ALTER TABLE, RENAME TABLE, REPAIR TABLE, ANALYZE TABLE, or OPTIMIZE TABLE.
</p></blockquote>
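<p>In hindsight, the process list already pointed at the culprit: lots of threads in &#8220;Waiting for table&#8221; and one old query that was not waiting.  A rough sketch of that triage, assuming the rows from INFORMATION_SCHEMA.PROCESSLIST have already been fetched as dictionaries (the function name is hypothetical, and this is a heuristic, not a lock analyzer):</p>

```python
WAITING_STATE = "Waiting for table"


def summarize_processlist(rows):
    """Split processlist rows into waiters and candidate blockers.

    rows: list of dicts with the ID, TIME, STATE and INFO columns from
    INFORMATION_SCHEMA.PROCESSLIST. Returns (waiter_count, suspects),
    where suspects are running queries that are not themselves waiting,
    ordered longest-running first.
    """
    waiters = [row for row in rows if row["STATE"] == WAITING_STATE]
    suspects = [
        row for row in rows
        if row["STATE"] != WAITING_STATE and row["INFO"]
    ]
    suspects.sort(key=lambda row: row["TIME"], reverse=True)
    return len(waiters), suspects


if __name__ == "__main__":
    # A trimmed-down version of the process list shown earlier.
    rows = [
        {"ID": 19210373, "TIME": 0, "STATE": "executing", "INFO": "SELECT ... PROCESSLIST"},
        {"ID": 19210713, "TIME": 1, "STATE": WAITING_STATE, "INFO": "SELECT * FROM `table` ..."},
        {"ID": 19203223, "TIME": 607, "STATE": WAITING_STATE, "INFO": "SELECT * FROM `table` ..."},
        {"ID": 19198325, "TIME": 712, "STATE": "Sending data", "INFO": "SELECT RUN_LONG_TIME FROM `table`"},
    ]
    waiter_count, suspects = summarize_processlist(rows)
    print(waiter_count, "waiting; oldest non-waiting query is", suspects[0]["ID"])
```

<p>The oldest non-waiting query is only a suspect.  Here it happened to be the query holding the table open, but the statement that requested the flush had to be found elsewhere.</p>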
<p>I explained the sequence of events and root cause to our team and also publicly flogged myself a bit.  As it turns out, this issue only happened because of the combination of two different events happening simultaneously.  The ANALYZE TABLE alone wouldn&#8217;t have been a big deal had there not also been a very long running query going at the same time.</p>
<p>I have a few take-aways from this:</p>
<ul>
<li>If you make a mistake, fess up.  That&#8217;s a lot better than covering it up and having someone find out about it later.  People understand mistakes.</li>
<li>Mistakes are the best chances for learning.  I can assure you that I will never, ever forget that ANALYZE TABLE writes to the binary log.</li>
<li>Measure everything that you can, always.  Without the output of SHOW GLOBAL STATUS being constantly charted in Graphite, I would have been blind to any abnormalities.</li>
<li>During an outage, resist the temptation to just &#8220;fix it&#8221; before grabbing data to analyze later.  Pressure is on and getting things running is very high priority, but it is even worse if you fix the problem, don&#8217;t know why it occurred, and end up in the same situation again a week later.</li>
<li>Try not to perform seemingly innocuous tasks on production servers at peak times.</li>
<li>RTFM.  Always.  Edge cases abound in complex software.</li>
</ul>
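<p>On the &#8220;measure everything&#8221; point, the idea can be sketched in a few lines.  This is not our actual collector, just an illustration that assumes the SHOW GLOBAL STATUS rows are already in a dictionary and formats them as Graphite plaintext lines (metric path, value, timestamp).  The function name and the metric prefix are hypothetical.</p>

```python
import time


def status_to_graphite_lines(status, prefix, timestamp=None):
    """Format numeric SHOW GLOBAL STATUS values as Graphite plaintext lines.

    status: dict of variable name to value, as strings the way MySQL
    returns them. Non-numeric values such as 'ON' are skipped.
    """
    if timestamp is None:
        timestamp = int(time.time())
    lines = []
    for name in sorted(status):
        raw = status[name]
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                continue  # not a number, nothing to chart
        lines.append("{}.{} {} {}".format(prefix, name.lower(), value, timestamp))
    return lines


if __name__ == "__main__":
    # Hypothetical sample values and metric prefix.
    sample = {"Com_analyze": "250", "Innodb_row_lock_waits": "0", "Slave_running": "ON"}
    for line in status_to_graphite_lines(sample, "mysql.db1"):
        print(line)
```

<p>Run something like this every few seconds and ship the lines to Graphite, and a counter such as Com_analyze jumping at the wrong time of day becomes hard to miss.</p>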
]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/analyze-table-replicated-rtfm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finding Problem Queries @ ideeli Tech Blog</title>
		<link>http://blog.9minutesnooze.com/finding-problem-queries-ideeli-tech-blog/</link>
		<comments>http://blog.9minutesnooze.com/finding-problem-queries-ideeli-tech-blog/#comments</comments>
		<pubDate>Thu, 01 Dec 2011 02:41:26 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[mysql ideeli]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=418</guid>
		<description><![CDATA[I work at ideeli and wrote up a two-parter for our Tech Blog about finding problem queries in MySQL&#8230; Finding Problem Queries, Part 1: The Slow Stuff]]></description>
			<content:encoded><![CDATA[<p>I work at <a href="http://www.ideeli.com">ideeli</a> and wrote up a two-parter for our Tech Blog about finding problem queries in MySQL&#8230;</p>
<p><a href="http://insatiabledemand.ideeli.com/post/13508307991/finding-problem-queries-part-1-the-slow-stuff" title="Finding Problem Queries, Part 1: The Slow Stuff">Finding Problem Queries, Part 1: The Slow Stuff</a></p>
<p><a class="a2a_dd a2a_target addtoany_share_save" href="http://www.addtoany.com/share_save#url=http%3A%2F%2Fblog.9minutesnooze.com%2Ffinding-problem-queries-ideeli-tech-blog%2F&amp;title=Finding%20Problem%20Queries%20%40%20ideeli%20Tech%20Blog" id="wpa2a_2"><img src="http://blog.9minutesnooze.com/wp-content/plugins/add-to-any/share_save_171_16.png" width="171" height="16" alt="Share"/></a></p>]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/finding-problem-queries-ideeli-tech-blog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analyzing HTTP traffic with tcpdump and Percona&#8217;s pt-tcp-model</title>
		<link>http://blog.9minutesnooze.com/analyzing-http-traffic-tcpdump-perconas-pttcpmodel/</link>
		<comments>http://blog.9minutesnooze.com/analyzing-http-traffic-tcpdump-perconas-pttcpmodel/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 02:20:41 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=393</guid>
		<description><![CDATA[I recently ran into an issue where our request throughput was showing very erratic and spikey behavior despite very smooth response times from the application servers. Using Splunk, we analyzed every log that we had: nginx, haproxy, apache, and the &#8230; <a href="http://blog.9minutesnooze.com/analyzing-http-traffic-tcpdump-perconas-pttcpmodel/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I recently ran into an issue where our request throughput was showing very erratic and spikey behavior despite very smooth response times from the application servers.  Using Splunk, we analyzed every log that we had (nginx, haproxy, apache, and the application logs themselves), and they all showed similarly spikey throughput.  Because those tools all log upon request completion, there was no way to determine from the logs themselves whether one tier of the stack in particular was delaying request arrival, or if the spikes were endemic to the traffic we were receiving.</p>
<p>So, we decided to perform some analysis of the raw tcp data on the edge server using a couple of tools.  The first was tcpdump, which is a tool that should be in every SysAdmin&#8217;s arsenal.</p>
<p>Start by grabbing all the traffic on the interface and writing it to a pcap formatted file:</p>
<pre>
# tcpdump -c 200000 -w output.pcap -i any
</pre>
<p>This command will capture 200k packets from any interface and write them to output.pcap, which can be later analyzed with a variety of tools, including tcpdump and wireshark.  </p>
<p>All we care about is the actual packet count and only for &#8220;real&#8221; packets (no SYN/ACKs) on port 80.  Extract this data from the capture we just made:</p>
<pre>
# tcpdump -r output.pcap -s 384 -i any -nnq -tttt \
      'tcp port 80 and (((ip[2:2] - ((ip[0]&#038;0xf)<<2))
     - ((tcp[12]&#038;0xf0)>>2)) != 0)' > port80.txt
</pre>
<p>(I stole this command from the <a href="http://www.percona.com/doc/percona-toolkit/pt-tcp-model.html">pt-tcp-model documentation</a> and honestly have not dived into the details of how the part after &#8216;tcp port 80&#8217; actually works).</p>
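<p>For the curious, the arithmetic in that filter can be checked by hand.  As far as I can tell, ip[2:2] is the total length of the IP packet, the low nibble of ip[0] is the IP header length in 32-bit words, and the high nibble of tcp[12] is the TCP header length in 32-bit words, so the expression only matches packets that carry a payload.  Here is a rough sketch of the same math in shell.  The function name and the sample header values are made up for illustration and do not come from a real capture:</p>

```shell
# Payload bytes = IP total length - IP header length - TCP header length.
# Arguments are hypothetical sample values, not taken from a real capture.
tcp_payload_len() {
  ip_total_len=$1   # ip[2:2]: total length of the IP packet in bytes
  ip_byte0=$2       # ip[0]: version (high nibble), header length in words (low nibble)
  tcp_byte12=$3     # tcp[12]: data offset in words (high nibble)
  ip_hdr=$(( (ip_byte0 & 0xf) << 2 ))       # words * 4 = bytes
  tcp_hdr=$(( (tcp_byte12 & 0xf0) >> 2 ))   # (nibble >> 4) * 4 = bytes
  echo $(( ip_total_len - ip_hdr - tcp_hdr ))
}

tcp_payload_len 77 0x45 0x50   # 20-byte IP header, 20-byte TCP header
tcp_payload_len 40 0x45 0x50   # bare ACK, headers only
```

<p>The first call prints 37 and the second prints 0, so the second packet is the kind the filter throws away.</p>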
<p>Producing some data that looks like this (IPs hidden to protect the innocent):</p>
<pre>
...
2011-10-10 12:49:02.662951 IP 175.x.x.x.ppppp > 10.x.x.x.x: tcp 37
2011-10-10 12:49:02.662958 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 1380
2011-10-10 12:49:02.662963 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 80
2011-10-10 12:49:02.662965 IP 98.x.x.x.ppppp > 10.x.x.x.x: tcp 463
2011-10-10 12:49:02.662968 IP 206.x.x.x.ppppp > 10.x.x.x.x: tcp 516
...
</pre>
<p>With a little bash, we can aggregate this data per second (I&#8217;m sure that there is a much more concise way of doing this, but it gets the job done):</p>
<pre>
# cut -c 12-19 port80.txt | sort | uniq -c | awk '{print $2 " " $1}' > packets_per_sec.txt
</pre>
<p>The output looks like this&#8230;packets per second grouped by second:</p>
<pre>
12:49:02 912
12:49:03 2617
12:49:04 2277
12:49:05 1994
12:49:06 2120
12:49:07 2192
</pre>
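<p>One more concise option, as a sketch: a single awk pass that counts lines by the first 8 characters of the time field.  This assumes the -tttt output format shown earlier, where the second field looks like 12:49:02.662951.  The function name is just for illustration:</p>

```shell
# Count packets per second from tcpdump -tttt text output.
# Assumes field 2 looks like 12:49:02.662951; keeps only HH:MM:SS.
packets_per_sec() {
  awk '{ count[substr($2, 1, 8)]++ }
       END { for (t in count) print t, count[t] }' "$1" | sort
}
```

<p>Called as <code>packets_per_sec port80.txt</code>, it should produce the same two-column output, one line per second.</p>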
<p>Next, it&#8217;s some simple gnuplot magic to chart it all out.  Here&#8217;s the plot file:</p>
<pre>
set title "TCP Port 80 packets/sec"
set terminal aqua enhanced title "TCP Port 80 - packets/sec"
set xdata time
set xlabel "Time (EST)"
set timefmt "%H:%M:%S"
set format x "%H:%M:%S"
set ylabel "per second"
set datafile separator " "

set style line 1 linecolor rgb "#000000" lw 1

plot 'packets_per_sec.txt' using 1:2 title "packets" with line ls 1

set terminal png font "/Library/Fonts/Arial.ttf"
set output "packets_count.png"

replot
</pre>
<p>Run that with<br />
<code><br />
# gnuplot file.plot<br />
</code></p>
<p>If you are on a Mac, AquaTerm will probably pop up and show you the graph.  If not, you can open the packets_count.png file.  What I got looked like this:<br />
<div id="attachment_397" class="wp-caption aligncenter" style="width: 310px"><a href="http://blog.9minutesnooze.com/analyzing-http-traffic-tcpdump-perconas-pttcpmodel/20111012-xbj5fusiww7cn8hyf4budk1s7x/" rel="attachment wp-att-397"><img src="http://s-blog.9minutesnooze.com/wp-content/uploads/2011/10/20111012-xbj5fusiww7cn8hyf4budk1s7x-300x209.jpg" alt="Packets/Sec" title="20111012-xbj5fusiww7cn8hyf4budk1s7x" width="300" height="209" class="size-medium wp-image-397" /></a><p class="wp-caption-text">Packets/Sec</p></div></p>
<p>Ugly, eh?  I&#8217;ve pointed out some problem areas with arrows.  The rate of packet arrivals is incredibly variable &#8211; much more so than I would expect when the website is receiving tens of thousands of requests per minute.  At such coarse granularity, I would expect to see a much smoother line.</p>
<p>That&#8217;s great and all&#8230;clearly there is a problem, but packets != requests.  I want to know how this affects the end user.</p>
<p>Enter <a href="http://www.percona.com/doc/percona-toolkit/pt-tcp-model.html">pt-tcp-model</a> from the <a href="http://www.percona.com/software/percona-toolkit/">Percona Toolkit</a> (formerly <a href="http://www.maatkit.org/">Maatkit</a> by <a href="http://www.xaprb.com/blog/">Baron Schwartz</a>, Chief Performance Architect of <a href="http://percona.com">Percona</a> and author of <a href="http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716">High Performance MySQL</a>).</p>
<p>In short, this tool will take data from tcpdump and convert the data in the packet headers into time sliced buckets with the number of request arrivals, completions, and other summary data.  That can then be charted with gnuplot (or Excel, if you are so inclined) to get some pretty interesting results.  To better understand the tool, I recommend you read the <a href="http://www.percona.com/doc/percona-toolkit/pt-tcp-model.html">documentation</a> and watch Baron&#8217;s presentation about <a href="http://www.percona.tv/percona-live/measuring-scalability-and-performance-with-tcp">Measuring Scalability and Performance with TCP</a>.  </p>
<p>First, following the directions verbatim, extract the data into requests and their response times, and slice that into 1 second intervals.  One thing to note is that if your source data (port80.txt) contains more than about 300k lines, the tool starts to bog down a bit, so I&#8217;d recommend trying to work with smaller samples.</p>
<pre>
# pt-tcp-model port80.txt > requests.txt
# sort -n -k1,1 requests.txt > sorted.txt
# pt-tcp-model --type=requests --run-time=1 sorted.txt > sliced.txt
</pre>
<p>Now, you have sliced.txt which looks something like this:</p>
<pre>
1318265342 18.49   578.542   195   171 0.337054 6.232731 9.455878 0.190278 0.443786 0.337054
1318265343 27.51   527.000   527   526 1.000000 27.511997 27.842869 0.348546 0.583430 1.000000
1318265344 20.75   504.000   504   509 1.000000 20.748874 26.378252 1.166846 0.766312 1.000000
1318265345 23.96   461.000   461   462 1.000000 23.963005 32.181679 3.929070 0.679943 1.000000
1318265346 23.60   438.000   438   423 1.000000 23.595860 26.154968 0.421166 0.939690 1.000000
</pre>
<p>The columns are in the documentation, but in this case, I&#8217;m mostly interested in graphing time vs the number of complete requests arriving (columns 1 and 4).</p>
<p>Here&#8217;s some gnuplot for that:<br />
<code><br />
set title "TCP Port 80 - arrivals/sec"<br />
set terminal aqua enhanced title "TCP Port 80 - arrivals/sec"<br />
set xdata time<br />
set xlabel "Time (UTC)"<br />
set timefmt "%s"<br />
set format x "%H:%M:%S"<br />
set ylabel "per second"<br />
set datafile separator " "<br />
set style line 1 linecolor rgb "#000000"</p>
<p>plot 'sliced.txt' using 1:4 title "arrivals" with line ls 1</p>
<p>set terminal png font "/Library/Fonts/Arial.ttf"<br />
set output "lblockups.png"</p>
<p>replot<br />
</code></p>
<p>And here is the chart that I ended up with:<br />
<div id="attachment_402" class="wp-caption aligncenter" style="width: 310px"><a href="http://blog.9minutesnooze.com/analyzing-http-traffic-tcpdump-perconas-pttcpmodel/screen-shot-2011-10-10-at-2-33-40-pm/" rel="attachment wp-att-402"><img src="http://s-blog.9minutesnooze.com/wp-content/uploads/2011/10/Screen-Shot-2011-10-10-at-2.33.40-PM-300x211.png" alt="Arrivals/sec" title="Arrivals/sec" width="300" height="211" class="size-medium wp-image-402" /></a><p class="wp-caption-text">Arrivals/sec</p></div></p>
<p>Now&#8230;that&#8230;is&#8230;ugly.  When I dig into the raw data, the jagged packet arrival rate is frequently causing 1-2 second delays in the arrival of individual requests at our edge server.  That is before we even get to nginx, so our app server had no hope.  What this means, from a performance standpoint, is that the application stack has to be able to accommodate the huge influx of traffic after each lull.  This presents a scalability nightmare, especially for an e-commerce website heading into the holiday season.</p>
<p>Time to call the data center&#8230;</p>
<p><a class="a2a_dd a2a_target addtoany_share_save" href="http://www.addtoany.com/share_save#url=http%3A%2F%2Fblog.9minutesnooze.com%2Fanalyzing-http-traffic-tcpdump-perconas-pttcpmodel%2F&amp;title=Analyzing%20HTTP%20traffic%20with%20tcpdump%20and%20Percona%26%238217%3Bs%20pt-tcp-model" id="wpa2a_4"><img src="http://blog.9minutesnooze.com/wp-content/plugins/add-to-any/share_save_171_16.png" width="171" height="16" alt="Share"/></a></p>]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/analyzing-http-traffic-tcpdump-perconas-pttcpmodel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RAID 10 your EBS data</title>
		<link>http://blog.9minutesnooze.com/raid-10-ebs-data/</link>
		<comments>http://blog.9minutesnooze.com/raid-10-ebs-data/#comments</comments>
		<pubDate>Sat, 23 Jul 2011 02:45:14 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[ebs]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[syseng]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=356</guid>
		<description><![CDATA[When I spoke at Percona Live (video here) on running an E-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes. At first, this seems counter intuitive. Amazon &#8230; <a href="http://blog.9minutesnooze.com/raid-10-ebs-data/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When I spoke at <a href="http://blog.9minutesnooze.com/percona-live-nyc-2011/">Percona Live</a> (<a href="http://www.percona.tv/percona-live/running-an-e-commerce-database-in-the-cloud">video here</a>) on running an E-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes.  At first, this seems counterintuitive.  Amazon has a robust infrastructure, and EBS volumes run on RAIDed hardware and are replicated within their availability zone.  So, why bother?  Today, I was reminded of just how important it is.  Please note that all my performance statistics are based on direct experience running a MySQL database on a m2.4xlarge instance and not on some random bonnie or orion benchmark.  I have those graphs floating around on my hard drive in glorious 3D and, while interesting, they do not necessarily reflect real-life performance.</p>
<h3>Why?  Part 1. Performance</h3>
<p>Let&#8217;s get to the point.  EBS is cool and very very flexible, but nominal performance is poor and highly variable, with average latencies (svctm in iostat) in the 2-10ms range.  At its heart, EBS is Network Attached Storage and shares bandwidth with your instance NIC.  At best, I see 1.5ms svctm and 10ms await, and at worst&#8230;well, at worst you don&#8217;t need ms precision to measure it.  On top of that, a single EBS volume seems to top out at around 100-150 iops, which is about what one would expect from a single SATA drive.  That&#8217;s fine if you&#8217;re running a low-traffic website with very little disk activity, but once the requests start to come in, things get a little squirrelly.  Add in multi-tenancy and a noisy neighbor can really beat your disk into submission.</p>
<p>So, what&#8217;s a lowly Systems Engineer to do when the iowait time starts to pile up?  Well, it turns out that those IOPs are initially bound by the disk on the backend and not local NIC traffic, so you can use Linux Software RAID to significantly improve the I/O capacity of your disk (but not the latency or variability&#8230;more on this later).  For a performance boost, there is a lot of bad advice on the Internet saying you should RAID 0 your disk (because &#8220;it&#8217;s redundant on the back end&#8221;), but to the discriminating SysEng, that should scream bad idea.</p>
<h3>Why? Part 2. Redundancy</h3>
<p>Right, so EBS is RAIDed and replicated on the back end, so why do I need to worry about redundancy?  That&#8217;s great and all, but with the EBS cool factor comes additional complexity and new and unexpected failure modes.  The first and most obvious was #ec2pocalypse, otherwise known as the Great Reddit Fail of 2011.  If you&#8217;re not aware of what happened (and the details are somewhat irrelevant), a couple months back someone pressed the wrong button at Amazon and a significant percentage of EBS volumes became &#8220;stuck&#8221; showing 100% utilization and no iops.  This failure lasted several days and took out a large number of websites that based their infrastructure on EBS.  Most of the data itself was recovered, but a small percentage of people were SOL.  So much for redundancy.</p>
<p>Enter RAID10.  Yes, it&#8217;s slower than RAID0 because you have to write twice.  Yes, you are bound by the worst performing disk in the array.  But, you can get nearly 1:1 increase in IOPs (up to a point) and gain the ability to recover your data when Amazon drops the ball.</p>
<p>You need proof?  &#8220;Give me an example,&#8221; you say?  Let&#8217;s talk about what happened to me today.  Everything was just peachy all day &#8211; performance was within parameters and then at 3:15PM, all of a sudden the database started having random query pile ups.  Being in EC2, this was not unexpected, but it kept happening.  Traffic was on a decline, but we were expecting big traffic in an hour or so.  So, I started looking at the disk.  We have a 10-drive RAID10 array on our master DB and one of those disks was showing svctm in the 30-100ms range, vs 2-10ms on all the others.  BINGO!</p>
<p>I didn&#8217;t save the actual iostat output, but sar showed this:</p>
<pre>
03:15:01 PM DEV       tps avgqu-sz  await svctm %util
03:35:01 PM dev8-133 7.78     0.11  13.49  2.28  1.77
03:35:01 PM dev8-130 6.54     0.09  14.14  2.27  1.48
03:35:01 PM dev8-149 8.34     0.11  12.62  2.08  1.74
03:35:01 PM dev8-132 7.67     0.10  13.29  1.98  1.52
03:35:01 PM dev8-131 8.66     0.11  12.27  1.91  1.65
03:35:01 PM dev8-147 7.13     0.10  13.77  2.13  1.52
03:35:01 PM dev8-129 7.58     0.08  10.56  1.73  1.31
03:35:01 PM dev8-148 8.47     4.30 506.96 54.77 46.36
03:35:01 PM dev8-146 8.17     0.08   9.28  1.38  1.13
03:35:01 PM dev8-145 6.70     0.26  39.36  6.87  4.60
</pre>
<p>dev8-148 sure looks fishy, eh? (Oh, side note&#8230;to align this data all pretty-like, I used the aptly named <a href="http://aspersa.googlecode.com/svn/html/align.html">align</a>, a great tool from the <a href="http://aspersa.googlecode.com/svn/html/index.html">Aspersa Toolkit</a>)</p>
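<p>To pick the outlier out of a listing like that without eyeballing it, something along these lines can help.  It is only a sketch: it assumes the column layout shown above (svctm in the seventh whitespace-separated field), and the function name and threshold are arbitrary choices for illustration:</p>

```shell
# Print devices whose svctm exceeds a threshold (in ms).
# Assumes sar -d style lines: time, AM/PM, DEV, tps, avgqu-sz, await, svctm, %util.
slow_disks() {
  threshold=$1
  file=$2
  # Skip the header row (third field is the literal DEV), compare svctm numerically.
  awk -v thr="$threshold" '$3 != "DEV" && $7 + 0 > thr { print $3, $7 }' "$file"
}
```

<p>Fed the sar output above with a threshold of 10, this prints only dev8-148 and its svctm of 54.77.</p>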
<p>Had this been a single EBS volume or a RAID0 array, we would have been forced to perform a database failover to a secondary master and redirect the application, which would have interrupted sales briefly during an active time.  Thanks to RAID10, we had options.  Instead of a failover during a period of relatively high traffic, we simply failed out the problem drive.  We were then running on 9 drives with reduced redundancy, but performance immediately recovered and the stalls stopped.  We can replace the drive later and resync the array when traffic is low.</p>
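<p>For reference, failing a member out of an md array and adding a replacement later looks roughly like the following.  This sketch only prints the mdadm commands rather than running them, since they need root and a real array.  The device names are placeholders and the helper name is made up:</p>

```shell
# Print (do not run) the mdadm commands to drop a misbehaving member
# from an array and, later, add a replacement back in.
# Device names are placeholders for illustration.
raid_swap_plan() {
  array=$1
  bad=$2
  replacement=$3
  echo "mdadm $array --fail $bad"
  echo "mdadm $array --remove $bad"
  echo "mdadm $array --add $replacement"
}

raid_swap_plan /dev/md0 /dev/sdh3 /dev/sdh9
```

<p>The --add step is what kicks off the rebuild onto the new disk, so that is the one to save for a quiet period.</p>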
<h3>How?</h3>
<p>First, you need to create and attach &#8220;a bunch&#8221; of volumes to your instance.  How many?  I&#8217;ve seen diminishing returns after 8-10 disks, but your mileage (and instance size) may vary.  Typical RAID10 rules apply here&#8230;you need 2x the total capacity and each disk has to equal 2*(capacity)/(num disks), so if you need 1TB usable and want to use 8 disks, you will need each disk to be 256GB.</p>
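<p>If you would rather not do that math in your head, here is a small sketch of the sizing rule.  The function name is made up, and it rounds up so the array never comes out smaller than requested:</p>

```shell
# Per-disk size in GB for a RAID10 array: 2 * usable / disks, rounded up.
raid10_disk_size_gb() {
  usable_gb=$1
  disks=$2
  echo $(( (2 * usable_gb + disks - 1) / disks ))
}

raid10_disk_size_gb 1024 8   # 1TB usable across 8 disks
```

<p>For 1TB (1024GB) usable across 8 disks this prints 256, matching the example above.</p>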
<p>Here&#8217;s some code to create and attach the volumes.  It creates 8x256GB volumes in the us-east-1a zone and then attaches them to instance i-1a2b3c4d:</p>
<pre>
for x in {1..8}; do \
  ec2-create-volume --size 256 --zone us-east-1a; \
done &gt; /tmp/vols.txt

(i=0; \
for vol in $(awk '{print $2}' /tmp/vols.txt); do \
  i=$(( i + 1 )); \
  ec2-attach-volume $vol -i i-1a2b3c4d -d /dev/sdh${i}; \
done)
</pre>
<p>Then, you need to install Linux Software RAID.  On Debian or Ubuntu:<br />
<code>apt-get install mdadm</code></p>
<p>Then, create a new RAID 10 (-l10) volume from 8 disks (-n8):<br />
<code>mdadm --create -l10 -n8 /dev/md0 /dev/sdh*</code></p>
<p>With any luck, you&#8217;ll get a message saying that the array was started.  You can verify this by looking at /proc/mdstat, where you should see something like this (the numbers in this example are probably off, since I pulled them together from some random machines):</p>
<pre>
cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdh6[5] sdh5[4] sdh4[3] sdh3[2] sdh2[1] sdh1[0]
      1048575872 blocks 64K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  resync = 13.3% (431292736/3221225280) finish=7721.9min speed=6021K/sec
</pre>
<p>Your disk will spend a lot of time and IOPs resyncing, but you can format /dev/md0 and mount it right away.</p>
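<p>Since /proc/mdstat is also where a failed member shows up later, a quick check like this can be handy.  It is a sketch that assumes the status string format shown above, where a healthy array reads [UUUUUU] and a missing member shows as an underscore.  The function name is made up:</p>

```shell
# Report md arrays whose status string (e.g. [UUUUUU]) contains "_",
# which marks a missing or failed member. Reads mdstat-format text,
# defaulting to /proc/mdstat when no file is given.
degraded_arrays() {
  awk '/^md/ { name = $1 }
       $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }' "${1:-/proc/mdstat}"
}
```

<p>Run with no arguments on the database host, it prints the name of each degraded array and nothing at all when everything is healthy.</p>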
<p>This wasn&#8217;t meant as a complete guide to Linux Software RAID &#8211; if you want to know more, check out <a href="http://tldp.org/HOWTO/Software-RAID-HOWTO.html">The Software-RAID HOWTO</a>.</p>
<h3>The Bad</h3>
<p>Ok, so the observant among you will realize that by having 8 or 10 disks in the array, all with the potential to have severe performance degradation like this, I have drastically increased the variability of latency.  Well, you would be right, but&#8230;</p>
<ol>
<li>I can&#8217;t get IOPs any other way in EC2</li>
<li>It is easy to recover from the most common failure mode with this setup</li>
<li>If you care about your data at all, RAID0 (or no RAID) is doing it wrong</li>
</ol>
<p>Remember, kids&#8230;Friends don&#8217;t let friends RAID0.</p>
<p><a class="a2a_dd a2a_target addtoany_share_save" href="http://www.addtoany.com/share_save#url=http%3A%2F%2Fblog.9minutesnooze.com%2Fraid-10-ebs-data%2F&amp;title=RAID%2010%20your%20EBS%20data" id="wpa2a_6"><img src="http://blog.9minutesnooze.com/wp-content/plugins/add-to-any/share_save_171_16.png" width="171" height="16" alt="Share"/></a></p>]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/raid-10-ebs-data/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Percona Live NYC 2011</title>
		<link>http://blog.9minutesnooze.com/percona-live-nyc-2011/</link>
		<comments>http://blog.9minutesnooze.com/percona-live-nyc-2011/#comments</comments>
		<pubDate>Sat, 04 Jun 2011 18:15:05 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=341</guid>
		<description><![CDATA[A couple weeks back, I had the fortune of co-speaking at Percona Live NYC 2011 with Mark Uhrmacher, CTO of ideeli on the subject of running an E-commerce site with MySQL in the cloud. Interestingly, and a sign of the &#8230; <a href="http://blog.9minutesnooze.com/percona-live-nyc-2011/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A couple weeks back, I had the fortune of co-speaking at <a href="http://www.percona.com/live/nyc-2011/">Percona Live NYC 2011</a> with Mark Uhrmacher, CTO of <a href="http://ideeli.com">ideeli</a> on the subject of running an E-commerce site with MySQL in the cloud.  Interestingly, and a sign of the times, this was also the first time that I had ever met Mark, despite having worked for him for close to a year since I telecommute from home.</p>
<p><a href="http://9minutesnooze.com/download/Percona_Live_NYC_2011_MySQL_Cloud.pdf">Here are the slides</a> from that talk.</p>
<p>What I wanted attendees to get out of our talk was that you have to expect and plan for all sorts of failure situations when your database is in the cloud.  Relative to conventional hosting or datacenters, things in the cloud break more frequently and in ways that are out of your control.  AWS gives you the tools to plan for and recover from these failures much more easily than having to put redundant physical servers in multiple geographic locations, but its components also fail more often.</p>
<p>So, here are a few take-aways, mentioned in the slides:</p>
<ul>
<li>RAID 1 or RAID 10 (1+0) your EBS volumes<br/>
<p>Yes, EBS volumes are redundant on the back end, in a data center controlled by Amazon.  However, the great EBS outage of 2011 (#ec2pocalypse) proves that you cannot entrust your data to a single technology that is out of your control.  Had we RAID0&#8217;d our data set, we would have been in much worse shape, because we would have had to completely rebuild many of our data sets from backup.  So, no, you should not RAID0 (which should rightfully be called AID0, since the R is a fallacy).  Yes, you take a performance hit, and you have to deal with lowest-common-denominator performance of the EBS volumes, but the ability to remove a failed or poorly performing EBS volume without losing your data more than makes up for that compromise.  With 10 EBS volumes in a RAID 10 configuration, we max out at around 1200-1500 iops.  Poor performance relative to physical hardware, but it is manageable.</p>
<p>If you care about your data, never ever use RAID0.  You might as well just point it at <tt>/dev/null</tt>, which as we know is <a href="http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale">webscale</a>.  <b>Friends don&#8217;t let friends RAID0</b></p>
</li>
<li>
<p>Make sure your important data lives in multiple availability zones and multiple regions.</p>
<p>
During #ec2pocalypse, several instances were able to be recovered by simply pointing the application at data that already existed in another zone.
</p>
</li>
<li>
<p>Don&#8217;t cross availability zones and regions between your ultimate master and your disaster recovery node.</p>
<p>
If you do (and we were bitten by this), you may end up with out-of-date disaster recovery nodes if your distribution slave is in an affected availability zone.  Keep replication chains short and all in one zone/region, except for the DR node, which should be somewhere outside of the master&#8217;s zone/region.
</p>
</li>
<li>
AWS snapshot backups are awesome.  But they don&#8217;t help if the API is down.  Make sure your data lives in multiple places where you can get at it in an emergency.
</li>
</ul>
<p>Also, I&#8217;d just like to say that Percona Live was a great conference.  There were some incredibly informative talks.  My favorite, by far, was Baron Schwartz&#8217;s discussion on using tcpdump to analyze server performance and predict scalability.  I was honored to speak in front of a crowd where the average person in the room knows far more about MySQL than I do.</p>
<p><a class="a2a_dd a2a_target addtoany_share_save" href="http://www.addtoany.com/share_save#url=http%3A%2F%2Fblog.9minutesnooze.com%2Fpercona-live-nyc-2011%2F&amp;title=Percona%20Live%20NYC%202011" id="wpa2a_8"><img src="http://blog.9minutesnooze.com/wp-content/plugins/add-to-any/share_save_171_16.png" width="171" height="16" alt="Share"/></a></p>]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/percona-live-nyc-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Get Hulu working on Boxee (again)</title>
		<link>http://blog.9minutesnooze.com/hulu-working-boxee/</link>
		<comments>http://blog.9minutesnooze.com/hulu-working-boxee/#comments</comments>
		<pubDate>Sun, 21 Nov 2010 16:51:46 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[boxee]]></category>
		<category><![CDATA[fools]]></category>
		<category><![CDATA[htpc]]></category>
		<category><![CDATA[hulu]]></category>
		<category><![CDATA[idiots]]></category>
		<category><![CDATA[media center]]></category>
		<category><![CDATA[morons]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=309</guid>
		<description><![CDATA[&#60;rant&#62;In a move that defies any reason, the short-sighted bonehead executives at Hulu (or perhaps NBC, but really&#8230;who cares?) decided that they don&#8217;t want advertising dollars from the thousands of Boxee and Boxee Box users, and instead, would prefer that &#8230; <a href="http://blog.9minutesnooze.com/hulu-working-boxee/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>&lt;rant&gt;In a move that defies any reason, the short-sighted bonehead executives at Hulu (or perhaps NBC, but really&#8230;who cares?) decided that they don&#8217;t want advertising dollars from the thousands of Boxee and Boxee Box users, and instead, would prefer that people simply pirate their media instead since it is higher quality, easier to get, and has no advertisements.  Hey, guys at Hulu&#8230;wake up.  It&#8217;s not 2000 anymore.&lt;/rant&gt;  </p>
<p>Anyhow, <a href="http://forums.boxee.tv/showthread.php?t=22613">a very smart fellow</a> over at the Boxee Forums figured out how to work around the issue with a little bit of javascript&#8230;</p>
<p><strong>Disclaimer:</strong> This might make your computer explode, your network implode, and format your nodes.  I&#8217;m not responsible, nor is <a href="http://forums.boxee.tv/member.php?u=47501">jzongker</a> over on the Boxee Forums.</p>
<p>Simply save the following code as hulu.js (<a href="http://9minutesnooze.com/download/hulu.js">download link</a>) and put it in the following location:</p>
<table>
<tr>
<td>Mac</td>
<td>/Applications/Boxee.app/Contents/Resources/Boxee/system/players/flashplayer/hulu.js</td>
</tr>
<tr>
<td>Linux</td>
<td>[Boxeepath]/system/players/flashplayer/hulu.js</td>
</tr>
<tr>
<td>Windows</td>
<td>probably [Boxeepath]\system\players\flashplayer\hulu.js in Program Files</td>
</tr>
<tr>
<td>Boxee Box</td>
<td>Apparently this technique does not work</td>
</tr>
</table>
<hr/>
<pre>boxee.browserWidth=1280;
boxee.browserHeight=720;
boxee.earlyTimers = true;
boxee.enableLog(true);

boxee.onInit = function() {
   browser.setConfigChar('general.useragent.override','Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/540.0 (KHTML, like Gecko) Ubuntu/10.10 Chrome/9.1.0.0 Safari/540.0');
}

if (boxee.getVersion() &lt; 5)
   boxee.renderBrowser = true;

boxee.parseBoxeeTags = false;
boxee.autoChoosePlayer = false;

var current    = 0;
var h_width    = 720;
var h_bottom   = 23;
var started    = false;
var active     = false;
var duration   = false;
var is_paused  = false;
var alt_player = false;

boxee.onBack = function()  { boxee.onEnter(); }
boxee.onLeft = function()  { boxee.onEnter(); }
boxee.onRight = function() { boxee.onEnter(); }
boxee.onUp = function()    { boxee.onEnter(); }
boxee.onDown = function()  { boxee.onEnter(); }

wmodeFix = setInterval(function() {
   boxee.getWidgets().forEach(function(widget) {
      zorder_id = widget.getAttribute("id");
      if (zorder_id == 'banner_c')
         browser.execute('document.getElementById("'+zorder_id+'").style.zIndex = 99999;');
   });
}, 500);

boxee.onDocumentLoaded = function() {
   boxee.setMode(1);
   boxee.showNotification("[B]Press Enter to view full screen[/B]", ".", 500);
}

boxee.onEnter = function()
{
   boxee.setMode(0);

   if (boxee.getVersion() &lt; 5)
      browser.execute('window.scrollTo(0,50);');

   clearInterval(wmodeFix);
   boxee.showNotification("[B]Switching to full screen...[/B]", ".", 2);

   playerTimer = setInterval(function(){
      if (!active) locatePlayer();
      else updateProgress();
   }, 1000)
}

function playerReference()
{
   id = boxee.getActiveWidget().getAttribute('id');
   if (id.length &gt; 0)
      return 'document.'+id+'.';

   else if (alt_player != false)
      return alt_player;

   else
   {
      var locateMe = "(function(){objects=document.getElementsByTagName('embed'); for (var i in objects) { if (objects[i].getAttribute('src') == '"+boxee.getActiveWidget().getAttribute('src')+"') return i; }})()";
      locateMe = browser.execute(locateMe);
      if (locateMe &gt; 0)
      {
         alt_player = 'document.getElementsByTagName("embed")['+locateMe+'].';
         return alt_player;
      }
      else
         return 'document.player.';
   }
}

function updateProgress()
{
   if (!duration)
      duration = parseInt(browser.execute(playerReference()+'getDuration()')) / 1000;

   if (duration)
      boxee.setDuration(duration);

   current = parseInt(browser.execute(playerReference()+'getCurrentTime()')) / 1000;
   if (isNaN(current))
      alt_player = false;

   if (current &gt; 0 &amp;&amp; !started)
      started = true;

   progress = current / duration * 100;
   alert(progress);
   boxee.notifyCurrentTime(current);
   boxee.notifyCurrentProgress(progress);

   if (started &amp;&amp; progress &gt; 99.9)
      boxee.notifyPlaybackEnded();
}

function locatePlayer()
{
   boxee.getWidgets().forEach(function(widget) {
      flashvars = widget.getAttribute("flashvars");
      if (flashvars.indexOf('hulu.com/watch') != -1 &amp;&amp; flashvars.indexOf('bitrate=') != -1 &amp;&amp; !active) {
         active = true;
         boxee.renderBrowser = false;
         var crop = (widget.width - h_width) / 2;
         widget.setCrop(crop, 0, crop, h_bottom);
         boxee.notifyConfigChange(widget.width-(crop*2),widget.height-h_bottom);
         widget.setActive(true);
      }
   });

   if (active)
   {
      boxee.setCanPause(true);
      boxee.setCanSkip(true);
      boxee.setCanSetVolume(true);
   }

   return active;
}

boxee.onPause = function()
{
   is_paused = true;
   browser.execute(playerReference() + 'pauseVideo()')
}

boxee.onPlay = function()
{
   is_paused = false;
   browser.execute(playerReference() + 'resumeVideo()')
}

boxee.onSkip = function ()
{
   if (is_paused) return;
   update = (duration &lt; 3000) ? (current + 60) : (current + 120);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBigSkip = function ()
{
   if (is_paused) return;
   update = (duration &lt; 3000) ? (current + 180) : (current + 360);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBack = function ()
{
   if (is_paused) return;
   update = (duration &lt; 3000) ? (current - 60) : (current - 120);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onBigBack = function ()
{
   if (is_paused) return;
   update = (duration &lt; 3000) ? (current - 180) : (current - 360);
   browser.execute(playerReference() + 'seekVideo('+update+')');
}

boxee.onSetVolume = function(volume)
{
   browser.execute(playerReference() + 'setVolume('+volume/100+')');
}</pre>
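<p>The four skip handlers in the script differ only in step size and direction, so the arithmetic can be sketched as a single helper.  This is a minimal sketch, assuming <em>current</em> and <em>duration</em> are in seconds as they are in updateProgress().  The name seekTarget and the clamping to the range 0 to duration are hypothetical additions for illustration; they are not part of the Boxee API or of the script above.</p>

```javascript
// Hypothetical helper: compute the position (in seconds) to seek to.
// Videos of 3000 seconds (50 minutes) or more use the larger steps,
// matching the thresholds in the handlers above.
// direction is +1 for a forward skip and -1 for a backward skip.
function seekTarget(current, duration, direction, big) {
   var longVideo = duration >= 3000;
   var step;
   if (big) {
      step = longVideo ? 360 : 180;
   } else {
      step = longVideo ? 120 : 60;
   }
   var target = current + direction * step;
   // Assumption added here: keep the result inside the video so that a
   // back-skip near the start does not produce a negative position.
   return Math.max(0, Math.min(target, duration));
}

console.log(seekTarget(30, 1320, -1, false));  // 0
console.log(seekTarget(600, 1320, 1, true));   // 780
console.log(seekTarget(600, 5400, 1, false));  // 720
```

<p>With a helper like this, each handler would reduce to one call such as browser.execute(playerReference() + 'seekVideo(' + seekTarget(current, duration, 1, false) + ')').</p>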
]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/hulu-working-boxee/feed/</wfw:commentRss>
		<slash:comments>41</slash:comments>
		</item>
		<item>
		<title>Don&#8217;t reboot your t1.micro [EC2 epic fail]</title>
		<link>http://blog.9minutesnooze.com/reboot-t1micro-ec2/</link>
		<comments>http://blog.9minutesnooze.com/reboot-t1micro-ec2/#comments</comments>
		<pubDate>Sat, 25 Sep 2010 14:58:49 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=248</guid>
		<description><![CDATA[If you have a t1.micro running an image of Ubuntu 10.04 LTS (Lucid Lynx), don&#8217;t reboot it. When I first wrote about t1.micros a few days ago, I forgot to mention that the first instance that I brought up failed, &#8230; <a href="http://blog.9minutesnooze.com/reboot-t1micro-ec2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>If you have a t1.micro running an image of Ubuntu 10.04 LTS (Lucid Lynx), <strong><em>don&#8217;t reboot it.</em></strong>  When I <a href="/amazon-ec2-micro-instances-t1micro/">first wrote about t1.micros</a> a few days ago, I forgot to mention that the first instance I brought up failed, quite catastrophically, upon reboot.  I didn&#8217;t think much of it at the time because I wasn&#8217;t that far into configuring the machine.  But then, yesterday, Alestic released <a href="http://alestic.com/2010/09/ec2-ami-canonical-t1micro">this note</a> referencing <a href="https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/634102">this bug report</a>, saying that t1.micro instances running Lucid won&#8217;t come back up after a restart and that the bug has been fixed.  It&#8217;s short, so I&#8217;ll let you read it, but basically the cloud-init package was broken and didn&#8217;t properly expose the ephemeral0 device, causing reboots to fail.  Alestic says that all you need to do is run apt-get update &#038;&#038; apt-get upgrade and you&#8217;re golden.</p>
<p>Let me tell you first hand&#8230;that doesn&#8217;t work.  This morning, feeling brave, I decided to test the theory out.  I was running a t1.micro instance using the old Canonical Ubuntu AMI ami-1634de7f, on which I performed an apt-get update and an apt-get upgrade.  I saw that the cloud-init package was upgraded, as expected.  I initiated a restart and my machine never came back.  I sent a reboot request with ec2-reboot-instances and no dice.  Finally, I stopped and started the instance with ec2-stop-instances and ec2-start-instances and still didn&#8217;t have any luck.  If I were smart, I would have tried this on a test instance first, but I decided I should test my configuration documentation anyhow.  Mostly, I wanted to find out whether my instance could reboot at a moment when I had the time and ambition to fix it, rather than have it fail at some inopportune time.</p>
<p>Because everything is EBS-backed, I use an elastic IP, and my documentation is decent, I was able to bring up a new instance, detach the volumes from the old one, attach them to the new one, and get everything running in less than 30 minutes.  At some point when I&#8217;m feeling very ambitious, I intend to put all the configuration in <a href="http://www.puppetlabs.com">Puppet</a> to mostly automate the process of migrating to a new instance type, but I&#8217;m not quite there yet.</p>
<p>If you have a t1.micro instance running Lucid, my recommendation is to spin up a new instance with the most recent AMI (the current AMI ID is available at <a href="http://alestic.com">Alestic</a>) and move everything over instead of bothering with the apt-get upgrade, which clearly did not work in my case.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/reboot-t1micro-ec2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Kilroy Was Here (Part 2)</title>
		<link>http://blog.9minutesnooze.com/kilroy-baby-photo-part-2/</link>
		<comments>http://blog.9minutesnooze.com/kilroy-baby-photo-part-2/#comments</comments>
		<pubDate>Sun, 19 Sep 2010 16:47:09 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=231</guid>
		<description><![CDATA[There&#8217;s not much to say about this shot, except that I think it&#8217;s pretty cute. We moved and our new place has a passthrough between the kitchen and living room. My 16 month old boy has become quite the climber &#8230; <a href="http://blog.9minutesnooze.com/kilroy-baby-photo-part-2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a title="Kilroy II" href="http://www.flickr.com/photos/59244043@N00/4931504133/"><img class="alignleft" style="float: left;" src="http://s-blog.9minutesnooze.com/wp-content/uploads/2010/09/4931504133_1dd26e0171.jpg" alt="Kilroy II" /></a>There&#8217;s not much to say about this shot, except that I think it&#8217;s pretty cute.  We moved and our new place has a passthrough between the kitchen and living room.  My 16-month-old boy has become quite the climber and likes to stand on the back of the couch and watch us as we prepare meals.  I took this shot to complement my previous shot of him in his crib from last December, <a href="/kilroy-baby-strobist-crib/">Kilroy Was Here</a>.  This time, it was candid with no additional lighting.  I used a Canon 50mm f/1.8 lens at f/2.0 and ISO 800 in aperture priority mode.  Because I was inside and he was backlit, I pumped up the exposure compensation to +2/3.</p>
<p>I didn&#8217;t manipulate the photo very much aside from some white balance, contrast adjustments, and sharpening.  Here is the original shot:</p>
<div id="attachment_234" class="wp-caption aligncenter" style="width: 310px"><a href="http://s-blog.9minutesnooze.com/wp-content/uploads/2010/09/IMG_9184_orig.jpg"><img class="size-medium wp-image-234 " title="Kilroy II (Unedited)" src="http://s-blog.9minutesnooze.com/wp-content/uploads/2010/09/IMG_9184_orig-300x199.jpg" alt="Kilroy II (Unedited)" width="300" height="199" /></a><p class="wp-caption-text">Kilroy II (Unedited)</p></div>
]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/kilroy-baby-photo-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Amazon EC2 Micro Instances (t1.micro)</title>
		<link>http://blog.9minutesnooze.com/amazon-ec2-micro-instances-t1micro/</link>
		<comments>http://blog.9minutesnooze.com/amazon-ec2-micro-instances-t1micro/#comments</comments>
		<pubDate>Sat, 18 Sep 2010 15:57:18 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Linux nginx ec2 amazon systems]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=211</guid>
		<description><![CDATA[Amazon recently announced a new instance type &#8211; &#8220;micro instances.&#8221;  They are wicked cheap ($54 + $0.007/hr for a 1-year reserved instance + $.10/GB per month storage) and finally make Amazon accessible to the non-business user with a few low-traffic websites. &#8230; <a href="http://blog.9minutesnooze.com/amazon-ec2-micro-instances-t1micro/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Amazon recently announced a new instance type &#8211; &#8220;micro instances.&#8221;  They are wicked cheap ($54 + $0.007/hr for a 1-year <a href="http://aws.amazon.com/about-aws/whats-new/2009/03/12/amazon-ec2-introduces-reserved-instances/">reserved instance</a> + $0.10/GB per month storage) and finally make Amazon accessible to the non-business user with a few low-traffic websites.  For a typical Ubuntu 10.04 LTS (Lucid) installation with a 15GB root partition, that is only $133.32 a year for your very own server in the cloud!  I&#8217;ve been with <a href="http://dreamhost.com">Dreamhost</a> for a couple of years because they are inexpensive and allow shell access and &#8220;unlimited&#8221; storage*.   However, as a professional Systems Engineer, I&#8217;ve been wanting to move to something that allows me to &#8220;own&#8221; my server.  There are many VPS (Virtual Private Server) providers out there, including <a href="http://www.dreamhost.com">Dreamhost</a> and <a href="http://www.linode.com">Linode</a> (arguably the king of Linux VPS), but they never excited me very much.  I&#8217;ll be honest and admit that I didn&#8217;t spend any time performing a detailed cost and feature comparison of the leading VPS providers.  At my day job I work with a couple hundred EC2 instances, complete with dynamic spinup and spindown for capacity, so EC2 is a comfortable environment for me.  I&#8217;ve been wanting to move into EC2 for a while, but could never justify the cost of an m1.small.  Last week, I dived in and moved all of my hosting over to a t1.micro (t for tiny?) instance.</p>
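<p>For anyone who wants to check the $133.32 figure or plug in a different volume size, here is a minimal sketch of the arithmetic.  The constant and function names are illustrative only, and the rates are simply the prices quoted above, so treat the result as an estimate rather than a quote from Amazon.</p>

```javascript
// Rough annual cost of a 1-year reserved t1.micro, using the prices
// quoted in this post. All names here are illustrative.
var RESERVED_UPFRONT_USD = 54;          // one-time reservation fee
var HOURLY_RATE_USD = 0.007;            // reserved-instance hourly rate
var EBS_RATE_USD_PER_GB_MONTH = 0.10;   // EBS storage rate

function annualCost(rootVolumeGb) {
   var hoursPerYear = 24 * 365;         // 8760, assuming the instance runs all year
   var compute = HOURLY_RATE_USD * hoursPerYear;
   var storage = EBS_RATE_USD_PER_GB_MONTH * rootVolumeGb * 12;
   return RESERVED_UPFRONT_USD + compute + storage;
}

console.log(annualCost(15).toFixed(2)); // 133.32
```

<p>The breakdown for a 15GB root partition is $54 for the reservation, $61.32 for instance hours, and $18 for storage.</p>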
<p>Here is what Amazon <a href="http://aws.amazon.com/ec2/instance-types/">has to say</a> about the new Micro Instances (t1.micro):</p>
<p>&#8220;Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.</p>
<ul>
<li>Micro Instance 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform&#8221;</li>
</ul>
<p>Amazon has a good deal of information in their <a href="http://aws.amazon.com/ec2/faqs/#How_much_compute_power_do_Micro_instances_provide">FAQ</a> and a very detailed view of usage models in their <a href="http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?concepts_micro_instances.html">User Guide</a>.</p>
<p>After a few days with this new instance type, I&#8217;ve noticed CPU time is <strong>very</strong> limited.  CPU bursts can only be very brief and it appears that you are penalized when you exceed your quota.  I run a zenphoto gallery that brought my instance to a crawl when trying to batch resize a bunch of images with ImageMagick.  It was so bad that PHP was unable to return simple pages before the 60-second FastCGI timeout in nginx.  However, with appropriate caching strategies, these machines are more than capable of running a low-traffic website.  Using Apache Bench, I was able to get 1000 requests per minute out of the front page of this blog.  That&#8217;s with the entire application stack residing on a single machine! I will elaborate more on my configuration in a future blog post.</p>
<p>There are a couple of catches with this instance type.  Storage is EBS only, which means you have to pay $0.10/GB per month on top of the cost of the instance time.  Also, like all hosting within Amazon,  the individual instances are completely unreliable.  You need to make sure that you can recreate your nodes from scratch at any point.  For me this means documentation, automation, monitoring, backups, and most of all keeping everything important on a separate EBS volume so it can be moved around easily in the event of an instance failure.  Even though the root partition of t1.micro instances is EBS, it is a lot easier to move data around if you don&#8217;t have to terminate the old instance before bringing up a new one.</p>
<hr />
<p><i>* That&#8217;s unlimited for web use &#8211; not for backups.  They noticed my 300GB of photo backups and very politely asked me to move them to a backup account and even allowed me to keep the data there for a week while I migrated it.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/amazon-ec2-micro-instances-t1micro/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Lake George panorama</title>
		<link>http://blog.9minutesnooze.com/lake-george-panorama/</link>
		<comments>http://blog.9minutesnooze.com/lake-george-panorama/#comments</comments>
		<pubDate>Sat, 11 Sep 2010 20:26:04 +0000</pubDate>
		<dc:creator>Aaron</dc:creator>
				<category><![CDATA[howididit]]></category>
		<category><![CDATA[photography]]></category>
		<category><![CDATA[sky]]></category>
		<category><![CDATA[water]]></category>
		<category><![CDATA[Panorama photoshop adirondacks photography]]></category>

		<guid isPermaLink="false">http://blog.9minutesnooze.com/?p=203</guid>
		<description><![CDATA[It was a nice weekend afternoon, so my wife and I decided to hike up Shelving Rock Mountain (along with 25 lbs of wiggling one year old on my back).  We have hiked there before, and observed a great view, &#8230; <a href="http://blog.9minutesnooze.com/lake-george-panorama/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[
<p style="text-align: center;"><a title="Lake George from Shelving Rock Mountain" href="http://www.flickr.com/photos/59244043@N00/4963294963/"><img class="alignnone" src="http://s-blog.9minutesnooze.com/wp-content/uploads/2010/09/4963294963_0fb30b9935.jpg" alt="Lake George from Shelving Rock Mountain" /></a></p>
<p>It was a nice weekend afternoon, so my wife and I decided to hike up Shelving Rock Mountain (along with 25 lbs of wiggling one-year-old on my back).  We have hiked there before and enjoyed a great view, but missed the even better unobstructed panoramic view of Lake George and the Adirondacks only a short walk away.  Because I was carrying my son on my back, I opted to go light on the camera gear and only had a 50mm lens.  That wasn&#8217;t going to stop me from capturing the view! 27 shots (9&#215;3) and 120 megapixels later, I had the shot you see here.</p>
<div id="attachment_208" class="wp-caption aligncenter" style="width: 310px"><a href="http://s-blog.9minutesnooze.com/wp-content/uploads/2010/09/IMG_9499.jpg"><img class="size-medium wp-image-208  " title="Original Lake George Panorama" src="http://s-blog.9minutesnooze.com/wp-content/uploads/2010/09/IMG_9499-300x51.jpg" alt="" width="300" height="51" /></a><p class="wp-caption-text">original stitched photo</p></div>
<p>I assembled the panorama with Photoshop CS5 and then had to cut it down to 1/4 the original size in order to work with it because my machine was so bogged down.  There were a few edges that didn&#8217;t quite get filled in, and Content-Aware Fill really saved the day here.  In retrospect, while it would have been nice to have had a wide lens up there, I&#8217;m glad I took the 50 because it forced me to think a little creatively to come up with the shot I wanted.</p>
<p>If you want to see some stunning panoramic photography of the Adirondack region, I suggest you check out <a href="http://www.carlheilman.com/">Carl Heilman</a>.  I have one of his works hanging on my living room wall.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.9minutesnooze.com/lake-george-panorama/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

