ANALYZE TABLE is replicated. RTFM.

Sometimes, I make mistakes. It’s true. That can be difficult for us Systems Engineering-types to say, but I try to distance myself from my ego and embrace the mistakes because I often learn the most from them. ..Blah, blah, school of hard knocks, blah, blah…. Usually my mistakes aren’t big enough to cause any visible impact, but this one took the site out for 10 minutes during a period of peak traffic due to a confluence of events.

Doh!

Here is how it went down…

We have an issue where MySQL table statistics are occasionally getting out of whack, usually after a batch operation. This causes bad explain plans, which in turn cause impossibly slow queries. An ANALYZE TABLE (or even SHOW CREATE INDEX) resolves the issue immediately, but I prefer not to get woken up at 4AM by long running query alerts when my family and I are trying to sleep. As a way to work around the issue, we decided to disable InnoDB automatic statistic calculations by setting innodb_stats_auto_update=0. Then, we would run ANALYZE TABLE daily (via cron) during a low traffic period to force MySQL to update table statistics. This creates more stable and predictable query execution plans and reduces the number of places where we have to add explicit USE/FORCE/IGNORE INDEX clauses in the code to work around the query optimizer.

To accomplish this, I wrote a very simple shell script that runs ANALYZE TABLE against all InnoDB tables. After testing it in a non-production environment, it was pushed out to our passive (unused) master database with puppet. Because it was going to execute in the middle of the night for the first time, I decided to run it by hand once on our passive master database just to make sure everything was kosher. Call me a wimp, but I don’t like getting up in the middle of the night because my script took the site down (see comment about family and sleeping). We run our master/master databases in active/passive mode, so testing this on the passive server was a safe move.

Theoretically.

A little background on ANALYZE TABLE on InnoDB tables: All it really does is force a recalculation of table statistics and flush the table. A read lock is held for the duration of the statement, so you want to avoid running this on a customer-facing server that is taking traffic. Because the table is flushed, the next thread that needs to access the table will have to open it again. On our servers with FusionIO cards, it takes about 5 seconds to run ANALYZE TABLE on over 250 tables. All this was fine in Myopia City, because I was running this on the passive server.

Meanwhile, in another zip code, someone was testing out a SELECT against a production data set…

While I was testing my ANALYZE TABLE script, I received an ominous, “yt?” message in Skype.

(Sidebar: In the history of Operations, has an engineer ever received a “yt?” message that led to something awesome? Like, “yt? We’re going to send you a batch of fresh baked cookies every day for the next month.” That never happens.)

So, now I’m in a call. SITE DOWN! OMFGWTFBYOB!!! (No, it wasn’t like that. Really, we’re pretty cool-headed about stuff like this). This outage appeared to be database related. I logged in and checked the process list to see what was running:

mysql> SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE INFO <> 'NULL' ORDER BY TIME;
*************************** 1. row ***************************
           ID: 19210373
         USER: me
         HOST: localhost
           DB: production
      COMMAND: Query
         TIME: 0
        STATE: executing
         INFO: SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE INFO <> 'NULL' ORDER BY TIME
      TIME_MS: 0
*************************** 2. row ***************************
           ID: 19210713
         USER: user
         HOST: 10.x.x.x:59900
           DB: production
      COMMAND: Query
         TIME: 1
        STATE: Waiting for table
         INFO: SELECT * FROM `table` WHERE (`table`.`l_id` IN (3,11,15,7)) AND (`table`.s_id = 1234)
      TIME_MS: 1474
*************************** 3. row ***************************
           ID: 19154978
         USER: user
         HOST: 10.x.x.x:45915
           DB: production
      COMMAND: Query
         TIME: 1
        STATE: Waiting for table
         INFO: SELECT count(*) AS count_all FROM `table` WHERE (`table`.sku_id = 2345)
      TIME_MS: 3737

180 more queries in "Waiting for table" state
*************************** 181. row ***************************
           ID: 19203223
         USER: user
         HOST: 10.x.x.x:34299
           DB: production
      COMMAND: Query
         TIME: 607
        STATE: Waiting for table
         INFO: SELECT * FROM `table` WHERE (`table`.s_id = 4567)
      TIME_MS: 606530
*************************** 182. row ***************************
           ID: 19198325
         USER: user
         HOST: 10.x.x.x:56399
           DB: production
      COMMAND: Query
         TIME: 712
        STATE: Sending data
         INFO: SELECT RUN_LONG_TIME FROM `table`
      TIME_MS: 711545

(queries modified to protect the guilty)

That’s…strange. The RUN_LONG_TIME query seems to be blocking all the other queries on that table. But it’s just a SELECT. I looked at SHOW ENGINE INNODB STATUS and it didn’t have anything interesting in it. There were no row or table locks, no UPDATE/INSERT/DELETE, or SELECT FOR UPDATE queries, and innodb_row_lock_waits was not incrementing. A colleague noted that there were a lot of entries in the MySQL error log, so I looked at that and found (amongst the clutter):

83109 production.table Locked - write High priority write lock
83109 production.table Locked - read Low priority read lock

We were in an outage and the most important thing at this point was to resume selling shoes, dresses, and lingerie, so I collected as much data as I could for later review, dumped it into Evernote and killed the RUN_LONG_TIME query. Bam, the queries in “Waiting for table” state finished and the site came back online. Had that not solved the problem, another team member had his finger on the “fail over to the other server” button.

Outage over. Phew.

But, as my toddler likes to say — “What just happened?” The RUN_LONG_TIME query was expensive, but it shouldn’t have been blocking other queries from completing. First step, I went to a reporting server and tried to reproduce it:

session1> SELECT RUN_LONG_TIME FROM table;
session2> SELECT * FROM table WHERE id = 123;

All copasetic. What’s next, chief?

Time to look at some graphs. Because we have the complete output of SHOW GLOBAL STATUS logging to Graphite every few seconds, it is easy to see what the server is doing at any given time. (You should do that, too. It’s incredibly valuable.) I started poking around at the charts on the active server and noticed a few oddities:

There was a lot of InnoDB buffer pool activity – several graphs looked like this:

That made sense, as the RUN_LONG_TIME query was sifting through a lot of data. A lot of data. A lot. 14 quadrillion rows, in my estimate.

After seeing that pattern across a number of other stats, I started poking through the Com_* variables. Com_analyze looked like this:

What fool ran ANALYZE TABLE a bunch of times at peak traffic on the active database!? This is where I contracted a case of the RTFMs. As it turns out, ANALYZE TABLE statements are written to the binary log and thus replicated unless you supply the LOCAL keyword (ANALYZE LOCAL TABLE).

I had not supplied that keyword.

As a result of my missing keyword, the ANALYZE TABLE statements replicated to the active server during peak traffic periods while a very long running query was in progress. Intuitively that still shouldn’t have caused this behavior. ANALYZE TABLE takes less than a second on each table. But that isn’t the whole story…

Back to the reporting server to attempt to reproduce the behavior:

session1> SELECT RUN_LONG_TIME FROM table;
session2> ANALYZE TABLE table;
session3> SELECT * FROM table WHERE id=123;

The statement in session3 hung and was in “Waiting for table” status. Success (at failure)!

What happened is the ANALYZE TABLE flushed the table, which tells InnoDB to close all references before allowing access again. Because there was a query running while ANALYZE TABLE was executing, MySQL had to wait for the query to complete before allowing access from another thread. Because that query took so long, everything else hung out in “Waiting for table” state. The documentation on this point sort of explains the issue, though it is a little muddy:

The thread got a notification that the underlying structure for a table has changed and it needs to reopen the table to get the new structure. However, to reopen the table, it must wait until all other threads have closed the table in question.

This notification takes place if another thread has used FLUSH TABLES or one of the following statements on the table in question: FLUSH TABLES tbl_name, ALTER TABLE, RENAME TABLE, REPAIR TABLE, ANALYZE TABLE, or OPTIMIZE TABLE.

I explained the sequence of events and root cause to our team and also publicly flogged myself a bit. As it turns out, this issue only happened because of the combination of two different events happening simultaneously. The ANALYZE TABLE alone wouldn’t have been a big deal had there not also been a very long running query going at the same time.
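For illustration, here is a minimal sketch of what a more defensive version of such a cron script could look like. It is not the script from the incident: the helper names, the input format (tab-separated rows, as produced by mysql -N -B), and any threshold passed in are assumptions. One helper emits ANALYZE LOCAL TABLE so the statement stays out of the binary log, and the other counts long-running queries so the caller can skip the run instead of queueing traffic behind a table flush.

```shell
# Sketch only: helper names and input formats are hypothetical.

# Read "schema<TAB>table" rows on stdin and emit one statement per table.
# LOCAL is an alias for NO_WRITE_TO_BINLOG, so the statement is not replicated.
analyze_statements() {
  awk -F'\t' 'NF >= 2 && $1 != "" && $2 != "" {
    printf "ANALYZE LOCAL TABLE `%s`.`%s`;\n", $1, $2
  }'
}

# Read "seconds<TAB>query" rows on stdin and print how many queries
# have been running longer than the limit given as the first argument.
long_running_count() {
  awk -F'\t' -v limit="$1" '$1 + 0 > limit { n++ } END { print n + 0 }'
}

# Possible wiring (assumes a mysql client with credentials in ~/.my.cnf):
#   busy=$(mysql -N -B -e "SELECT TIME, INFO FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND = 'Query'" | long_running_count 60)
#   [ "$busy" -eq 0 ] && mysql -N -B -e "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE = 'InnoDB'" | analyze_statements | mysql
```

The check is only a guard, not a guarantee: a slow query can still start between the check and the ANALYZE, so running during a quiet period remains the safer bet.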

I have a few take-aways from this:

  • If you make a mistake, fess up. That’s a lot better than covering it up and having someone find out about it later. People understand mistakes.
  • Mistakes are the best chances for learning. I can assure you that I will never, ever forget that ANALYZE TABLE writes to the binary log.
  • Measure everything that you can, always. Without the output of SHOW GLOBAL STATUS being constantly charted in Graphite, I would have been blind to any abnormalities.
  • During an outage, resist the temptation to just “fix it” before grabbing data to analyze later. Pressure is on and getting things running is very high priority, but it is even worse if you fix the problem, don’t know why it occurred, and end up in the same situation again a week later.
  • Try not to perform seemingly innocuous tasks on production servers at peak times.
  • RTFM. Always. Edge cases abound in complex software.

RAID 10 your EBS data

When I spoke at Percona Live (video here) on running an E-commerce database in Amazon EC2, I briefly talked about using RAID 10 for additional performance and fault tolerance when using EBS volumes. At first, this seems counterintuitive. Amazon has a robust infrastructure, EBS volumes run on RAIDed hardware, and are mirrored in multiple availability zones. So, why bother? Today, I was reminded of just how important it is. Please note that all my performance statistics are based on direct experience running a MySQL database on a m2.4xlarge instance and not on some random bonnie or orion benchmark. I have those graphs floating around on my hard drive in glorious 3D and, while interesting, they do not necessarily reflect real-life performance.

Why? Part 1. Performance

Let’s get to the point. EBS is cool and very very flexible, but nominal performance is poor and highly variable with average latencies (svctm in iostat) in the 2-10ms range. At its heart, EBS is Network Attached Storage and shares bandwidth with your instance NIC. At best, I see 1.5ms svctm and 10ms await, and at worst…well, at worst you don’t need ms precision to measure it. On top of that, a single EBS volume seems to peak out at around 100-150 iops, which is about what one would expect from a single SATA drive. That’s fine if you’re running a low-traffic website with very little disk activity, but once the requests start to come in, things get a little squirrelly. Add in multi-tenancy and a noisy neighbor can really beat your disk into submission.

So, what’s a lowly Systems Engineer to do when the iowait time starts to pile up? Well, it turns out that those IOPs are initially bound by the disk on the backend and not local NIC traffic, so you can use Linux Software RAID to significantly improve the I/O capacity of your disk (but not the latency or variability…more on this later). For a performance boost, there is a lot of bad advice on the Internet saying you should RAID 0 your disk (because “it’s redundant on the back end”), but to the discriminating SysEng, that should scream bad idea.

Why? Part 2. Redundancy

Right, so EBS is RAIDed and mirrored in multiple availability zones on the back end, so why do I need to worry about redundancy? That’s great and all, but with the EBS cool factor comes additional complexity and new and unexpected failure modes. The first and most obvious was #ec2pocalypse, otherwise known as the Great Reddit Fail of 2011. In case you’re not aware of what happened (and the details are somewhat irrelevant): a couple months back someone pressed the wrong button at Amazon and a significant percentage of EBS volumes became “stuck” showing 100% utilization and no iops. This failure lasted several days and took out a large number of websites that based their infrastructure on EBS. Most of the data itself was recovered, but a small percentage of people were SOL. So much for redundancy.

Enter RAID10. Yes, it’s slower than RAID0 because you have to write twice. Yes, you are bound by the worst performing disk in the array. But, you can get nearly 1:1 increase in IOPs (up to a point) and gain the ability to recover your data when Amazon drops the ball.

You need proof? “Give me an example,” you say? Let’s talk about what happened to me today. Everything was just peachy all day – performance was within parameters and then at 3:15PM, all of a sudden the database started having random query pile ups. Being in EC2, this was not unexpected, but it kept happening. Traffic was on a decline, but we were expecting big traffic in an hour or so. So, I started looking at the disk. We have a 10-drive RAID10 array on our master DB and 1 of those disks was showing svctm in the 30-100ms range, vs 2-10ms on all the others. BINGO!

I didn’t save the actual iostat output, but sar showed this:

03:15:01 PM DEV       tps avgqu-sz  await svctm %util
03:35:01 PM dev8-133 7.78     0.11  13.49  2.28  1.77
03:35:01 PM dev8-130 6.54     0.09  14.14  2.27  1.48
03:35:01 PM dev8-149 8.34     0.11  12.62  2.08  1.74
03:35:01 PM dev8-132 7.67     0.10  13.29  1.98  1.52
03:35:01 PM dev8-131 8.66     0.11  12.27  1.91  1.65
03:35:01 PM dev8-147 7.13     0.10  13.77  2.13  1.52
03:35:01 PM dev8-129 7.58     0.08  10.56  1.73  1.31
03:35:01 PM dev8-148 8.47     4.30 506.96 54.77 46.36
03:35:01 PM dev8-146 8.17     0.08   9.28  1.38  1.13
03:35:01 PM dev8-145 6.70     0.26  39.36  6.87  4.60

dev8-148 sure looks fishy, eh? (Oh, side note…to align this data all pretty-like, I used the aptly named align, a great tool from the Aspersa Toolkit)

Had this been a single volume EBS or RAID0 volume, we would have been forced to perform a database failover to a secondary master and redirect the application, which would have interrupted sales briefly during an active time. Instead, thanks to RAID10, we have options. Instead of a failover during a period of relatively high traffic, we simply failed out the problem drive. Now we were running on 9 drives and with reduced redundancy, but performance immediately recovered and the stalls stopped. We can replace the drive later and resync the array when traffic is low.
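Spotting the bad member can be scripted. The following is a small sketch, assuming rows laid out like the trimmed sar table above (time, AM/PM, DEV, tps, avgqu-sz, await, svctm, %util); real sar -d output has more columns and varies by sysstat version, so the field numbers would need adjusting. The function name and the 30ms limit are made up for the example.

```shell
# Sketch only: assumes the eight-column layout of the trimmed table above.
# Prints every device whose svctm (field 7) exceeds the limit in milliseconds.
slow_devices() {
  awk -v limit="$1" '$3 != "DEV" && $7 + 0 > limit { print $3 }'
}

# Example: slow_devices 30 < sar-output.txt
```

Once a member is flagged, mdadm’s manage mode can fail it out of the array with something along the lines of mdadm /dev/md0 --fail <member> followed by mdadm /dev/md0 --remove <member>, where <member> is a placeholder for the device backing the slow volume.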

How?

First, you need to create and attach “a bunch” of volumes to your instance. How many? I’ve seen diminishing returns after 8-10 disks, but your mileage (and instance size) may vary. Typical RAID10 rules apply here…you need 2x the total capacity and each disk has to equal 2*(capacity)/(num disks), so if you need 1TB usable and want to use 8 disks, you will need each disk to be 256GB.
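The sizing rule is simple enough to sketch as a one-line helper. The function name is hypothetical, sizes are whole GB, 1TB is taken as 1024GB, and the result is rounded up so the array never comes out smaller than requested.

```shell
# Per-disk size in GB for a RAID 10 array: 2 * usable capacity / number of disks.
raid10_disk_gb() {
  usable_gb=$1
  disks=$2
  # Integer ceiling division so a fractional result rounds up.
  echo $(( (2 * usable_gb + disks - 1) / disks ))
}

# Example: raid10_disk_gb 1024 8   (1TB usable across 8 disks)
```

For the 1TB, 8-disk case this gives the 256GB per volume used in the commands that follow.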

Here’s some code to do that. It creates 8x256GB volumes in the us-east-1a zone and then attaches them to instance i-1a2b3c4d

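# Create eight 256GB volumes and save the tool output
for x in {1..8}; do \
  ec2-create-volume --size 256 --zone us-east-1a; \
done > /tmp/vols.txt

# Attach each volume (the ID is column 2 of the output) as /dev/sdh1 through /dev/sdh8
(i=0; \
for vol in $(awk '{print $2}' /tmp/vols.txt); do \
  i=$(( i + 1 )); \
  ec2-attach-volume $vol -i i-1a2b3c4d -d /dev/sdh${i}; \
done)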

Then, you need to install Linux Software RAID. On Debian or Ubuntu:
apt-get install mdadm

Then, create a new RAID 10 (-l10) volume from 8 disks (-n8):
mdadm --create -l10 -n8 /dev/md0 /dev/sdh*

With any luck, you’ll get a message saying that the array was started. You can verify this by looking at /proc/mdstat and you should see something like this (the numbers in this example are probably off. I pulled them together from some random machines)

cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdh6[5] sdh5[4] sdh4[3] sdh3[2] sdh2[1] sdh1[0]
      1048575872 blocks 64K chunks 2 near-copies [6/6] [UUUUUU]
      [==>..................]  resync = 13.3% (431292736/3221225280) finish=7721.9min speed=6021K/sec

Your disk will spend a lot of time and IOPs resyncing, but you can format /dev/md0 and mount it right away.
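If you want to keep an eye on the array afterwards, the status field in /proc/mdstat is easy to check from a script. This is a rough sketch with a made-up function name: it reads mdstat-style text on stdin and prints the name of any array whose status brackets contain an underscore, which is how md marks a missing or failed member.

```shell
# Sketch only: prints md arrays that are running with a missing member.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }
    # Status line looks like "... [6/6] [UUUUUU]"; "_" marks a missing member.
    /\[[U_]+\]/ && name != "" {
      if (match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_") > 0)
        print name
      name = ""
    }
  '
}

# Example: degraded_arrays < /proc/mdstat
```

Empty output means every array has all of its members. Hooking this into whatever monitoring you already have is left as an exercise.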

This wasn’t meant as a complete guide to Linux Software RAID – if you want to know more, check out The Software-RAID HOWTO.

The Bad

Ok, so the observant among you will realize that by having 8 or 10 disks in the array, all with the potential to have severe performance degradation like this, I have drastically increased the variability of latency. Well, you would be right, but…

  1. I can’t get IOPs any other way in EC2
  2. It is easy to recover from the most common failure mode with this setup
  3. If you care about your data at all, RAID0 (or no RAID) is doing it wrong

Remember, kids…Friends don’t let friends RAID0.

Don’t reboot your t1.micro [EC2 epic fail]

If you have a t1.micro running an image of Ubuntu 10.04 LTS (Lucid Lynx), don’t reboot it. When I first wrote about t1.micros a few days ago, I forgot to mention that the first instance that I brought up failed, quite catastrophically, upon reboot. I didn’t actually think much of it at the time because I wasn’t that far into configuring the machine. But then, yesterday, Alestic released this note referencing this bug report saying that there is a bug where t1.micro instances running Lucid won’t come back up after a restart and that the bug has been fixed. It’s short, so I’ll let you read it, but basically the cloud-init package was broken and didn’t properly expose the ephemeral0 device causing reboots to fail. Alestic says that all you need to do is do an apt-get update && apt-get upgrade and you’re golden.

Let me tell you first hand…that doesn’t work. This morning, feeling brave, I decided to test the theory out. I was running a t1.micro instance using the old Canonical Ubuntu AMI ami-1634de7f on which I performed an apt-get update and an apt-get upgrade. I saw that the cloud-init package was upgraded, as expected. I initiated a restart and my machine never came back. I initiated a reboot request with ec2-reboot-instances and no dice. Finally, I stopped the instance and then started it with ec2-stop-instances and ec2-start-instances and I still didn’t have any luck. If I were smart, I would have done this with a test instance first, but I was feeling brave and decided I should test my configuration documentation out anyhow. Mostly, I just wanted to make sure that, if my instance was unable to reboot, it did so at a moment when I had the time and ambition to fix it instead of failing at some inopportune time.

Because everything is EBS backed, using an elastic IP, and my documentation is decent, I was able to detach the volumes from the old instance, attach them to the new instance, and get everything running in less than 30 minutes. At some point when I’m feeling very ambitious, I intend to put all the configuration in Puppet to mostly automate the process of migrating to a new instance type, but I’m not quite there yet.

If you have a t1.micro instance running Lucid, my recommendation is to spin up a new instance with the most recent AMI (the most current AMI ID is available at Alestic) and move everything over instead of bothering to perform the apt-get upgrade, which clearly did not work in my case.

Amazon EC2 Micro Instances (t1.micro)

Amazon recently announced a new instance type – “micro instances.”  They are wicked cheap ($54 + $0.007/hr for a 1-year reserved instance + $.10/GB per month storage) and finally make Amazon accessible to the non-business user with a few low-traffic websites.  For a typical Ubuntu 10.04 LTS (Lucid) installation with a 15GB root partition, that is only $133.32 a year for your very own server in the cloud!  I’ve been with Dreamhost for a couple years because they are inexpensive and allow shell access and “unlimited” storage*.   However, as a professional Systems Engineer, I’ve been wanting to move to something that allowed me to “own” my server.  There are many VPS (Virtual Private Server) providers out there, including Dreamhost and Linode (arguably the king of Linux VPS), but they never excited me very much. I’ll be honest and admit that I didn’t spend any time performing a detailed cost and feature analysis between the leading VPS providers, though. My day job is working with a couple hundred EC2 instances complete with dynamic spinup and spindown for capacity, so EC2 is a comfortable environment for me.  I’ve been wanting to move into EC2 for a while, but could never justify the cost of a m1.small, though.  Last week, I dived in and have moved all of my hosting over to a t1.micro (t for tiny?) instance.
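The yearly figure above is just the reservation fee plus a year of hourly charges plus EBS storage. A quick sketch of the arithmetic, with a hypothetical helper name and a 365-day year (8760 hours) assumed:

```shell
# Yearly cost in dollars: fee + hourly rate * 24 * 365 + GB * price per GB-month * 12.
yearly_cost() {
  awk -v fee="$1" -v hourly="$2" -v gb="$3" -v gb_month="$4" \
    'BEGIN { printf "%.2f\n", fee + hourly * 24 * 365 + gb * gb_month * 12 }'
}

# Example: yearly_cost 54 0.007 15 0.10
```

With the numbers quoted above that works out to 54 + 61.32 + 18.00 = 133.32. Anything else on the bill, such as bandwidth, is not included.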

Here is what Amazon has to say about the new Micro Instances (t1.micro):

“Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

  • Micro Instance 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform”

Amazon has a good deal of information in their FAQ and a very detailed view of usage models in their User Guide.

After a few days with this new instance type, I’ve noticed CPU time is very limited. CPU bursts can only be very brief and it appears that you are penalized when you exceed your quota.  I run a zenphoto gallery that brought my instance to a crawl when trying to batch resize a bunch of images with ImageMagick. It was so bad that php was unable to return simple pages before the 60 second fast cgi timeout on the nginx process.  However, with appropriate caching strategies, these machines are more than capable of running a low traffic website. Using Apache Bench, I was able to get 1000 rpm out of the front page of this blog. That’s with the entire application stack residing on a single machine! I will elaborate more on my configuration in a future blog post.

There are a couple catches with this instance type. Storage is only EBS, which means you have to pay $0.10/GB per month above the cost of the instance time.  Also, like all hosting within Amazon, the individual instances are completely unreliable. You need to make sure that you can recreate your nodes from scratch at any point. For me this means documentation, automation, monitoring, backups, and most of all keeping everything important on a separate EBS volume so it can be moved around easily in the event of an instance failure. Even though the root partition of t1.micro instances is EBS, it is a lot easier to move data around if you don’t have to terminate the old instance before bringing up a new one.


* That’s unlimited for web use – not for backups.  They noticed my 300GB of photo backups and very politely asked me to move them to a backup account and even allowed me to keep the data there for a week while I migrated it.

Mac Mini Media Center/HTPC

Christmas came a little early for me this year and I bought myself a Mac Mini and accessories to be used as a media center for my living room television. I have been wanting to build a home media center/HTPC for some time now and have hemmed and hawed over it. My basic requirements were that it would run something like Boxee, be easily administered remotely via SSH (read: UNIX), support Netflix and Hulu, and be usable by my non-technical (read: doesn’t work with computers for a living) wife. I basically wanted a set-it and forget-it machine that could be run from the couch with a simple remote control. Windows was not invited to my party, though for others it would be completely capable of this task.

Mac Mini in Media Cabinet

Originally, my plan was to purchase one of the many inexpensive ($200-300) media center PCs available or something like the Dell Zino, which along with Linux, seemed like a perfect solution. However, after much research and with the advice of some helpful coworkers, I learned that Netflix doesn’t work under Linux. There may be some hacks out there to get it working, but honestly my day job is configuring and administering clusters of Linux machines and I really didn’t feel like giving myself headaches for my television. So, after much deliberation I mentally justified the Apple tax and ponied up for a Mac Mini (with Snow Leopard). Ok, it wasn’t that hard, given that I already own a 2006 15″ MacBook Pro, a 24″ iMac, and use a 15″ MacBook Pro at work. Somehow, I turned into an Apple Fanboy over the last couple years.

The configuration described below is the one I settled on based on my existing TV and sound system (both low-end, but adequate for me). I’ve provided a bunch of links down below that go into more elaborate setups, including using the Mini as an over-the-air HD tuner and DVR.

Hardware:

  • Mac Mini (MC238LL) [$599.00]
  • The Mac Mini is a late-2009 model, with a 2.26 GHz Intel Core 2 Duo processor, 160GB Internal HD, and 2GB of RAM…the most basic model they offer. Because I have a corporateperks.com account, I was able to get this for $563.00. Amazon has them for a little cheaper [$574.95 and no sales tax] than the Apple Store.

  • Apple Remote (MC377LL) [$19.00]
  • It’s all aluminum and very sexy. Because of the corporateperks account, I was able to get this for $17.00.

  • Western Digital Elements 1.5TB External HD (WDBAAU0015HBK) [$99.99 via Holiday sale, regularly $119.99]

Cables/Adapters:

I stupidly purchased a MiniDVI->DVI adapter so I could plug my monitor into the Mac Mini for setup, but the Mac Mini comes with this adapter already. It was only a few bucks, but still…

Software:

Remote Control:

I chose to use the standard Apple Remote control since my primary usage of this machine will be to run Boxee and it doesn’t require a lot of functionality to use to the fullest. The Apple Remote has 7 buttons – up, down, left, right, center, Menu, and Play/Pause. If you want to completely live in the Boxee (or Plex) world, this is all you need, really. However, I wanted to be able to start a few applications, have a virtual mouse, and perform a few other system-related tasks without the assistance of SSH or VNC, so I installed Remote Buddy. It extends the functionality of the remote – you just hold the menu button for a second or so and a separate menu pops up that allows you to perform all sorts of tasks (called “Behaviours” in the Remote Buddy world…yes, they are British) such as opening applications, rebooting the system, adjusting the volume, and even operating the mouse cursor with the remote control. These functions are very helpful particularly when Boxee crashes (which it seems to do quite frequently). Remote Buddy has built in actions for many common media center applications, including Boxee, Plex, VLC, and even Firefox.

The Apple Remote supposedly doesn’t work very well with Snow Leopard, according to various reports and the Plex startup screen. Not wanting to learn the hard way, I just installed the recommended Candelair IR driver. This replaces the OSX IR Receiver driver and seems to work just great. I believe this was addressed in a Snow Leopard Service Pack (10.6.2), but I haven’t bothered testing since the Candelair driver works well, is free, and is made by the same people who make Remote Buddy.

Remote Access:

Screen Sharing Settings

For remote access, I use a combination of SSH and VNC. Because I have a weak wireless 802.11g connection in the living room, the built-in Apple Screen Sharing.app wasn’t connecting properly to the Mac Mini. After a good deal of troubleshooting, I came to the conclusion that it was a client-based problem and not the fault of the built-in VNC server on the Mac Mini. Apple Screen Sharing is simply an extension of the VNC protocol, so I tried a number of VNC applications – JollysFastVNC was the best and even supported Bonjour. I had to dial down the Color Depth to 16 bit for things to work, but now it runs reasonably smoothly. To get JollysFastVNC to pass along all special characters (such as Cmd-Tab), I had to go to System Preferences->Universal Access and check “Enable access for assistive devices” on the Mini. On the client, I had to set Keyboard input to Immersive in JollysFastVNC. Now, VNCing to the Mini is mostly seamless, though still kind of slow due to my poor wireless signal.

To enable Screen Sharing, SSH, and File Sharing, go into Apple Menu->System Preferences->Sharing and check off Screen Sharing, File Sharing, and Remote login. Make sure to apply the permissions most relevant to your setting. It’s conveniences like this that led me down the Mac Mini path versus a Linux-based solution.

Storage:

The 160GB local disk included with the Mac Mini was simply not enough for a media center storing 720p and 1080p HD content. I looked into several options including the super-slick miniStack which is the same form factor as the Mini, but ultimately I decided that the form factor and faster hard drive was just not important enough to justify the extra expense. A co-worker sent me a deal at Dell.com for a bare-bones Western Digital Elements 1.5TB USB drive for $99 and I jumped on it. It is quiet and fast enough for me. Additionally, it doesn’t have any lights on it, so it is stealthy in my media cabinet.

Configuration:

Really, there was very little configuration involved. The Mini correctly identified my video resolution and looked great on the TV. All I had to do to get things working was plug everything in and install the software. To make sure everything started on boot, I went into the System Preferences->Accounts->Login Items pane and added Remote Buddy and Boxee as Login items. Now, when I restart the computer, everything comes up ready to go. I also enabled Automatic login in the Accounts->Login Options preferences pane.

By default, the new Play/Pause button and the Select (middle) button on the new Apple Remote seem to have the exact same behavior. This was annoying in Boxee because I had to click twice to pause running media. After whining about it (and originally including it in the “Problems” section below), I discovered that Remote Buddy allows very granular control over the function of every button. I went into Remote Buddy->Preferences->Mapping and under “Behaviours”->Boxee, I set Play/Pause to the Pause action. This had the effect of working as both a Pause and an Un-pause button when watching media in Boxee. Problem solved!

Problems:

  • Fast Forward/Rewind Media – It’s difficult to fast forward or rewind media. Local media skips ahead at least 1 minute (or 10 minutes if you use the second of the two fast forward options in Boxee), but to smoothly fast-forward or skip ahead only a few seconds doesn’t seem to work very well. In streaming environments, such as Hulu and Netflix, fast-forwarding and rewinding is unreliable at best and just plain doesn’t work sometimes.
  • Boxee crashes. A lot. Mostly when using Pandora. It can be kind of annoying, but on the other hand, Boxee is free and still in Alpha. The beta is supposed to be released to the public on January 7, 2010 and I am anxious to give it a try.

Resources:
