All datacenters fail. Amazon EC2 is down (edit: still down 40 hours later!!), for the 2nd time in 2 months. Rackspace, the most reliable (and most expensive), has a long history of going down. 365 Main — perhaps the most prestigious host in the Bay Area — went down in a fascinating story that I sometimes use as a brain teaser during interviews. Basically, it doesn’t matter how good your datacenter is: it’s going to go down.
And if it doesn’t, your hard drive is going to crash. Or your transformer will fry. Or some crazy process will peg all the CPUs. Or whatever. It doesn’t matter how good your hardware or software is, how fast your ops team can respond to pages, or how totally invulnerable your cloud hosting provider claims to be: it will all fail you one day, and your only response will be to frantically do nothing while waiting for someone else to fix it.
Unless, of course, you have so many datacenters that you simply don’t care.
A couple weeks back I mentioned that we put an unbelievable amount of effort into creating a secure hosting environment. A part of that that’s often overlooked is the importance of realtime redundancy. Not having spare parts on hand, and not having backups readily available, but having multiple live systems replicating in realtime at all times, such that if any of them fail, there are still plenty left over to pick up the slack. And a subtle point that is even more often overlooked is that the minimum number of datacenters certainly isn’t one, and surprisingly isn’t two, but is actually three. Here’s why:
- One datacenter obviously isn’t enough, because if it goes down, you’re offline. Furthermore, if it goes down forever (or sufficiently forever to cripple your business), then that’s especially bad. Backups are ok, but you’ll still lose several hours of data — and if you’re a financial service like us, that can mean a lot of money you can’t account for. So obviously, one isn’t enough.
- But two isn’t enough either. Sure, having two datacenters operating in parallel gives you confidence that if one goes down, the other will still be there: it can provide uninterrupted service without any data loss, and that’s great. But if one goes down, then you’re left with only one datacenter remaining — and we already agreed that was completely unacceptable. Remember: your servers absolutely will go down, plan on it. And there’s no rule saying that just because one of your datacenters died, the other won’t. So if you can’t ever reasonably operate with one datacenter, then you just as reasonably can’t operate with only two.
- That’s why three is the magic number. With three datacenters, you can safely lose any one, and still have two left over. Which means even if you lose one entire datacenter — and you will, with depressing regularity — you’ll still have two more, which is the minimum required to maintain a live service. (And in case you’re wondering, Expensify is programmed to go into a “safe mode” and voluntarily take itself offline in the event that it loses two datacenters — we figure if there’s some sort of global catastrophic event going on, odds are people aren’t filing expense reports.)
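The availability rule above can be written down in a few lines. This is a hypothetical sketch of the "safe mode" logic, not Expensify's actual code: the function names and the threshold constant are illustrative assumptions.

```python
# Minimal sketch of the datacenter-count rule described above (illustrative
# only): stay live with two or more reachable datacenters, and voluntarily
# drop into safe mode below that, since one datacenter is never acceptable.

MIN_LIVE = 2  # minimum datacenters required to keep serving traffic

def service_state(reachable_datacenters):
    """Return 'live' or 'safe_mode' given the set of reachable datacenters."""
    if len(reachable_datacenters) >= MIN_LIVE:
        return "live"
    return "safe_mode"

print(service_state({"dc1", "dc2", "dc3"}))  # live: all three up
print(service_state({"dc1", "dc2"}))         # live: lost one, two remain
print(service_state({"dc1"}))                # safe_mode: take ourselves offline
```

Starting with three datacenters is exactly what makes the middle case possible: losing one still leaves you at the two-datacenter minimum.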
Granted, I can understand why people don’t do this. Most startups follow some variation of the following thought process (assuming they get that far):
- Hey, I’ve got this great idea, let me whip it up on my laptop with a local LAMP stack!
- That’s awesome, let me put it on some co-located / dedicated / virtual server.
- Whoa, it’s starting to take off, let’s get a bigger box.
- Ok, it won’t fit on the biggest box, let’s get a few webservers talking to a giant MySQL server.
- One MySQL server isn’t enough, let’s get a ton of memcache’d webservers talking to a few MySQL slaves, coming off of one giant MySQL master.
- Oh shit, my datacenter just went down. Whose job was it to get more datacenters?
- What do you mean my dozen servers all need to be in the same datacenter on a GigE network?
- Ok, that sucked, let’s go to Rackspace.
- Wait, now Rackspace is down too?? Shit shit shit
- sleep( rand()%12 months )
- goto 9
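The master/slave step in that progression boils down to simple query routing: writes go to the one master, reads fan out across the slaves. Here's a minimal sketch under assumed hostnames (the names and the round-robin policy are mine, purely for illustration):

```python
# Hypothetical sketch of the "memcached webservers -> MySQL slaves -> one
# master" step above: route writes to the master, spread reads round-robin
# across the slaves. Hostnames are illustrative assumptions.
import itertools

MASTER = "mysql-master.internal"
SLAVES = ["mysql-slave-1.internal", "mysql-slave-2.internal"]
_slave_cycle = itertools.cycle(SLAVES)

def pick_host(query):
    """Return the hostname that should run this SQL query."""
    verb = query.lstrip().split()[0].upper()
    is_write = verb in ("INSERT", "UPDATE", "DELETE", "REPLACE")
    return MASTER if is_write else next(_slave_cycle)

pick_host("SELECT total FROM reports")      # one of the slaves
pick_host("UPDATE reports SET total = 10")  # always the master
```

Note the single-master bottleneck this sketch makes visible: every write funnels through one box in one datacenter, which is exactly why the whole stack goes dark when that datacenter does.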
And frankly, I think it’s a reasonable plan. Most startups don’t honestly require serious availability or durability because most startups don’t deal with hyper-sensitive financial data like we do. However, as much as I’d like to claim we did all this out of some commitment to engineering excellence, I’ve got to admit: we did it because we had no choice. Our thought process was:
- Hey, I’ve got this great idea, let me go ask a bank if we can do it!
- They said no. Hm…
- Repeat steps 1-2 about 20 times
- Ok, I’ve found someone who will let us do it, but we’ve got totally insane security and uptime requirements
- Research… research…
- Ok, MySQL doesn’t really do what we need
- Research… research…
- Hell, nothing does what we need, at least at startup prices.
- Well, luckily we’re P2P experts so let’s just write a custom PCI-compliant WAN-optimized distributed transaction layer.
- Wow, that was hard, but I’m glad we’re done. Let’s launch!
- Things are taking off, let’s upgrade all our datacenters one by one… cool! No downtime!
- Oh crap, that hard drive died in the middle of the night. But cool, no downtime!
- Our datacenters are too far apart and the latency is affecting replication times; let’s replace them one by one with lower-latency alternatives. Cool, no downtime!
- One of our datacenters seems to be flaking out every once in a while… screw it, let’s just replace it. Cool, no downtime!
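The reason each of those replacements could happen with no downtime is the core property of majority-acknowledged replication: a write counts as committed only once most datacenters confirm it, so any single datacenter can drop out (or be swapped) without losing data or availability. A toy sketch of that rule, in the spirit of the transaction layer described above but not its actual implementation:

```python
# Illustrative sketch of majority-commit replication: a write is durable
# only when a majority of datacenters acknowledge it. With 3 datacenters,
# 2 acks suffice, so losing any one datacenter never blocks commits.

def is_committed(acks, total_datacenters):
    """Return True once a strict majority of datacenters have acknowledged."""
    return acks > total_datacenters // 2

print(is_committed(acks=2, total_datacenters=3))  # True: majority reached
print(is_committed(acks=1, total_datacenters=3))  # False: keep waiting
```

The same arithmetic explains the rolling replacements in the list above: with three datacenters, taking one out for an upgrade still leaves a two-of-three majority able to commit.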
And so on. Not to say that we don’t have problems, as we obviously do. But I can’t overstate how comforting such a highly-redundant system is, and I hope it provides a degree of comfort to you.