All datacenters fail. Amazon EC2 is down (edit: still down 40 hours later!!), for the 2nd time in 2 months. Rackspace, the most reliable (and most expensive), has a long history of going down. 365 Main — perhaps the most prestigious host in the Bay Area — went down in a fascinating story that I sometimes use as a brain teaser during interviews. Basically, it doesn’t matter how good your datacenter is: it’s going to go down.
And if it doesn’t, your hard drive is going to crash. Or your transformer will fry. Or some crazy process will peg all the CPUs. Or whatever. It doesn’t matter how good your hardware or software is, how fast your ops team can respond to pages, or how totally invulnerable your cloud hosting provider claims to be: it will all fail you one day, and your only response will be to frantically do nothing while waiting for someone else to fix it.
Unless, of course, you have so many datacenters that you simply don’t care.
A couple weeks back I mentioned that we put an unbelievable amount of effort into creating a secure hosting environment. A part of that that’s often overlooked is the importance of realtime redundancy. Not having spare parts on hand, and not having backups readily available, but having multiple live systems replicating in realtime at all times, such that if any of them fail, there are still plenty left over to pick up the slack. And a subtle point that is even more often overlooked is that the minimum number of datacenters certainly isn’t one, and surprisingly isn’t two, but is actually three. Here’s why:
- One datacenter obviously isn’t enough, because if it goes down, you’re offline. Furthermore, if it goes down forever (or sufficiently forever to cripple your business), then that’s especially bad. Backups are ok, but you’ll still lose several hours of data — and if you’re a financial service like us, that can mean a lot of money that you can no longer account for. So obviously, one isn’t enough.
- But two isn’t enough either. Sure, having two datacenters operating in parallel gives you confidence that if one goes down, the other will still be there: it can provide uninterrupted service without any data loss, and that’s great. But if one goes down, then you’re left with only one datacenter remaining — and we already agreed that was completely unacceptable. Remember: your servers absolutely will go down, plan on it. And there’s no rule saying that just because one of your datacenters died, the other won’t. So if you can’t ever reasonably operate with one datacenter, then you just as reasonably can’t operate with only two.
- That’s why three is the magic number. With three datacenters, you can safely lose any one, and still have two left over. Which means even if you lose one entire datacenter — and you will, with depressing regularity — you’ll still have two more, which is the minimum required to maintain a live service. (And in case you’re wondering, Expensify is programmed to go into a “safe mode” and voluntarily take itself offline in the event that it loses two datacenters — we figure if there’s some sort of global catastrophic event going on, odds are people aren’t filing expense reports.)
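To put some rough numbers behind that intuition, here’s a back-of-the-envelope sketch. The 99% per-datacenter uptime figure is purely an illustrative assumption, and it treats failures as independent, which they never entirely are:

```python
from math import comb

# Rough availability math behind "three is the magic number".
# Illustrative assumption: each datacenter is independently up 99% of the time.

def availability(n, k, per_dc_uptime=0.99):
    """Probability that at least k of n independent datacenters are up."""
    q = 1 - per_dc_uptime
    return sum(comb(n, i) * per_dc_uptime**i * q**(n - i) for i in range(k, n + 1))

# At least one live replica avoids data loss; at least two keep the service
# live *and* replicated at the same time.
print(f"1 of 1 up:          {availability(1, 1):.4%}")  # ~99.00%: dark ~3.7 days a year
print(f"2 of 2 up:          {availability(2, 2):.4%}")  # ~98.01%: fewer than two copies ~7 days a year
print(f"at least 2 of 3 up: {availability(3, 2):.4%}")  # ~99.97%: live and replicated almost always
```

The point of the last line: with three datacenters, losing any one still leaves a live, fully replicated service, and (under these toy assumptions) the odds of dropping below two live copies fall from roughly 2% of the year to about 0.03%.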
Granted, I can understand why people don’t do this. Most startups follow some variation of the following thought process (assuming they get that far):
- Hey, I’ve got this great idea, let me whip it up on my laptop with a local LAMP stack!
- That’s awesome, let me put it on some co-located / dedicated / virtual server.
- Whoa, it’s starting to take off, let’s get a bigger box.
- Ok, it won’t fit on the biggest box, let’s get a few webservers talking to a giant MySQL server.
- One MySQL server isn’t enough, let’s get a ton of memcache’d webservers talking to a few MySQL slaves, coming off of one giant MySQL master.
- Oh shit, my datacenter just went down. Whose job was it to get more datacenters?
- What do you mean my dozen servers all need to be in the same datacenter on a GigE network?
- Ok, that sucked, let’s go to Rackspace.
- Wait, now Rackspace is down too?? Shit shit shit
- sleep( rand()%12 months )
- goto 9
And frankly, I think it’s a reasonable plan. Most startups don’t honestly require serious availability or durability because most startups don’t deal with hyper-sensitive financial data like us. However, as much as I’d like to claim we did all this out of some commitment to engineering excellence, I’ve got to admit: we did it because we had no choice. Our thought process was:
- Hey, I’ve got this great idea, let me go ask a bank if we can do it!
- They said no. Hm…
- Repeat steps 1-2 about 20 times
- Ok, I’ve found someone who will let us do it, but we’ve got totally insane security and uptime requirements
- Research… research…
- Ok, MySQL doesn’t really do what we need
- Research… research…
- Hell, nothing does what we need, at least at startup prices.
- Well, luckily we’re P2P experts so let’s just write a custom PCI-compliant WAN-optimized distributed transaction layer.
- Wow, that was hard, but I’m glad we’re done. Let’s launch!
- Things are taking off, let’s upgrade all our datacenters one by one… cool! No downtime!
- Oh crap, that hard drive died in the middle of the night. But cool, no downtime!
- Our datacenters are too far apart and the latency is affecting replication times; let’s replace them one by one with lower-latency alternatives. Cool, no downtime!
- One of our datacenters seems to be flaking out every once in a while… screw it, let’s just replace it. Cool, no downtime!
And so on. Not to say that we don’t have problems, as we obviously do. But I can’t overstate how comforting such a highly-redundant system is, and I hope it provides a degree of comfort to you.
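For the curious, here’s a highly simplified sketch of one way that kind of replication can work: a write only counts once a majority of datacenters have durably acknowledged it. This is not our actual transaction layer; the class, datacenter names, and two-of-three quorum below are illustrative assumptions only.

```python
# Toy illustration of quorum-based commits across three datacenters.
# NOT the real transaction layer; just the core idea that a write only
# "counts" once a majority (2 of 3) of replicas durably acknowledge it.

class Datacenter:
    def __init__(self, name, online=True):
        self.name = name          # hypothetical datacenter label
        self.online = online
        self.log = []             # stand-in for a durable transaction log

    def append(self, txn):
        """Durably record a transaction; fails if the datacenter is unreachable."""
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.log.append(txn)
        return True


def commit(txn, datacenters, quorum=2):
    """Commit succeeds only if at least `quorum` replicas acknowledge the write."""
    acks = 0
    for dc in datacenters:
        try:
            if dc.append(txn):
                acks += 1
        except ConnectionError:
            pass  # one dead datacenter must not block the commit
    if acks >= quorum:
        return "committed"
    return "safe mode"  # fewer than `quorum` replicas left: stop taking writes


dcs = [Datacenter("dc-west"), Datacenter("dc-central"), Datacenter("dc-east")]
dcs[1].online = False                     # lose a datacenter mid-flight...
print(commit({"expense": 42.00}, dcs))    # ...and the write still commits
dcs[2].online = False
print(commit({"expense": 13.37}, dcs))    # down to one replica: safe mode
```

The “safe mode” branch is the same behavior described above: with fewer than two live replicas, the right move is to stop accepting writes rather than risk losing track of money.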
Three isn’t the right answer either. Each separate “datacenter” you add asymptotically brings you closer to a 100% probability of being available, but you will never reach it. Consequently, it becomes a business (money, customers) and technical (complexity) question to pick the right answer.
You also haven’t mentioned that running the same software in all data centers is also a major risk factor. For example, there may be a particularly formatted transaction that just happens to crash your software, database or operating system. And it would do the same to all data centers. Or crashes the Cisco switches/routers that most data centers use. You could of course independently develop the software multiple times, sandbox components, require that data centers use different manufacturers, etc.
At the end of the day there will always be a weakest spot. It can be moved around, but it will always be there. What is noticeable are the companies that have no plan to communicate or react when the worst does happen.
Roger – Totally agree that 100% uptime is unachievable, and that the “right answer” depends on your context.
I think one datacenter is completely fine for most businesses who are ok restoring from backups and losing several hours of data (and frankly, most are). But real financial systems can’t “restore bank accounts from backup” — once the money moves, it’s moved, and if you don’t remember where you put it, things get really bad. Accordingly, if the prospect of a few hours of data loss is unacceptable, then the only real solution is to commit in realtime to an offsite location: so a minimum of two datacenters.
Similarly, most businesses can handle hours of downtime without serious catastrophic damage to the business; even most financial systems don’t really require more than two datacenters, because if one goes down, the survivor can just go into a “safe mode” until the failed one comes back up. But if you’re a realtime financial system like a payment card processor (which is what we initially designed for) then two also isn’t enough, because you need to keep running even when you inevitably lose a datacenter. So I feel if you’re running a payment processing system, even two isn’t enough: you really need three.
Though you’re right, “need” is all relative — plenty of payment processors exist with two or even one datacenter, and they get by. They probably have a much larger staff, 24/7 NOCs, and skip cloud hosting entirely to control the physical configuration of the hardware — spending a lot more in the process. And even then, sometimes they go down, sometimes they lose data (or reassemble it from partners’ systems that didn’t fail), etc.
So realistically, if you were going to write your own expense report system from scratch, our infrastructure would probably be overkill.
But since we already have it, boy do we sleep soundly.
You mean you actually care about using real uptime using multiple high availability systems? Holy crap! I’m not sure if I want to give you a hug or weep with joy.
I can’t really name the places I’ve worked where I’ve seen this, but having single points of failure everywhere, including in critical systems, seems like the normal situation. I can’t count the number of times I’ve been woken up at odd hours of the night because just one of those critical points went down.
Love your 100% perfectly clean site (and also the service idea)! Just a note about data center redundancy (more for your readers, as you probably already know what I’m about to say):
“But two isn’t enough either. … if one goes down, the other will still be there … But if one goes down, then you’re left with only one datacenter remaining — and we already agreed that was completely unacceptable. Remember: your servers absolutely will go down, plan on it. And there’s no rule saying that just because one of your datacenters died, the other won’t.”
Well, going further with this exact same induction logic (of saying “one is not enough, then two is not enough, because two will soon enough become one, and we already said one is not enough…”), you should continue saying three is not enough either, because three can become two just as easily (with the same probability as two can become one!…), and you already rejected two — so you therefore must reject three as well, as it is “actually only two” (by the same measure as “two is actually only one”…). And so on: sadly, no number of data centers is acceptable by the above argument. 😉
Although I like it (because I have always enjoyed your style), the above false reasoning ignores that:
a) the aspects of staying online cannot be properly modeled without calculating *probabilities*
b) those probabilities of staying online actually *do* increase quite significantly by doubling a single data center
c) and increase further (by less and less) when adding further data centers (as you already discussed later in the comments with Roger).
So, one is “high risk”, two is “less risk”, and three is “even less risk”, but nothing in your reasoning tells why you settled on “even less risk” instead of “less risk”.
Instead of implying that “no amount of data centers” is enough (because n-1 is never enough), you should have explained how you came to a trade-off between the value of being online *at a certain probability* (p) and the cost of falling out at a chance of 1-p, and how that risk calculation resulted in the number 3.
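Something like the toy calculation below is what I mean (every figure in it is invented, so only the shape of the trade-off matters, not the numbers):

```python
from math import comb

# Toy trade-off between datacenter cost and expected downtime cost.
# All figures below are invented for illustration only.
PER_DC_UPTIME = 0.99              # assumed independent uptime per datacenter
DC_YEARLY_COST = 50_000           # assumed yearly cost of running one datacenter
DOWNTIME_COST_PER_HOUR = 10_000   # assumed business cost of each hour offline

def p_online(n, k):
    """P(at least k of n datacenters up), with k the minimum needed to serve."""
    q = 1 - PER_DC_UPTIME
    return sum(comb(n, i) * PER_DC_UPTIME**i * q**(n - i) for i in range(k, n + 1))

# (n datacenters, k = minimum live datacenters required to count as "online")
for n, k in [(1, 1), (2, 1), (3, 2), (4, 2)]:
    downtime_hours = (1 - p_online(n, k)) * 365 * 24
    expected_cost = n * DC_YEARLY_COST + downtime_hours * DOWNTIME_COST_PER_HOUR
    print(f"{n} DCs (need {k}): ~{downtime_hours:6.2f} h/yr down, "
          f"expected cost ${expected_cost:,.0f}")
```

Which row comes out cheapest depends entirely on what you plug in for the cost of downtime and of data loss, which is exactly the business and technical decision I mean.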
Nevertheless, as an ex-ISP, I’m sooo envious of the easy sleep it must mean… 🙂