Expensify has three geo-redundant, realtime-replicated datacenters — each of which holds more than enough hardware to power the full Expensify site, and all three combined should be massive overkill. So why has Expensify been so slow these past few days? A few reasons, actually:
Massive Traffic Spike.
In short, this month is off with a bang. Nearly every day has set a new all-time traffic record, blowing away all historical averages. Being featured in USA Today, NBC News, and the Wall Street Journal definitely contributes. And traffic is always up at the start of the week and at the start of the month, so that’s a double whammy. But I think the biggest contributing factor is our friend the IRS: the April 15th deadline is looming large for our millions of users, and they are starting to take it very seriously. This is bringing out a new user behavior we don’t have a ton of experience with: people uploading receipts in bulk by the hundreds (or even *thousands*), putting new stresses on the system. All told, it’s causing the site to break in new and exciting ways we’ve never experienced before. Most of those ways only affect small numbers of users, but today was different.
The Short Story.
A workaround is in place and a permanent fix is expected next week.
The Long Story.
Remember those three realtime-replicated datacenters? “Replicated” is a key word. Our proprietary WAN database replication technology (which we’re hoping to open source) has been dutifully and reliably powering the site for years. In a very general way, every command is processed like this:
1. The webserver receives a request from the web browser
2. The webserver sends the command to one of our 5 database servers
3. If it’s a “read only” command, the database processes the command and immediately responds
4. If it’s a “read/write” command, then it “escalates” the command to the “master” database — the server that’s in charge of coordinating all writes
5. The master processes the command using a transaction
6. The master sends the SQL of the command to the child databases, which re-execute the transaction locally
7. The children all respond “looks good!” as soon as they finish their transaction
8. Once half the children respond with the all-clear, the master commits the transaction, notifies the children to commit, and notifies the original database that its command has been processed
9. The original database receives notification from the master that the command is done, and then in turn responds to the webserver with the results
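To make that flow a bit more concrete, here’s a minimal sketch of the read/write path in Python. It’s purely illustrative (the `Child` class, the function names, and the instant acknowledgements are stand-ins, not our actual code), but it shows the core idea: the master ships the SQL to every child and commits once at least half of them approve.

```python
class Child:
    """Illustrative stand-in for a peer connection to a child database."""
    def __init__(self, name):
        self.name = name
        self.pending_sql = None

    def send(self, sql):
        # Step 6: the child receives the SQL and re-executes it locally.
        self.pending_sql = sql

    def has_acknowledged(self):
        # Step 7: the child says "looks good!" once its transaction finishes.
        return self.pending_sql is not None

    def commit(self):
        self.pending_sql = None


def replicate(children, sql):
    """Ship `sql` to every child, then commit once half of them approve."""
    for child in children:
        child.send(sql)

    quorum = len(children) // 2                       # half the children
    while sum(c.has_acknowledged() for c in children) < quorum:
        pass                                          # wait for acknowledgements

    # Step 8: quorum reached -- commit locally, tell the children to commit,
    # and notify the originating database that the command is done.
    for child in children:
        child.commit()
    return "committed"


print(replicate([Child("child1"), Child("child2"), Child("child3"), Child("child4")],
                "UPDATE reports SET total = total + 1;"))
```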
All this normally happens in around 100ms, with the replication stage (steps 6-8) taking about 40ms. Today, it started taking on the order of 14,000ms, with replication taking about 200ms. Furthermore, all our databases were operating at 100% CPU — even though performance had slowed to a crawl, and even though those servers are rated for far more traffic than they were receiving. All three of these seemed impossible on the surface, and even more impossible given that we didn’t change any hardware or software related to any of these systems. What could make this so slow? Three things:
- A cache
- A list
- A socket
The first seemed really obvious: let’s prioritize the important commands over the less important commands. In order to make the Report page load quickly, we precalculate a variety of aspects of the report and store them in a cache. But updating the cache isn’t as important as actually responding to a user request, so we deprioritize it. This is normally fine — the cache update is put at the end of the master’s command list, and when there is a lull we update them all, without delaying any realtime user actions. What could possibly go wrong with this great optimization?
The problem is if there is no lull. Recall that we’re seeing unprecedented levels of traffic, meaning activity levels that were previously brief spikes have become the new norm. This means that for long periods every day, we get so many high-priority commands that the master never has a chance to process the low-priority cache-update commands (a situation known as “starvation”). Even that should normally be fine — the low-priority commands would simply wait until the spike finished, and then everything would catch up. No problem.
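Here’s a toy illustration of that starvation (the commands and the Python heap are made up; the real command queue is more involved): as long as a new high-priority command shows up at least as fast as commands are processed, the low-priority cache updates never reach the front of the queue.

```python
import heapq

HIGH, LOW = 0, 1   # lower number = served first

queue = []
heapq.heappush(queue, (LOW, "update report cache #1"))
heapq.heappush(queue, (LOW, "update report cache #2"))

# Simulate a sustained spike: a fresh high-priority user command arrives on
# every tick, so the master always has something more urgent to do.
for tick in range(5):
    heapq.heappush(queue, (HIGH, f"user command #{tick}"))
    _, command = heapq.heappop(queue)
    print(f"tick {tick}: processing {command}")

# The cache updates are still sitting in the queue: they were starved.
print("still waiting:", [cmd for _, cmd in queue])
```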
However, a byproduct of that “no problem” was an extremely large queue of escalated commands — the children had sent a huge number of cache commands to the master. That too should normally be fine, except that when I wrote that code (in 2009) I assumed the number of escalated commands from any child database would always be very small. So I used a list to hold them, meaning whenever any database got a response to an escalated command, it had to iterate over that list to find a match. (And there were several other places where it iterated over that list.) Normally this would be fine. But if that list gets long — like, really, really long — then iterating over it at extremely high frequency gets very expensive. My bad.
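The shape of the problem looks roughly like this (illustrative Python with made-up names, not the real code): escalated commands live in a plain list, so matching up a response means walking that list from the front every single time.

```python
escalated = []   # [(command_id, command), ...] -- the queue that grew too long

def on_escalate(command_id, command):
    escalated.append((command_id, command))

def on_response(command_id):
    # O(n): scan the whole list to find the command this response belongs to.
    # Harmless with a dozen entries; very expensive when the backlog holds
    # hundreds of thousands and this runs for every single response.
    for index, (cid, command) in enumerate(escalated):
        if cid == command_id:
            del escalated[index]
            return command
    return None

on_escalate(42, "update report cache")
print(on_response(42))   # every lookup walks the list from the front
```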
And this exacerbated the third issue. Recall that when the master goes to commit a transaction, it waits for approval from two children. Normally that happens incredibly fast. However, once the children started seeing their CPU eaten up by list iteration, their load increased substantially — even though none of that work was useful. This had the unfortunate side effect of making the child databases generally slow down, meaning they were slow to process messages from their peers. In particular, the child servers started processing replication commands more slowly than they otherwise would.
The sum of all three created a vicious cycle: as the child servers’ CPU usage climbed, replication slowed, causing the backlog of low-priority commands to grow, which drove CPU usage up even further. Repeat.
Once we understood this, the first thing we did was disable that cache-update command, then restart all the child databases, and then restart the master. This cleared the backlog, dropped CPU usage, and brought replication speeds back up to normal. Problem solved… sorta.
The Quick Solution.
However, replication speeds were still too slow. Replication is highly influenced by network latency. The latency between our Los Angeles and Santa Clara datacenters is around 10ms, but between either of those and Las Vegas it’s 40ms. It hadn’t always been this way: we’d known there was some problem slowing down that link, but it hadn’t been a major issue… until our traffic started going up dramatically in March. At that point replication slowed to the point where, at peak traffic, it couldn’t keep up. That created a backlog of write commands that delayed processing. (Each individual command still ran at full speed; it just started later than it should have.) To mitigate this, we moved the master from Las Vegas to Santa Clara, meaning it could replicate quickly down to LA (without waiting on the slower Las Vegas link), speeding up replication by about 4x and making all right in the world.
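A back-of-the-envelope model shows why moving the master helped by roughly 4x. It assumes the master commits once two of its four children acknowledge, and it assumes a particular placement of children across the datacenters purely for illustration:

```python
def replication_latency(child_latencies_ms, quorum=2):
    """Time until the quorum-th acknowledgement arrives (illustrative model)."""
    return sorted(child_latencies_ms)[quorum - 1]

# Master in Las Vegas: every child sits across a ~40ms link.
print(replication_latency([40, 40, 40, 40]))   # -> 40

# Master in Santa Clara: assume two children are ~10ms away in Los Angeles,
# so the quorum is met without ever waiting on the slow Las Vegas link.
print(replication_latency([10, 10, 40, 40]))   # -> 10, roughly a 4x improvement
```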
The Long Term Solution.
But that’s still not a great solution: our report-caching command is still disabled, and replication speeds are still not where we’d like them. So we still have work to do:
- Stop using a list of escalated commands and replace it with a “map”, allowing fast lookups without the CPU problem (see the sketch after this list).
- Re-enable the report caching command, which will speed up the Reports page.
- Reduce latency between our datacenters. This can be done with a dedicated fiber connection, which we’re investigating now. This will generally increase performance across the board, as well as provide a bigger buffer against temporary replication speed problems.
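For the first item, the intended change looks roughly like this (again illustrative, mirroring the hypothetical names from the list sketch above): keep escalated commands in a map keyed by command id, so matching a response is a constant-time lookup instead of a scan over the whole backlog.

```python
escalated = {}   # {command_id: command} -- keyed for constant-time lookups

def on_escalate(command_id, command):
    escalated[command_id] = command

def on_response(command_id):
    # O(1): no iteration, no matter how long the backlog gets.
    return escalated.pop(command_id, None)

on_escalate(42, "update report cache")
print(on_response(42))
```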
The final conclusion is that we’re terribly sorry about the delays. We agonize over every problem encountered by our users, and are always disappointed when something slips through the cracks. We’ve invested a tremendous amount of time, people, and hard cash into building the best possible solution, and we do everything we can to anticipate and prevent any conceivable problem. But this is complex stuff, and sometimes our best efforts come up short. Thank you for your patience, and I appreciate your understanding as we continue to grow this company by leaps and bounds.