Expensify has three geo-redundant, realtime-replicated datacenters — each of which holds more than enough hardware to power the full Expensify site, and all three combined should be massive overkill. So why has Expensify been so slow these past few days? A few reasons, actually:
Massive Traffic Spike.
In short, this month is off with a bang. Nearly every day has set a new all-time traffic record, blowing away all historical averages. Being featured in USA Today, NBC News, and the Wall Street Journal definitely contributes. And traffic is always up at the start of the week, and the start of the month, so that’s a double-whammy. But I think the biggest contributing factor is our friend the IRS: the April 15th deadline is looming large among our millions of users, and they are starting to take it very seriously. This is bringing out a new user behavior we don’t have a ton of experience with: people uploading receipts in bulk by the hundreds (or even *thousands*), putting new stresses on the system. All told, it’s causing the site to break in new and exciting ways we’ve never experienced before. Most of those ways only affect small numbers of users, but today was different.
The Short Story.
A workaround is in place and a permanent fix is expected next week.
The Long Story.
Remember those three realtime-replicated datacenters? “Replicated” is a key word. Our proprietary WAN database replication technology (which we’re hoping to open source) has been dutifully and reliably powering the site for years. In a very general way, every command is processed like this:
- Webserver receives a request from the web browser
- Webserver sends the command to one of our 5 database servers
- If it’s a “read only” command, the database processes the command and immediately responds
- If it’s a “read/write” command, then it “escalates” the command to the “master” database — the server that’s in charge of coordinating all writes.
- The master processes the command using a transaction
- The master sends the SQL of the command to the child databases, who re-execute the transaction locally
- The children all respond “looks good!” as soon as they finish their transaction
- Once half the children respond with the all-clear, the master commits the transaction, notifies the children to commit, and notifies the original database that its command has been processed.
- The original database receives notification from the master that the command is done, and then in turn responds to the webserver with the results.
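A minimal Python sketch of that flow, not the actual replication code: the `Child` class, the `replicate` and `handle` functions, and the acknowledgement times are all hypothetical, chosen only to show how reads stay local and how the master's commit time is set by the child that completes the quorum.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    """A child database that re-executes the master's SQL locally (hypothetical model)."""
    name: str
    ack_ms: float  # assumed time until this child reports "looks good!"
    committed: list = field(default_factory=list)


def replicate(sql, children):
    """Return the time (ms) at which the master may commit `sql`.

    The master commits once half of the children have approved, so the
    commit time is the ack time of the child that completes that quorum.
    """
    quorum = max(1, len(children) // 2)  # "half the children"
    acks = sorted(child.ack_ms for child in children)
    commit_at = acks[quorum - 1]         # quorum-th fastest approval
    for child in children:               # master then tells every child to commit
        child.committed.append(sql)
    return commit_at


def handle(sql, read_only, children):
    """Route one command: reads are answered locally, writes escalate to the master."""
    if read_only:
        return ("local", 0.0)
    return ("escalated", replicate(sql, children))


# Four children (five servers minus the master), with made-up ack times.
children = [Child("a", 12.0), Child("b", 38.0), Child("c", 41.0), Child("d", 45.0)]
print(handle("SELECT 1", True, children))                        # ('local', 0.0)
print(handle("UPDATE reports SET total = 10", False, children))  # ('escalated', 38.0)
```

In this model the two slowest children never delay a commit, which is why the speed of the second-fastest link matters so much later in the story.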
All this normally happens in around 100ms, with the replication stage (the sixth through eighth steps above) taking about 40ms. Today, it started taking on the order of 14,000ms, with replication taking about 200ms. Furthermore, all our databases were operating at 100% CPU even though performance had slowed to a crawl, and even though those servers are rated for far more traffic than they were receiving. All three of these seemed impossible on the surface, and even more impossible given that we didn’t change any hardware or software related to any of these systems. What could make this so slow? Three things:
- A cache
- A list
- A socket
The first seemed really obvious: let’s prioritize the important commands over the less important commands. In order to make the Report page load quickly, we precalculate a variety of aspects about the report and store them in a cache. But updating the cache isn’t as important as actually responding to a user request, so we deprioritize it. This is normally fine: the cache update is put on the end of the master’s command list, and when there is a lull we update them all, without delaying any realtime user actions. What could possibly go wrong with this great optimization?
The problem is if there is no lull. Recall that we’re seeing unprecedented levels of traffic, meaning activity levels that were previously brief spikes have become the new norm. This means for long periods every day, we get so many high-priority commands that the master never has a chance to process the low-priority cache-update commands (known as “starvation”). This should normally be fine — the low-priority commands would wait until the spike finished, and then everything should update fine. No problem.
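To make the starvation concrete, here is a small simulation under stated assumptions: a two-level priority queue, one cache update arriving per tick, and a master with a fixed per-tick capacity. The function name `run` and all the numbers are illustrative, not measurements from the real system.

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1  # smaller number is served first


def run(ticks, high_per_tick, capacity_per_tick):
    """Simulate a master with a two-level priority queue.

    Each tick, `high_per_tick` user commands and one cache update arrive,
    and the master processes up to `capacity_per_tick` commands.
    Returns the number of cache updates still waiting at the end.
    """
    queue = []
    order = count()  # tie-breaker so equal priorities stay first-in, first-out
    for _ in range(ticks):
        for _ in range(high_per_tick):
            heapq.heappush(queue, (HIGH, next(order), "user command"))
        heapq.heappush(queue, (LOW, next(order), "cache update"))
        for _ in range(capacity_per_tick):
            if not queue:
                break
            heapq.heappop(queue)
    return sum(1 for priority, _, _ in queue if priority == LOW)


print(run(ticks=1000, high_per_tick=3, capacity_per_tick=5))  # 0: there is always a lull
print(run(ticks=1000, high_per_tick=5, capacity_per_tick=5))  # 1000: no lull, nothing drains
```

As long as user commands alone fill the master's capacity, every cache update that arrives simply joins the backlog.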
However, a byproduct of that “no problem” is an extremely large queue of escalated commands: the children sent a huge number of cache commands to the master. This should normally be fine, except that when I wrote that code (in 2009) I assumed the number of escalated commands from any child database would always be very small. So I used a list to hold them, meaning whenever any database got a response to an escalated command, it would need to iterate over that list to find a match. (And there were several other occasions on which it iterated over that list.) Normally this would be fine. But if that list gets long (like, really, really long), then iterating over that list at extremely high frequencies gets very expensive. My bad.
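The cost difference is easy to see in a toy comparison. The two classes below are hypothetical stand-ins (the real code is not shown in this post): one keeps escalated commands in a list and scans it to match a response, the other keys them by a command ID in a dict.

```python
class ListQueue:
    """Escalated commands held in a plain list: matching a response is a linear scan."""

    def __init__(self):
        self.pending = []

    def escalate(self, command_id, command):
        self.pending.append((command_id, command))

    def complete(self, command_id):
        steps = 0
        for index, (pending_id, command) in enumerate(self.pending):
            steps += 1
            if pending_id == command_id:
                del self.pending[index]
                return command, steps
        raise KeyError(command_id)


class MapQueue:
    """Same interface backed by a dict: matching a response is a single hash lookup."""

    def __init__(self):
        self.pending = {}

    def escalate(self, command_id, command):
        self.pending[command_id] = command

    def complete(self, command_id):
        return self.pending.pop(command_id), 1


backlog = 100_000  # illustrative backlog size, not a measured figure
as_list, as_map = ListQueue(), MapQueue()
for command_id in range(backlog):
    as_list.escalate(command_id, "cache update")
    as_map.escalate(command_id, "cache update")

# A response for the most recently escalated command arrives.
print(as_list.complete(backlog - 1)[1])  # 100000 entries examined in the list
print(as_map.complete(backlog - 1)[1])   # 1 lookup in the map
```

With a handful of pending commands the two are indistinguishable; with a backlog in the hundreds of thousands, every response pays for a full scan.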
And this exacerbated the third issue. Recall that when the master goes to commit a transaction, it waits for approval from two children. Normally that happens incredibly fast. However, when the children started seeing their CPU eaten up by the list iteration, their load increased substantially, even though they weren’t doing anything useful. This had the unfortunate side effect of making the child databases just generally slow down, meaning they were slow to process messages from peers. In particular, it meant that the child servers started processing replication commands more slowly than they otherwise would.
The sum of all three created a vicious cycle: as the child servers’ CPU usage increased, the replication speed decreased, causing the backlog of low-priority commands to grow, which pushed CPU usage up even further, and repeat.
Once we understood this, the first thing we did was disable that cache-update command, then restart all the child databases, and then restart the master. This cleared the backlog, caused CPU to drop, and caused replication speeds to increase back up to normal. Problem solved… sorta.
The Quick Solution
However, replication speeds were still too slow. Replication is highly influenced by network latency. The latency between our Los Angeles and Santa Clara datacenters is around 10ms. But between either of those and Las Vegas it is 40ms. It hadn’t always been this way: we’d known that there was some problem that slowed down that link, but it hadn’t been a major problem… until our traffic started going up dramatically in March. This means that replication has slowed to a point where at peak traffic it can’t keep up. This causes a backlog of write commands that delays the processing of commands. (Each command individually runs at full speed, but starts late.) To mitigate this, we moved the master from Las Vegas to Santa Clara, meaning that it could replicate quickly down to LA (without waiting for the slower Las Vegas link), speeding up replication by about 4x and making all right in the world.
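The roughly 4x figure falls out of simple arithmetic on the latencies above. The sketch below rests on assumptions that are not stated in this post: the way the five servers are split across the three datacenters (`SERVERS`), a 1ms latency inside a datacenter (`LOCAL_MS`), and treating the quoted latency as the whole cost of one approval.

```python
# Latencies quoted in the post (ms) between Los Angeles, Santa Clara and Las Vegas.
LATENCY_MS = {
    frozenset(["LA", "SC"]): 10,
    frozenset(["LA", "LV"]): 40,
    frozenset(["SC", "LV"]): 40,
}
LOCAL_MS = 1  # assumption: latency inside one datacenter, not stated in the post

# Assumption: how the five database servers are spread across the datacenters.
SERVERS = ["LA", "LA", "SC", "SC", "LV"]


def latency(a, b):
    """Latency between two sites, using LOCAL_MS when they are the same site."""
    return LOCAL_MS if a == b else LATENCY_MS[frozenset([a, b])]


def quorum_wait(master_site, approvals_needed=2):
    """Latency to the child whose approval completes the quorum."""
    children = list(SERVERS)
    children.remove(master_site)  # one server at this site is the master
    waits = sorted(latency(master_site, child) for child in children)
    return waits[approvals_needed - 1]


before = quorum_wait("LV")  # every child is 40ms away
after = quorum_wait("SC")   # a local child plus the 10ms LA link complete the quorum
print(before, after, before / after)  # 40 10 4.0
```

With the master in Las Vegas, both approvals have to cross a 40ms link; in Santa Clara, the quorum completes without ever touching it.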
The Long Term Solution
But that’s still not a great solution: our report-caching command is still disabled, and replication speeds are still not where we’d like them. So we still have work to do:
- Stop using a list of escalated commands, and replace it with a “map”, allowing fast lookup without the CPU problem.
- Re-enable the report caching command, which will speed up the Reports page.
- Reduce latency between our datacenters. This can be done with a dedicated fiber connection, which we’re investigating now. This will generally increase performance across the board, as well as provide a bigger buffer against temporary replication speed problems.
The final conclusion is that we’re terribly sorry about the delays. We agonize over every problem encountered by our users, and are always disappointed when something slips through the cracks. We’ve invested a tremendous amount of time, people, and hard cash into the best possible solution, and we do everything we can to anticipate and prevent any conceivable problem. But this is complex stuff, and sometimes our best efforts come up short. Thank you for your patience and I appreciate your understanding as we continue to grow this company by leaps and bounds.
Being a support professional, I am really impressed. That is an EXCELLENT update, astonishing that a CEO would take the time and trouble to present something with such clarity to the user community. Says great things about your company.
Based on this, what do you think are the low-traffic times? People would like to work during those times if we can.
I appreciate the reply and look forward to more stability in the future. When the system works we love it; when it doesn’t, it causes us huge delays and pain.
Thank you David for your detailed summary of cause and effects. If you think things are tough now, wait until you get on the China hackers radar and start getting the DOS attacks that have been happening to major banks around the country. These attacks have been registered at over 100 gbps at some sites. I experienced a similar problem at Bank of America that you are having with DB replication at remote backup datacenters. We solved our problem by changing the slow legs to be asynchronous updates from a queue of logged transactions, rather than trying to keep them synchronous and real time.
Thank you for being so clear and close to us. I would like to have more people like you in Spain. Go on!!
David. Thanks for taking the time to write this very clear explanation. It speaks volumes about the company you are building.
Great article – really appreciated the learning as a B2B Startup and as a user the proactive communication and ownership of the issue makes my patronage feel valued.
Wow. A for-real root cause analysis, with short term and long term corrective actions. In the kind of work I do, I don’t see that often enough – bravo, guys!
Excellent address to the issues, and thank you so much!
Hi Dave, Maybe you should look at some non SQL database like MongoDB? Sure it could solve a lot of your headaches with scaling out.
You guys continue to rock. Keep up the good work, David & team.
Thank you David but as an IT services company, we’re wondering why you’re not using AWS or Google Cloud Platform. Wouldn’t it help focus less on infrastructure and more on the application?
As a French cloud solutions reseller, we’ve been waiting for months for your team to solve the little issues that prevent us from selling Expensify to our clients (French special characters when typed on the Android app, ability to use kilometer-based mileage, etc.).
Thank you anyway for your response and keep running.
Hi Olivier, the issue isn’t lack of capacity in any single datacenter, but in challenges replicating between datacenters. EC2 is great (we use it for some systems), but it’s not a magic bullet — nor immune from downtime itself.
Great to see a very transparent update. One of the hard things to deal with – unprecedented success. I have to wonder if Apache Cassandra would help, having been built to replicate across geographically separated datacenters. Netflix has Cassandra clusters that span North America, South America and Europe.
It’s a good suggestion, but Cassandra is typically for unstructured data — our data is actually very structured. (Relational databases were invented for financial data, after all.) But we’re considering moving some of our unstructured data over to a noSQL system like that. So many options!
Wow! As a customer, a customer service/office manager for 25 years and now a business owner, I highly commend you on your ethics, Sir. The chances of the CEO of any other company coming out with not only his own clear, concise and insightful explanation for a problem with his company, but an equally clear and concise plan to correct and eradicate the issue are little to none. I have also had the pleasure of dealing with your tech support staff and must say they are just as clear and straightforward as you must be. Every promise has been kept, I have seen progress towards resolution and believe it will come any day now. Kudos to you and your team Mr. Barrett. I wish you multiplied success in all your ventures. Would love to upgrade but healthcare issues prevail. They won’t always though. I will win the battle and the war, as will you! 🙂
Great update. It’s refreshing for a customer to be treated as an intelligent adult. Thank you.
Well done, really appreciate the openness and accountability to the issues.
P.S. I’d also second the investigation of AWS for your purposes. My company uses it exclusively to address many of our big data needs and I’m sure you would benefit as well for this application.
Agree with others, if more companies posted updates like this there would be a lot less stress in the world. Keep up the good work.
I have been exactly where you are (and back) and know what you are going through. This will seem like a minor roadbump a few months from now. Your product is awesome.
I, too, appreciate the detailed update on why it was taking so long to do work on the site last week. I thought it was trouble on my mobile connection :).
This is a great demonstration of the transparency and commitment to customer service that Expensify lives by and I am certain that it helps build loyalty from customers who understand growing pains, but can’t stand corporate gobbledygook speak.
Keep up the good work, keep striving to improve, and keep us informed (and let us know when the best times for us to work on the site would be until the situation is resolved fully.).
David: fwiw, Cassandra also handles structured data as well, as it’s a column family store. The recent updates to the CQL language make it simpler as well, e.g. http://www.datastax.com/docs/1.1/references/cql/index
Thank you David for your replies.
LOVE Expensify. Great company, product, and leader. Keep up the good work. A few road bumps, but you are already on the right track to keep all us customers happy.
Very impressed with this company.
Still having problems with the slowness and also failures to update CC properly–keeps stating server errors for Expensify
We don’t have any other reports of sluggishness, but we’ll be glad to look into your particular case if you’ll follow up with email@example.com. Thanks for your cooperation!
You guys should invest in some Avere nodes! latency masking across multiple environments!
Thank you for your report! I too am impressed by the responsiveness of Expensify and thoroughly enjoy using it.
I too am amazed by the detail and frankness.
Just so you know David – I have been working for 15 years with an HP product called LoadRunner. It is used to stress test applications, then tune them for better performance where you can see the improvement as you make each configuration as you discuss in your post.
I would love the opportunity to work with Expensify, as your corporate logo is right on – expense reports that don’t suck.
I hesitate to call this corporate transparency because I think that description falls short of the example David and his team exemplifies time after time. What I see from this post is corporate culture that gently wraps the user into the experience, the culture and company. I may not be able to run into David Barrett in the hallway and share a quick anecdote but I still feel strangely connected to the company through posts like this. We all are. For me, I am happy to ride the wave with you. Well done.
Well that explains the picture of a glacier that came up on your start page
Don’t even sweat it David. I noticed the downtime but caught up by the next day. I’m willing to deal with a little delay in using your product because it rocks so much more than any other expense tool I’ve used. With that said, it’s still cool to see you guys working hard on fixing it.
Thanks for making expensify!
Great RFO, and also an interesting technical read. I’d love to see the DB once (or if) it goes Open Source, I can see it making a few technical challenges we’re facing a little easier.
(I’m a big fan of ACID based DBs. The concept of “eventual consistency” always worries me with critical data)
I live in Singapore, and have lots of Apple stuff. It absolutely boggles my mind how unresponsive Singapore Airlines and Apple have been about recent failures (SQ website; Apple Maps). They both have great products, but their refusal to acknowledge and update progress on fixing “hiccups” is a huge destroyer of good will. Your approach makes me feel connected to Expensify and not only willing to be patient, but to up my ante by “upselling” myself away from the basic “core” level user.
Skype, Apple, Singapore Airlines, etc.? Pay attention to how to “own” your performance.