Hi all, just a quick update to let you know there was a synchronization problem last night that has since been repaired. Affected users have been notified personally; if you haven’t heard anything from us, you’re good to go. More details follow below; please email firstname.lastname@example.org or email@example.com if you have any questions or concerns. Sorry for the hiccup — there goes our 99.9% uptime! (Now we’re at 99.84%)
The issue started when we kicked off a large manual update on the database to clean up some old code. Yes, the most dangerous place for a patient to be is on the operating room table, and the riskiest changes are those where nothing is supposed to change. Regardless, the large update became an unexpectedly huge update because the objects in question (policies) were on average larger than we anticipated. This made synchronization between our three realtime replicated datacenters take longer than expected, causing a timeout between the parent database and its children. The children gave up on the parent and continued on in splendid form (the older child became the new parent), and the old parent went into a “something is not right” safe mode, waiting for an admin to fix it. All this happened over the span of a few minutes, resolving itself entirely automatically, without any downtime.
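As an illustration only (this is not Expensify's actual sync code), the failover described above can be sketched in a few lines of Python. The `Node` class, the `fail_over` function, the role names, and the reading of "older" as "earliest start time" are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    started_at: float      # assumption: a lower value means an "older" node
    role: str = "child"    # "parent", "child", or "safe_mode"


def fail_over(nodes, last_heard_from_parent, now, timeout):
    """Promote the oldest child if the parent has been silent for too long."""
    parent = next(n for n in nodes if n.role == "parent")
    if now - last_heard_from_parent <= timeout:
        return parent                      # parent is still healthy

    # The parent timed out: park it until an admin can take a look.
    parent.role = "safe_mode"

    # The older child takes over as the new parent.
    children = [n for n in nodes if n.role == "child"]
    new_parent = min(children, key=lambda n: n.started_at)
    new_parent.role = "parent"
    return new_parent


# Hypothetical three-datacenter setup; the parent went quiet 40 seconds ago.
nodes = [Node("dc1", 1.0, "parent"), Node("dc2", 2.0), Node("dc3", 3.0)]
new_parent = fail_over(nodes, last_heard_from_parent=100.0, now=140.0, timeout=30.0)
```

In this toy run `dc2` is promoted, `dc1` is parked in safe mode, and `dc3` stays a child.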
However, it left us in a non-ideal state: one of the children was acting as a temporary parent, and the real parent was offline in a safe hibernation mode. To get back to normal, we began rebuilding the old parent’s database from scratch by copying it over from the new parent. That takes a long time (we’ve got a lot of data), so while waiting we re-executed the original query by breaking it up into a bunch of smaller queries, and all was right in the world. Again, no downtime, no problems.
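For readers wondering what breaking one big update into smaller queries looks like, here is a minimal sketch using Python's built-in `sqlite3` module. The `policies` table layout, the `migrated` column, and the batch size are hypothetical stand-ins; the post does not say which database or schema was involved.

```python
import sqlite3


def update_in_batches(conn, batch_size=1000):
    """Apply one large UPDATE as many small ones, committing after each batch."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE policies SET migrated = 1 "
            "WHERE id IN (SELECT id FROM policies WHERE migrated = 0 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()              # each small batch is its own transaction
        if cur.rowcount == 0:      # nothing left to update
            break
        total += cur.rowcount
    return total


# Hypothetical data: 2,500 rows that all need the cleanup.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policies (id INTEGER PRIMARY KEY, migrated INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO policies (id) VALUES (?)", [(i,) for i in range(2500)])
conn.commit()

updated = update_in_batches(conn, batch_size=1000)
```

Because every batch commits separately, each change set that has to be replicated stays small, which is the point of splitting the update in the first place.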
But things started to get complicated when we brought the old parent back into the fold. Doing a realtime handoff between three live servers (changing which is “in charge” on the fly, without dropping any requests) is difficult. Now this isn’t an unusual operation: we do it all the time. That’s how we upgrade our servers without any downtime: we upgrade them one at a time, and the synchronization layer takes care of the rest seamlessly. Our sync layer is one part of the secret sauce that gives us incredibly high performance and uptime AND maintains incredibly tight security. We’ve done this operation literally hundreds of times before, so often that it’s become routine.
In this case, however, a simple disk error threw us off. To be completely honest, we’re still analyzing that to see exactly why the disk failed in that way at that time. But that actually wasn’t our mistake — disks fail all the time, it’s no big deal. (And any organization that treats it as a big deal obviously doesn’t deal with disks much.) Our mistake was in trying to manually correct the corrupted database rather than just re-copying it and rebuilding it. One of the downsides of doing maintenance like this in the middle of the night is we get tired, and our decisions aren’t always perfect. We should have just let it sit overnight (after all, throughout this whole process the other two databases were performing flawlessly) and dealt with it after a good night’s sleep and a big breakfast.
Unfortunately, we didn’t. So we made what seemed like a fix, ran some tests, and concluded it was fixed. (One problem with running realtime replicated databases is that they’re never in *exactly* the same state due to network latency and such. So it’s actually really hard to confirm that two databases are the same without taking them offline, and as you might imagine, we’re loath to take down the site unnecessarily.) We lit up the parent server, it synchronized and took over parenting from the children, and everything was one big happy family again.
But like most happy families (in the movies, at least), this family harbored a dark secret. The parent had a corrupted soul that slowly infected the children in dark and devious ways. More specifically, new accounts were being improperly created, receipts improperly linked, reports improperly submitted, etc.
The second we discovered this, we took the entire site down for maintenance.
Thankfully the problem had only been in the wild for a few hours, and it only affected new users who signed up during that period. But we’re signing up users really fast these days, so as painful as it is to admit, there were about 400 users whose accounts were affected.
As for how we know this, part of synchronizing is maintaining a log of all changes to the database. Not back to the start of time, but back far enough that if two servers disconnect for a bit, they can resynchronize upon reconnect by comparing their journals and seeing where each one left off. Accordingly, this journal is a really powerful debugging tool, and in this case it pinpointed exactly which accounts were affected and in which ways.
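To make the journal idea concrete, here is a toy sketch of journal-based resynchronization in Python. It is an illustration under assumptions, not the real sync layer: the `Replica` class, the `(seq, key, value)` entry format, and the `resync` function are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    journal: list = field(default_factory=list)  # ordered (seq, key, value) entries
    data: dict = field(default_factory=dict)

    def apply(self, seq, key, value):
        """Record a change in the journal and apply it to the data."""
        self.journal.append((seq, key, value))
        self.data[key] = value

    def last_seq(self):
        return self.journal[-1][0] if self.journal else 0


def resync(behind, ahead):
    """Replay onto `behind` every journal entry it has not yet seen."""
    start = behind.last_seq()
    # The journal does not go back forever; if the entries we need are
    # already gone, replaying cannot work and a full copy is required.
    if ahead.journal and ahead.journal[0][0] > start + 1:
        raise LookupError("journal does not reach back far enough")
    missing = [entry for entry in ahead.journal if entry[0] > start]
    for seq, key, value in missing:
        behind.apply(seq, key, value)
    return missing


# Two replicas see the same first two changes, then `b` disconnects.
a, b = Replica(), Replica()
for seq, key, value in [(1, "acct:1", "created"), (2, "acct:2", "created")]:
    a.apply(seq, key, value)
    b.apply(seq, key, value)
a.apply(3, "acct:3", "created")
a.apply(4, "acct:2", "updated")

replayed = resync(b, a)  # entries 3 and 4 are replayed onto `b`
```

The list returned by `resync` doubles as a record of exactly which keys changed while a replica was away, which is the same property that makes a journal handy for debugging.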
So, to make a long story short: not a great day for Expensify. (Though we did use this opportunity to upgrade our servers’ hard drives by 8x, so that’s something.) But we’re back on our feet and have learned a few new lessons. I personally apologize for the problem and, while I can’t pledge that there won’t be more, I do promise that we take every single problem very seriously. Please write firstname.lastname@example.org if you feel your account has any problems that we missed, or write me directly at email@example.com if you have any questions or concerns. And if you were one of the very few affected users, have a cup of coffee and send me the expense report — it’s the least I can do.
Thanks for using Expensify, and have a great weekend.
Founder and CEO of Expensify