Expensify, like every tech company in the world, fights bugs on a daily basis. Most of them affect only a small fraction of our users (e.g. an app crash when resuming on a specific page). Some of them have a significant impact on the main flow (e.g. taking a picture takes 10 seconds longer than usual).
But sometimes, one small line of code will produce a bug that critically affects 100% of our users.
On December 18th, Expensify App 4.2.6 was published on the App Store after a one-week internal review. 4.2.6 was supposed to fix some medium-level crashes that 4.2.5 introduced. A few minutes later, we got an email from our #1 customer complaining about this new version: the app was crashing when going to the expenses list.
100% of our users couldn’t use the main feature of our app.
If you have already experienced such a thing, you know how bad it is. At this point, it would take at least two days to ship a fix, since the App Store’s review process is fairly slow.
We’ve always released our app on the Play Store first, using the custom roll-out feature that allows us to progressively add more users to the release pool. When Android reaches 100% of the pool, we release the app on the App Store. As all our apps share the same code, it’s highly unlikely that the app crashes on iOS and not on Android.
We also had a few automated tests for Android & iOS that made sure that the main flows weren’t affected by our changes. And of course, all our versions are released internally so members of the team can test them and report any bug or crash they find.
And yet, we had released the worst app version in Expensify’s history. It turned out that two of those three crucial barriers weren’t applied correctly. First of all, we didn’t run the automated tests on the very latest version of the code, but on the commit just before. The very last commit looked like this:
This great typo sailed through our review process and got merged into the master branch. To our foolish minds, such a small code change didn’t require launching a 1-hour-long test run, nor did it require publishing a new internal release. Also, this crash didn’t affect Android because this part of the code was never executed there, due to an inconsistency between platform calls.
We somehow managed to stop the app from crashing with a hack in the API, but the whole resolution took about a day. Each time we release such a crappy version, we lose the trust of our customers, our image deteriorates piece by piece, and at the end of the day the Expensify App appears to be the work of amateurs incapable of creating quality products (we’re pretty sure that last part isn’t true).
This Can’t Happen Again
Following this “fire”, we’ve decided to drastically improve our release process, starting with the build system. We’ve shifted from manual builds to a complete Jenkins build system called Expensibuild.
This email describes exactly what changed in this release compared to the previous one, so we know what to test. If the tests fail, an alert is sent to the mobile team urging it to fix the bug.
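The root cause described earlier was that the tests ran on the commit just before the one that shipped. As a minimal sketch of how a build system could guard against that, assuming the build records which commit the tests ran against, the check below refuses a release unless the tests passed on the exact release commit. The names `TestRun` and `can_release` are hypothetical and are not taken from Expensibuild.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TestRun:
    """Result of one automated test run (hypothetical record)."""
    commit: str   # commit hash the tests actually ran against
    passed: bool  # True if every test in the run succeeded


def can_release(release_commit: str, last_run: TestRun) -> Tuple[bool, str]:
    """Allow a release only if tests passed on the exact commit being shipped."""
    if last_run.commit != release_commit:
        # This is the situation from the story: tests ran on an older commit.
        return False, (
            f"tests ran on {last_run.commit}, "
            f"not on release commit {release_commit}"
        )
    if not last_run.passed:
        return False, f"tests failed on {release_commit}"
    return True, "ok"


if __name__ == "__main__":
    # Tests ran on the previous commit, so the release is blocked.
    allowed, reason = can_release("b2c3d4", TestRun(commit="a1b2c3", passed=True))
    print(allowed, reason)
```

With a check like this in place, even a one-line change forces a fresh test run before anything can be published.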
Making People Test
But automated tests are not enough. There are flows that you can’t reproduce accurately with computers, and you’ll always need real people to actually test your product before releasing it. However, people are busy, and the priorities of their own role are hard to compete with.
So to encourage people to use the app, we reimbursed their afternoon coffee break as long as they used the latest version to expense it. Whenever Expensify’s employees go to the local coffee shop, they just have to SmartScan their receipt to get it reimbursed. We’ve been able to detect multiple bugs that our tests didn’t catch with this little trick.
Improve Bug Catching
Each crash comes along with logs, user identifiers, device identifiers, and so on.
We’ve also added code to capture a snapshot of the screen at the moment a crash occurs, as well as the position of the last touch performed on the screen and the delay between that touch and the crash. This allows us to know exactly what happened and reproduce the bug easily.
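As an illustration of the last-touch bookkeeping described above, here is a minimal, platform-neutral sketch. It assumes the app can call `record_touch` from its touch handler and `build_report` from its crash handler. The class, method, and field names are hypothetical, and the screen snapshot is left out because capturing it is platform specific.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Touch:
    x: float
    y: float
    timestamp: float  # seconds, from any monotonic clock


class CrashContext:
    """Remembers the most recent touch so a crash report can include it."""

    def __init__(self) -> None:
        self.last_touch: Optional[Touch] = None

    def record_touch(self, x: float, y: float, timestamp: float) -> None:
        # Only the latest touch is kept; older ones are overwritten.
        self.last_touch = Touch(x, y, timestamp)

    def build_report(self, crash_timestamp: float) -> dict:
        """Return the touch-related fields to attach to a crash report."""
        if self.last_touch is None:
            return {"last_touch": None, "delay_seconds": None}
        return {
            "last_touch": (self.last_touch.x, self.last_touch.y),
            # Time elapsed between the last touch and the crash.
            "delay_seconds": crash_timestamp - self.last_touch.timestamp,
        }


if __name__ == "__main__":
    ctx = CrashContext()
    ctx.record_touch(120.0, 540.0, timestamp=100.0)
    print(ctx.build_report(crash_timestamp=100.25))
```

A short delay between touch and crash suggests the tap itself triggered the failure, which narrows down where to look.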
The 4.2.7 release was the first to reach 5/5 stars on both the App Store & Play Store during its 2-month lifespan. It was also incredibly stable: only 1,500 crashes over about 3,000,000 sessions, or roughly one crash per 2,000 sessions. Users were happy, and so were we.
But most of all, implementing those processes has removed the heavy work that we were doing manually: linting, testing, and building. It gave engineers more time to focus on what really matters: fixing bugs and implementing features.
It’s sometimes better to think about what could be improved, what you can delegate to computers or what makes your process not good enough. Having this kind of introspection is essential for a team to enhance its process, but sometimes it takes a fire to remember that.
And, well, if you want to experience the joys of YAPL (our rockstar cross-platform mobile framework) and fire fighting, feel free to send us an email at email@example.com, we’d love to hear from you!