Building a Fireproof App (After the Old One Caught Fire)

ldiqual, May 1, 2014

Expensify, like every tech company in the world, fights bugs on a daily basis. Most of them affect only a small fraction of our users (e.g. an app crash when resuming on a specific page). Some of them have a noticeable impact on the main flow (e.g. taking a picture takes 10 seconds longer than usual).

But sometimes, one small line of code will produce a bug that critically affects 100% of our users. 

The Bug


On December 18th, Expensify App 4.2.6 was published on the App Store after a one-week internal review. 4.2.6 was supposed to fix some medium-severity crashes that 4.2.5 had introduced. A few minutes later, we got an email from our #1 customer complaining about the new version: the app crashed whenever they opened the expenses list.

100% of our users couldn’t use the main feature of our app.

If you have been through something like this, you know how bad it is. At this point, it would take at least two days to ship a fix, since the App Store’s review process is fairly slow.

We’ve always released our app on the Play Store first, using the custom roll-out feature that lets us progressively add more users to the release pool. When Android reaches 100% of the pool, we release the app on the App Store. As all our apps share the same code, it’s highly unlikely that the app would crash on iOS and not on Android.

We also had a few automated tests for Android & iOS that made sure the main flows weren’t affected by our changes. And of course, all our versions are released internally so that members of the team can test them and report any bugs or crashes they find.

And yet, we had released the worst app version in Expensify’s history. It turned out that two of those three crucial barriers weren’t applied correctly. First of all, we didn’t run the automated tests on the very latest version of the code, but on the commit just before. The very last commit looked like this:

[Image: typo]

This great typo sailed through our review process and got merged into the master branch. To our foolish minds, such a small code change didn’t require launching an hour-long test run, nor did it require publishing a new internal release. On top of that, the crash didn’t affect Android, because this part of the code was never executed there due to an inconsistency between platform calls.
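The gap described above (tests passing on one commit while a different commit ships) can be closed with a simple check. The following is only a minimal sketch, not our actual tooling: the function name `canRelease` and the shape of the `testRun` record are hypothetical, chosen for illustration.

```javascript
// Hypothetical release gate. `testRun` is assumed to be a record such as
// { commit: 'a1b2', passed: true } produced by whatever runs the test suite.
function canRelease(releaseCommit, testRun) {
  if (!testRun || !testRun.passed) {
    return { ok: false, reason: 'no passing test run' };
  }
  // The key check: the suite must have run on the exact commit being shipped,
  // not on the commit just before it.
  if (testRun.commit !== releaseCommit) {
    return {
      ok: false,
      reason: 'tests ran on ' + testRun.commit + ', not on ' + releaseCommit,
    };
  }
  return { ok: true, reason: 'tests passed on the release commit' };
}

// Usage: a one-line "typo" commit on top of a tested commit is still rejected.
console.log(canRelease('b2c3', { commit: 'a1b2', passed: true }).reason);
```

The point of the design is that the size of the change never enters the decision: a one-character commit invalidates the previous test run just like a large one.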

We somehow managed to make the app stop crashing with a hack in the API, but the whole resolution took about a day. Each time we release such a crappy version, we lose the trust of our customers, our image deteriorates bit by bit, and at the end of the day the Expensify App looks like the work of amateurs incapable of creating quality products (we’re pretty sure that last part isn’t true).

This Can’t Happen Again

Something to know about the mobile team is that we are only 3 people, one of them part-time. And yet we manage to support 4 platforms (iOS, Android, BlackBerry, Windows Phone) thanks to a shared JavaScript codebase and a custom framework (YAPL) that manages platform calls. The app looks exactly the same on all platforms, which allows us to use the same Calabash tests for both Android & iOS.

Expensibuild

Following this “fire”, we’ve decided to drastically improve our release process, starting with the build system. We’ve shifted from manual builds to a complete Jenkins build system called Expensibuild.

[Image: expensibuild]

Expensibuild runs on a Mac Mini to which two iPhone & Android devices are connected. Every day it pulls the master branch, lints the JavaScript code with JSHint, launches the test suite (which currently has about 35 scenarios), sends an internal TestFlight build to everybody in the company, and notifies everyone by email.

[Image: testflight_email]

This email describes exactly what changed in this release compared to the previous one, so we know what to test. If the tests fail, an alert is sent to the mobile team urging it to fix the bug.
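The daily flow described above (pull, lint, test, distribute, and alert on failure) boils down to running steps in order and stopping at the first one that fails. Here is a rough sketch of that idea, not the actual Jenkins configuration: `runPipeline`, the step names, and the stubbed step functions are all hypothetical.

```javascript
// Hypothetical pipeline runner. Each step is { name, run }, where run()
// returns true on success. The real work (git, JSHint, Calabash, TestFlight)
// is stubbed out below so the sketch stays self-contained.
function runPipeline(steps, onFailure) {
  const completed = [];
  for (const step of steps) {
    let ok = false;
    try {
      ok = step.run() === true;
    } catch (err) {
      ok = false; // a step that throws counts as a failure
    }
    if (!ok) {
      onFailure(step.name); // e.g. alert the mobile team
      return { success: false, completed, failedStep: step.name };
    }
    completed.push(step.name);
  }
  return { success: true, completed, failedStep: null };
}

// Usage with stubbed steps: the failing test suite blocks distribution.
const dailyResult = runPipeline(
  [
    { name: 'pull master', run: () => true },
    { name: 'lint', run: () => true },
    { name: 'test suite', run: () => false },
    { name: 'distribute build', run: () => true },
  ],
  (stepName) => console.log('ALERT: step failed: ' + stepName)
);
console.log(dailyResult.completed);
```

Because distribution comes last, a build that fails linting or testing never reaches the rest of the company.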

Making people test

[Image: coffee_policy]

But automated tests are not enough. There are flows that you can’t reproduce accurately with computers, and you’ll always need real people to actually test your product before releasing it. However, people are busy, and it is hard to compete with the priorities of their own roles.

So to encourage people to use the app, we reimbursed their afternoon coffee break as long as they used the latest version to expense it. Whenever Expensify’s employees go to the local coffee shop, they just have to SmartScan their receipt to get it reimbursed. With this little trick, we’ve been able to detect multiple bugs that our tests didn’t catch.

Improve Bug Catching

We’ve also added a custom crash reporter to all platforms to collect JavaScript crashes on our servers. Each crash type creates a GitHub issue, so we know exactly whether it has been fixed and by whom. Our goal is to fix the top 5 crashes for each release.

[Image: mobileportal]

Each crash comes along with logs, user identifiers, device identifiers, and so on.

[Image: crashes]
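Grouping individual crashes into "crash types" and ranking them is what makes a "top 5" list possible. The sketch below shows one plausible way to do it; it is an assumption for illustration, not our reporter's real logic. In particular, the fingerprint (error message plus top stack frame) and the names `fingerprint` and `topCrashTypes` are hypothetical.

```javascript
// Hypothetical fingerprint: crashes with the same message and the same top
// stack frame are treated as one crash type (and would map to one issue).
function fingerprint(crash) {
  const topFrame = (crash.stack || '').split('\n')[0].trim();
  return crash.message + ' @ ' + topFrame;
}

// Count crashes per type and return the `limit` most frequent types.
function topCrashTypes(crashes, limit) {
  const counts = new Map();
  for (const crash of crashes) {
    const key = fingerprint(crash);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Array.from(counts, ([type, count]) => ({ type, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

// Usage with made-up crash reports.
const sampleCrashes = [
  { message: 'undefined is not a function', stack: 'at renderList (list.js:42)\nat show (page.js:7)' },
  { message: 'undefined is not a function', stack: 'at renderList (list.js:42)\nat resume (app.js:3)' },
  { message: 'Cannot read property id', stack: 'at openReport (report.js:10)' },
];
console.log(topCrashTypes(sampleCrashes, 5));
```

A coarser or finer fingerprint changes how many issues get created, so the choice is a trade-off between duplicate issues and unrelated crashes being lumped together.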

We’ve also added code to capture a snapshot of the screen at the moment a crash occurs, as well as the position of the last touch performed on the screen and the delay between that touch and the crash. This allows us to know exactly what happened and to reproduce the bug easily.

[Image: snapshot]
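Recording the last touch and its delay to the crash takes very little code. The following is a simplified sketch under stated assumptions, not the reporter we ship: `createTouchTracker` and the field names are hypothetical, the clock is injected so the example can run anywhere, and the screen snapshot itself is left out because it depends on platform calls.

```javascript
// Hypothetical tracker. `now` is a function returning the current time in
// milliseconds (Date.now in production, a fake clock in tests).
function createTouchTracker(now) {
  let lastTouch = null;
  return {
    // Call this from the app's touch handler.
    recordTouch(x, y) {
      lastTouch = { x: x, y: y, at: now() };
    },
    // Call this from the crash handler to build the report payload.
    crashContext(error) {
      return {
        message: String(error && error.message),
        lastTouch: lastTouch ? { x: lastTouch.x, y: lastTouch.y } : null,
        msSinceLastTouch: lastTouch ? now() - lastTouch.at : null,
      };
    },
  };
}

// Usage with the real clock.
const touchTracker = createTouchTracker(Date.now);
touchTracker.recordTouch(120, 480);
console.log(touchTracker.crashContext(new Error('list failed to render')));
```

A short delay between the touch and the crash suggests the tap itself triggered the failure, while a long one points to background work.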

Results

The 4.2.7 release was the first to reach 5/5 stars on both the App Store & Play Store during its 2-month lifespan. It was also incredibly stable: only 1,500 crashes over about 3,000,000 sessions, or roughly one crash per 2,000 sessions (about 0.05%). Users were happy, and so were we.

But most of all, implementing those processes has removed the heavy work that we were doing manually: linting, testing, and building. It gave engineers more time to focus on the real things: fixing bugs & implementing features.

Sometimes it’s worth stepping back to think about what could be improved, what you can delegate to computers, and what is holding your process back. This kind of introspection is essential for a team to improve its process, but sometimes it takes a fire to remember that.

And, well, if you want to experience the joys of YAPL (our rockstar cross-platform mobile framework) and fire fighting, feel free to send us an email at jobs@expensify.com. We’d love to hear from you!
