Best Practices and the Law of Unintended Consequences

Philip Sharp, January 28, 2014

At Expensify, everything we do is a balance. As a startup, we can’t build every feature we and our users want, or install as many servers as we can imagine. Sometimes, though, we see a change that won’t cost much (in time or money) and will benefit our users. Here’s the story of one of those times that didn’t work out as well as we hoped.

Most of our website is written in PHP. While there is some healthy debate among our engineering staff, most of us like PHP for its rapid development and ease of deployment. Our web servers use the Alternative PHP Cache (APC) to cache compiled code and speed up requests for our users. A few months ago we updated our web server configuration to use less memory for each PHP process. 

The original configuration looked like this:

fastcgi.server = (
    ".php" => ((
        "bin-path" => "/usr/bin/php5-cgi",
        "socket" => "/tmp/php.socket",
        "max-procs" => 96,
        "bin-environment" => (
            "PHP_FCGI_MAX_REQUESTS" => "500"
        )
    ))
)

This was a fairly standard configuration: it creates 96 PHP parent processes per server, each of which creates one child “worker” process. However, APC stores its cache in memory and creates one cache per parent process. That means we were caching the same data 96 times!

The best practice given in the documentation for both PHP and Lighttpd is to start one parent process and let it create all of the child processes. The child processes then share the parent’s APC cache.

The new configuration looked like this:

fastcgi.server = (
    ".php" => ((
        "bin-path" => "/usr/bin/php5-cgi",
        "socket" => "/tmp/php.socket",
        "max-procs" => 1,
        "bin-environment" => (
            "PHP_FCGI_CHILDREN" => "96",
            "PHP_FCGI_MAX_REQUESTS" => "500"
        )
    ))
)

This saved us the memory space of 95 extra caches, which we could use to support more users on each server.

Until, a few weeks later…

One Monday morning we found the site was running very slowly. When we checked our server logs, we found that the web server was being overloaded, and that the PHP processes could not handle as many simultaneous requests as they had previously. We reverted to the old configuration, and the site started responding better.

We kept researching to determine what the problem was. With the help of Apache Bench we tested several possible configurations and found a surprising pattern. Each server could handle a maximum of PARENTS * CHILDREN + 129 * PARENTS connections, where CHILDREN is the number of child processes per parent. This meant that our original configuration could handle 12,480 connections per server (96 * 1 + 129 * 96), but our new configuration could only handle 225 (1 * 96 + 129 * 1)!
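The arithmetic behind those two numbers can be checked with a few lines of Python. This is only a sketch of the pattern we observed, not a general model of Lighttpd or PHP; the function and parameter names are made up for illustration, and the constant 129 is simply the value our tests produced.

```python
def max_connections(parents: int, children_per_parent: int,
                    backlog_slots: int = 129) -> int:
    """Observed per-server capacity: PARENTS * CHILDREN + 129 * PARENTS.

    backlog_slots defaults to 129, the constant measured in our tests;
    it is an empirical value, not something derived here.
    """
    busy_workers = parents * children_per_parent   # requests being served
    queued = backlog_slots * parents               # requests waiting, per parent
    return busy_workers + queued


# Original config: "max-procs" => 96, one child per parent.
original = max_connections(parents=96, children_per_parent=1)

# New config: "max-procs" => 1, PHP_FCGI_CHILDREN => 96.
new = max_connections(parents=1, children_per_parent=96)

print(original)  # 12480
print(new)       # 225
```

Both configurations run the same 96 workers; almost all of the difference comes from the second term, which scales with the number of parents.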

More research showed that we were hitting the default limit on socket connections (128), which caps the backlog of waiting connections for each PHP parent process. Because we had fewer PHP parent processes, even though we had the same number of child processes, we had created a bottleneck that our traffic couldn’t fit through on a Monday morning. (Which is, we know, the best time to do your expense reports.)
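If you want to check a similar limit on your own servers, here is a minimal sketch. It rests on an assumption: this post does not name the setting that imposed the 128 limit, and the Linux kernel's net.core.somaxconn (which caps the backlog a process can request for a listening socket) is only one candidate. The helper name is hypothetical, and on systems without that file it just returns None.

```python
from pathlib import Path
from typing import Optional

# Assumption: on Linux, this file holds the kernel-wide cap on listen backlogs.
# It may not be the setting that applied in the incident described here.
SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn"


def read_somaxconn(path: str = SOMAXCONN_PATH) -> Optional[int]:
    """Return the backlog cap as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux system), unreadable, or not a number.
        return None


limit = read_somaxconn()
if limit is None:
    print("backlog cap not available on this system")
else:
    print(f"kernel backlog cap: {limit}")
```

Whatever the value turns out to be, the point from the formula still holds: the backlog is counted per parent process, so fewer parents means fewer waiting slots.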

The lesson: best practices may be a good idea but they can cause other problems, and free ideas can have costs.
