Sep 30, 2014

On August 1, 2014, Facebook went down. It came back after a few hours or less, but it was a visible reminder of their (now-former) motto of “Move fast and break things.” I made a joke about the issue, but I appreciate the philosophy, even if Facebook’s since tried to move away from it. I think it has a lot to do with their new model of “Move fast with stable infrastructure.” In fact, I think moving fast and breaking things is how they got their stable infrastructure.

“Move fast and break things” is a little bit of a misnomer; it’s really more like “move fast and find out what’s broken.” Here’s the thing nobody points out when they’re telling you to just ship stuff and fail fast. They talk about how that kind of thing lets you go ahead and get feedback from your users. What it really does is put the biggest, nastiest, most pervasive errors all over your logs, monitoring, and everything else that could possibly tell you about your application. That is, in my experience, the single best means of making sure things work.

I’m not trying to belittle the role of QA and testing during the development process, but there are tons of ways things can go wrong in production, and no QA is good enough to think of them all. Users do a lot of crazy stuff, some of it even intentionally. You can probably catch a lot of the simple stuff during the development process, but to truly see the myriad ways things can go wrong, you need to have something somebody is actually trying to use. The more data you have, the more options the user has, and really the more of anything in your software, the more ways it can go wrong. So once you have the obvious cases running smoothly, it’s time to see where the real problems are, as opposed to the hypothetical problems.

This philosophy has worked pretty well for a lot of companies. Anyone who followed Twitter during its early days probably remembers there was a period in the service’s history where users were often met with the “Fail Whale” instead of actually getting Twitter. At the time, Twitter’s infrastructure was nowhere near good enough to handle the type of usage it was actually getting. However, exactly what their weak links were and what needed to be fixed never really showed until the world at large had a chance to pound on Twitter. Sure, Twitter could have load-tested, but that only helps if Twitter can accurately estimate or overestimate its usage. What about invalid data (I don’t mean something like “abc” in an “Amount” field, but a negative number in a “Number of ____” field)? What about inconsistent data? What about poorly formatted data? These are all things you can try to test for, but any software product worth having is going to have more users than there were people testing it. Those users are the single most thorough, most complete, most accurate set of unit and load tests you could possibly have. Their reaction to the software is also the only acceptance criterion that matters.
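To make the “negative number in a ‘Number of ____’ field” case concrete, here is a minimal sketch in Python. The field, the function name `validate_count`, and the `ValidationResult` type are all made up for illustration; the point is only that a value can parse cleanly and still be nonsense.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    ok: bool
    reason: str = ""


def validate_count(raw: str) -> ValidationResult:
    """Validate a hypothetical "Number of ____" form field."""
    try:
        value = int(raw.strip())
    except ValueError:
        # The easy case most testers think of: "abc" is not a number at all.
        return ValidationResult(False, "not a whole number")
    if value < 0:
        # The subtler case: it parses fine, but a count cannot be negative.
        return ValidationResult(False, "count cannot be negative")
    return ValidationResult(True)


if __name__ == "__main__":
    for sample in ["7", "abc", "-3"]:
        print(repr(sample), validate_count(sample))
```

A check like this only covers the cases somebody thought of in advance, which is exactly why real users still turn up the ones nobody did.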

Of course, being willing to put stuff out and see what all breaks is all well and good, but now you have to fix it, and fast. After all, your software is broken and your users can clearly see that. But now you have a couple of things going for you: a completely accurate sense of what’s most important to fix, and a list of the stuff that technically works correctly but was designed wrong in the first place (missing features, stuff that doesn’t really solve a problem the users were actually having, etc.). The whole point of releasing software that you know isn’t perfect is to find those imperfections much more quickly and effectively. Monkeys on typewriters may eventually come up with Shakespeare, but random users will find a pretty exhaustive list of bugs surprisingly quickly.
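One rough way to turn production logs into that “what’s most important to fix” list is to count how often each kind of error shows up. The sketch below assumes a made-up log format of `<timestamp> <LEVEL> <message>` and invented sample lines; `signature` and `rank_errors` are hypothetical helper names, not part of any real tool.

```python
import re
from collections import Counter

# Assumed log line format: "<timestamp> <LEVEL> <message>"
LINE = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def signature(message: str) -> str:
    """Collapse digits so 'user 42' and 'user 97' count as the same error."""
    return re.sub(r"\d+", "N", message)


def rank_errors(lines, top=3):
    """Return the most frequent ERROR signatures, most common first."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") == "ERROR":
            counts[signature(match.group("message"))] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    # Invented sample data, purely for illustration.
    logs = [
        "2000-01-01T12:00:01 ERROR timeout loading feed for user 42",
        "2000-01-01T12:00:02 INFO request served in 120 ms",
        "2000-01-01T12:00:03 ERROR timeout loading feed for user 97",
        "2000-01-01T12:00:04 ERROR negative item count -3",
    ]
    for sig, count in rank_errors(logs):
        print(count, sig)
```

Frequency is only one possible ranking. Weighting by how badly an error hurts the user would be a reasonable next step, but even a plain count tends to put the most pervasive problems at the top.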

The sooner you fix those bugs and design flaws, the sooner you have a more stable, faster, more likable application. Not only that, but the whole time you’re getting that nice, positive feedback cycle of seeing the list of issues people are actually having shrink and shrink to somewhere between “not worth the effort” and “nothing.” It also lets your users see their concerns being responded to promptly, which makes them happy.

Lastly, blogging about the issues you had and fixed accomplishes a few good things for you. First, it documents what happened and how it was fixed, just in case somebody else runs into a similar problem. Second, it’s a very public signal that you’re actively working on the application, as well as where you’re going with the development process. Third, it’s a great way to do a post-mortem. You’re collecting and organizing the information you’ve since learned about whatever was broken, and your readers don’t care who exactly wrote what or which exact technical intricacies caused the problem. So a blog post on what happened keeps you focused on the original design of your broken code, why that design was broken, what you did to fix it, what you learned from fixing it, and how that’s going to be applied both to new development and to going back and fixing legacy code.

Nobody’s perfect, and there are all sorts of ways for software to go wrong. And while every test seems obvious in hindsight, the reality is the bugs in your code probably weren’t that obvious at the time. Instead of trying to anticipate all the possible failures before the software goes into the field, your best bet is to field the software and actually catalog the myriad ways things can go wrong, then spend the bulk of your time focusing on actual problems rather than trying to successfully hypothesize all your potential problems. The important thing to remember is that none of this matters if you don’t jump all over problems as soon as they come in. Failing to respond to problems found in production leads to users considering your software unreliable, and software that’s unreliable is software that’s unusable. Software that’s unusable is software that isn’t going to make anybody any money. On the other hand, focusing on real-world bugs found by real-world users gets you into the habit of dropping everything to fix bugs. In fact, it gets you into the habit of responding very quickly to user feedback, period. That’s the kind of thing that can make and keep you successful in the long run. Lastly, you should be publicly documenting all of this. People tend to remember things better after writing them down. Having the developer(s) involved in dealing with any significant (and maybe even non-significant) issue write a blog post on the topic cements what they’ve learned, in addition to creating a useful reference for other developers. It’s also a good way to communicate back to your users about your progress. You should fix any foreseeable problems with your software, but you shouldn’t spend time trying to divine all the possible problems when your users are more than capable of providing you a very exhaustive list instead.


 Posted by at 10:30 PM