I screwed up, you shouldn’t too.

General Commentary, Programming, Work Add comments

Nov 012013

1 of the things I work on regularly was originally created with a fairly consistent data set. We had a very good idea of what fields were going to be in the data. As a result, we thought we knew where exactly we needed to check for null data, and when we didn’t need to know. Then the data set expanded, and the new data wasn’t nearly as consistent as what we started with. That’s when we realized that we were very much wrong in thinking we knew what needed null checks.

The initial problem

The first thing we realized is that we should never have operated under the assumption any fields would consistently be present, and instead should have been null-checking everything. After a few NullPointerExceptions bug reports started coming in, we decided we should be more assertive about checking for nulls, so we decided to add checks throughout the code around the third bug report or so.

In the spirit of being assertive, I decided to set my assertiveness level somewhere around tinfoil-hat-paranoid. I checked for nulls before I referenced anything. I checked for empty data, because if it could have possibly caused an issue relating to no data being there, then I was going to stop it in its tracks, damnit.

How that worked for us

I did this, I ran the regression tests we had at the time, we put it into QA, then staging, then live in production. And then…we realized this was a cluster****. 2 issues popped up before we rolled the changes back. 1 was a typo on my part that prevented data that was actually there from appearing in results, the other was an empty check that actually broke functionality our users were relying on.

So where did I go wrong? First off, I made application-wide changes to deal with a hotfix. While everyone knew I made changes all throughout the code unrelated to the specific bug report that started this, I didn’t demand that this stop being treated as a hotfix and start being treated as just a straight-up update, nor did I treat this as anything other than a hotfix myself. We either should have patched NullPointerExceptions as we found them, or done a full update to address this, but pushing everything through as a quick hotfix was a disaster.

Remember when I said that I was checking for any possible situation where no data could conceivably trip the code up, null or empty? Empty wasn’t ever actually a problem. I don’t mean that in a “we either had null or objects with data in them” sense, I mean it in a “there were fields denoting no data that users were counting on that I stopped populating when I stopped processing in its tracks the second I saw empty objects” sense. The lesson here is simple, before you check for a problem condition be absolutely certain that it actually causes a problem.

Other ~~scars earned~~ lessons learned

The main thing I learned from this is to never trust the data. Not from a third-party, not from our databases. Assume it’s bad. Assume stuff that should be there, that has absolutely no logical reason for not being there, could be missing. Another thing I need to start remembering going forward is to assume that even if there is data somewhere I think it should be, it may not be the right data. Maybe it’s a string when I wanted an integer. Maybe it’s an double when I wanted an integer, or a string. Whatever the case, if I need a particular type of data, I’m verifying that type before I use it.

Going back and trying to be paranoid about your data is incredibly risky. Odds are, when the code was first written the data was fairly predictable, which is why these kinds of checks aren’t there now (after all, hardly anybody goes through and removes null checks). It’s also incredibly hard to test if you don’t control the data source yourself, or if you weren’t checking all along. You can’t really tell what all could be null or under what circumstances it would be null. You also don’t have the ability to arbitrarily stick null values into the data to create a test scenario yourself. Ideally this is why you’d be checking everything for null values from the beginning, but when you start with a fairly consistent data set.

Conclusion

It’s incredibly tempting to think of checking for null data as a very low-risk thing to do, but that doesn’t excuse rushing through testing and QA, especially if you’re doing it all over the place. Another thing to remember is that empty may not be nearly as much of a problem as null is. When checking for no data, you’ll need to think about the various types of “no data” you can encounter and just what the ramifications of each of them are before you stop trying to deal with them. Go be a better programmer than me, don’t do what I did.