Good logging is an art that I wished I had learned in school

General Commentary, Programming, Work Add comments

May 272016

You’re sitting there at work and word reaches you that your application is misbehaving. What’s the first thing you do? Pull up the logs and see what they say. And this is where you may wish that you had an entire class in school dedicated to the art of writing useful log messages. Good logging is an art, and it’s 1 that most of us don’t learn until we’re stuck trying to diagnose something in the wild with subpar messages to guide us.

The most important thing about logs is that you want to have enough information that you can easily see what’s going on with your application at any given instant so you can figure exactly what’s happening when support comes by to tell you there’s something going on. The downside is most of us are hesitant to load up on logging messages because we’re worried about cluttering up the logs. The end result is that we end up with wildly insufficient logging.

The reality is, after multiple rounds of debugging where step 1 winds up being push out an update with more logging messages so I can get a better idea of what’s going on, worrying about too many log messages is a waste of time. When the time comes to really dig through the logs, you’re going to want all the information you can get. Plus, it’s pretty easy to learn to filter out the log messages that mean everything’s fine. If you don’t believe me, tell me how distracting you find your applications startup logs. The answer is “not at all” because you’ve learned the expected pattern and the only time those log messages get your attention is when they break that pattern. The same is true of your regular application logs. You don’t care about specific messages unless your searching for the records of a particular user’s actions, or you notice something breaking the expected pattern. Those are the only messages that actually capture your attention anyways.

I’m finding that the biggest thing about good logging isn’t limiting the amount of data in your logs – it’s making sure your logs are somewhere easy for all the developers on your team to go through and making them easy to search. We’re using these logs to prove that the applications is working or to drill down into a specific set of events to try to see what happened. There are plenty of services that can help with that (the ELK stack comes to mind, but there are others out there), but they don’t do you any good if your logs don’t have enough entries to tell you anything.

In addition to making sure you’re logging as much information as you can, you also need to make sure the logs are both easily searchable and actually informative once you start focusing in on a particular set of entries. I’ve found that having a predictable pattern for log entries helps with this. I don’t just mean the standard log formatting stuff – the pattern you configure your log with so that when you make the call to actually log something each entry is normalized. I mean following a pattern in your log messages themselves. For example, we’re starting to make sure all our log entries follow the pattern {}: Your log message goes here where the “{}” is a placeholder for some type of unique identifier for the object in question. Now if a customer has a question about something, I can look up the unique ID for the data object(s) in question, then find the log entries corresponding to that user’s activities during the time window I’m interested in.

Here’s an example of where better logging practices would have saved me a lot of debugging trouble. First off I was looking into some jobs logic in response to a customer ticket, and I could see where the job started running, but couldn’t confirm that the job had finished. I was also seeing error messages in the logs shortly after each run of that particular job. The problem was, there were a lot of jobs running around the same intervals on that box, so I couldn’t be certain whether the stack traces I was seeing were the job in question failing, or something else. I had to push an update just to get an idea of where exactly the failure was occurring. It turns out that, yes, the exceptions in the logs were related to the job I was looking at, but I wasted valuable time pushing an update to try to learn that instead of writing and deploying a fix.

Another example is this – I had some gotten some reports of users not being able to successfully refresh some credentials. Now, I can look at the code and see the call being made to refresh credentials, pull the appropriate values from the configuration file, and see what’s getting returned, but there’s no line for that in the logs. Now I’m having to re-create the scenario to get the error message so I can research what causes it, instead of just reading the error message from the logs and jumping straight into the researching and fixing it part of the proceedings.

All of these thoughts on logging are a fairly recent development for me, and I have a lot of old code that has terrible logging associated with it. But, I am trying to make sure I have plenty of logging in the code that I’m touching now, with the hope that eventually logging will be improved all over, although that’s likely to be a long, hard slog. I’m committed to pushing through with it though. More logging isn’t really going to hurt anything, and if it makes debugging easier in the future, then all the better.

Right now, my general philosophy on logging is as follows:

INFO for enough debugging to verify that something’s running without failing. At a bare minimum, this includes a log entry for entering and leaving a method, as well as enough log statements to give me a basic idea of about where in my code things went wrong should an operation fail.
DEBUG for information that will help me debug an issue, but is only really useful once I’ve confirmed that an issue exists.
WARNING for things that didn’t work the way I intended, but I had a workaround for them (this is for things like handling API rate limits, for example)
ERROR for exceptions I catch (with stack trace).

We spend a lot of time debugging via println() when we’re in school, but we’re still really bad about putting those debugging statements into our code proactively, probably because in school all our projects are fire-and-forget. Once they’re turned in, there’s no maintenance involved, so we never learn what’s it’s like to have to go back to something weeks or months after we wrote it and figure out what’s going on with some particular user’s experience. That said, it’s an area of my programming that I’m trying to focus on right now, and hopefully it’ll start paying off in the next couple of months.