Jul 23, 2015

A while back at work, we noticed a periodic issue where remote jobs just weren’t being run. This was happening because the jobs weren’t actually being sent to the remote AWS instances that were meant to run them. Just why the jobs weren’t being sent out was a mystery, but simply bouncing either the remote servers or the main application itself seemed to get things moving again. In the meantime, we were off on a hunt for why these jobs were no longer getting dispatched.

The setup

The architecture in question looks like this: we have a main application server, plus several tasks that are resource-intensive and/or need to be run very frequently. Rather than scale the main application server, we run all of these tasks on remote machines that we can spin up and spin down as necessary. Our app uses Quartz to send these jobs out to the remote servers periodically, as defined in our configuration. For jobs that run every 1-2 minutes on remote servers, this functions as a “heartbeat” to make sure they’re still alive and running; for other jobs, the dispatcher actually sends them out to be run (infrequent but CPU-intensive jobs that don’t need to take up capacity on a remote server the rest of the time, for instance).
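For context, here’s a rough sketch of what a Quartz-based dispatcher like this can look like. The class names, group names, and interval below are illustrative stand-ins, not our actual code:

    import org.quartz.*;
    import org.quartz.impl.StdSchedulerFactory;

    // Hypothetical dispatcher job: each time it fires, it checks the remote
    // workers and sends out any jobs that are due to run there.
    public class RemoteDispatchJob implements Job {
        @Override
        public void execute(JobExecutionContext context) throws JobExecutionException {
            // ...check remote server capacity and dispatch pending jobs here...
        }

        public static void main(String[] args) throws SchedulerException {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

            JobDetail job = JobBuilder.newJob(RemoteDispatchJob.class)
                    .withIdentity("remoteDispatch", "dispatchers")
                    .build();

            // Fire every minute, matching the 1-2 minute "heartbeat" cadence described above.
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("remoteDispatchTrigger", "dispatchers")
                    .startNow()
                    .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                            .withIntervalInMinutes(1)
                            .repeatForever())
                    .build();

            scheduler.scheduleJob(job, trigger);
            scheduler.start();
        }
    }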

The problem

The very first indication of a problem was when a customer contacted us to let us know they weren’t seeing the results of these running jobs. We checked the status of the jobs, and sure enough this customer’s jobs hadn’t run for about 2 days. We checked the logs, but didn’t see any obvious issues. Trying to learn more, we added some additional logging, updated the snapshot we were using as the source of our job machines, and bounced all the remote servers so they could be rebuilt using the updated snapshot. All of a sudden, the jobs were running again. Well, OK, at least we had some extra logging turned on to try to see what’s going on if this ever happens again.

A while passed, and sure enough, a customer reached out to us because jobs weren’t running again. Our added logging wasn’t giving us anything; jobs just flat out weren’t running. Continuing to dig through the logs, we eventually came to realize that our dispatcher code just wasn’t sending jobs out to the remote servers. From what we could see when looking at the status of the remote jobs, the jobs that were already on the remote servers appeared to be running just fine, meaning our problems were stemming from trying to put new jobs on remote machines.

Initial theories

We were using Quartz for scheduling and running our dispatches, which led us to the worrying possibility that there was a bug in Quartz – made more concerning because we were using it in a similar manner for another application. However, that other application had been running for a while with no such problem, so the odds of the issue being Quartz were slim. Still, given the situation, there was a chance an error was being swallowed somewhere deep in the bowels of Quartz’s logging, so we set our production logs to DEBUG to try to capture said hypothetical errors.

Another possibility was that there was a deadlock somewhere with the database we were using as a job store. If that was the case, we were hoping our debug logging would help expose that, although we weren’t totally sure how we would be seeing a deadlock with our job store, since it was in the same database that the rest of the application was using with no issue.

Further investigation

Days passed, and the issue reared its ugly head again. We eagerly checked the logs, hoping for a new lead, and saw…nothing. The good news was that we were able to confirm there wasn’t any error being swallowed up by Quartz; the bad news was that there wasn’t an explicit smoking gun we could pinpoint in the logs. However, the lack of data in the logs proved to be most informative. We were able to confirm that when the issue started, Quartz stopped picking up triggers during its periodic checks – in fact, Quartz had stopped running its status checks altogether. Since Quartz randomly not picking up job triggers is the type of thing that a) we’d have seen long before now, b) would have been found by the many other people using Quartz for their job scheduling and execution, and c) been fixed because that’s a pretty big bug for Quartz, we realized that Quartz’s failure to collect job triggers was a symptom of some other problem on our application server.

The culprit

In the end, our deadlock theory was right. What had wound up happening was that one of our remote processing boxes was thrashing and couldn’t respond to the RESTful call checking on its status. As a result, Quartz was hanging while waiting to see if a job machine had capacity, and thus stopped dispatching jobs. This explained why things would work again after bouncing servers – either we’d kill the problematic remote server, or restarting the main application would restart Quartz, removing the deadlock. It also explained the lack of log entries – deadlocked code doesn’t write to the logs.

The reason my code deadlocked while checking on the status of the remote server was that I didn’t have a read timeout set on my URLConnection. Java doesn’t set this value by default, so if your server hangs while responding to a request and you never set this value, then your code is going to hang too.
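For illustration, here’s a minimal sketch of the fix, assuming the status check is a plain HttpURLConnection call (the class name, URL handling, and timeout values are made up for the example):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public final class RemoteStatusCheck {

        // Illustrative values; tune them for your own environment.
        private static final int CONNECT_TIMEOUT_MS = 5_000;
        private static final int READ_TIMEOUT_MS = 10_000;

        public static int fetchStatus(String statusUrl) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(statusUrl).openConnection();

            // Both timeouts default to 0, which means "wait forever" - exactly
            // what let a thrashing remote box hang the dispatcher thread.
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
            conn.setReadTimeout(READ_TIMEOUT_MS);

            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        }
    }

With the timeouts in place, a remote box that stops responding makes the call fail fast with a SocketTimeoutException instead of parking the calling thread indefinitely.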

The solution

Luckily, for all the stress and aggravation this problem caused us, the solution was simple and straightforward – just set a read timeout when creating the connection to the remote server. This was a pretty frustrating bug to run down. That said, it did reiterate my belief that the odds of a third-party library being the source of your bug are near-zero – it’s pretty much always your code. It was also one of the rare instances where better logging didn’t help debug the issue, at least not directly: we had to treat the lack of logging as a gap in our own logging, not as evidence of the problem. That wasn’t great, but at the very least our logging got a little bit better out of the situation.

In the end, I learned an important lesson about making sure read timeouts are set on my URL connections. My sense of what I should include in my logging messages improved, so insufficient logging should be less of an issue going forward. I really don’t want to say that a lack of logging should be considered a sign of anything – usually, all a lack of information in the logs means is that your logging is insufficient. Adding more logging messages and changing our log level to DEBUG was still the right call to make, even if there wasn’t a smoking gun or stack trace in the logs this time.

It seems a shame that the biggest takeaway from all of this is to make sure a field is set on a Java object, but in the wrong situation that minor detail becomes a pretty severe problem. Here’s hoping my pain and suffering helps you write better code and avoid this issue when coordinating with remote systems in your own software.
