From the “stuff I screwed up” files – the scourge of the automated retries

General Commentary, Java, Programming, Stuff I Screwed Up, Work Add comments

Jun 302021

We all screw up, it’s part of life. What’s important that we learn from it all. I’ve found it’s generally better to document the learning, both to help internalize it, and to share it with who I can. I recently worked an issue stemming from automated API call retries that didn’t go nearly as well as it probably should have, and I wanted to take a little time to post-mortem on how everything went, and how it could have gone better. A lot of this is fairly specific to what I was doing with that particular problem, but there’s hopefully some general lessons that can be drawn from the whole ordeal.

The problem

The basic problem is this, we have a service using Spring Retry. It’s a useful tool for preventing network blips from causing problems and generally works well. We have a scenario where we send a call out to an external service with a unique request ID to perform a service for a user, and sends an email confirmation on success. For the ideal user experience, we needed to be able to tell users definitively, “This operation succeeded,” or “This operation failed,” (so no Schrodinger calls). The problem was this, in a QA environment, our call to this service timed out. Because we didn’t get an HTTP OK response, we sent a message to the UI that the operation failed, and displayed an error message. So far, so good. We’re calling a QA environment (usually), so sometimes it’s got some issues or there’s a deployment going on. But then…we got the confirmation email that comes from a processing success. Apparently, the service was actually working, but slower than a standard 10-second timeout. In fact, this service got multiple calls to perform the service, which would cause problems for our users in production (in fact, another part of our application that called this service was starting to generate these user problems around this point). So, we needed to make some changes to try to make sure these requests weren’t being processed multiple times.

Fix attempt #1

So our first instinct was to see if there was a way to cancel these calls when they timeout, and explicitly retry. We reach out to the team that owns the service and confirm they have a cancellation endpoint. Success! 1 developer updates our Spring Retry configuration to skip automatic retries on those calls, and I write some logic that catches the timeouts ourselves, then send a cancellation request for our call, and retry it ourselves (with a new request ID). Sounds simple, right? Well, it was supposed to be. In my hotfix, I missed 1 line, to reset the request ID when I resubmitted it. I also didn’t add a log entry to capture the request going out, so when we monitored our systems, we saw that a retry request was going out, but not the details around it. I move the “fix” through the pipeline, and it’s deployed to production during the next push, only for us to continue to see the problem in production. In other words, we put in a lot of engineering effort to…re-create the problem, but this time manually!

Lessons learned

The most infuriating part of that failure was that the line that ould have caught this in development, before it even went to QA (where it passed, because they saw retry calls going out too), was this:

logger.debug("Calling theService with {}", requestObject);

Logging that output would have revealed that the request IDs weren’t different between calls, and this would have been fixed immediately. So the first lesson here is “log everything.” Even if you aren’t capturing DEBUG level logs in production, they can still help expose useful information during development and testing. There are tons of tools out there for filtering and searching through logs, so the only time you should worry about clutter in the logs is when there is there’s information in there that is counter-productive (stack traces from a non-error, for example).

This problem passed unit tests because in my tests I was mocking the call to the service with:

Mockito.when(externalService.processRequest(Matchers.any(RequestObject.class)))
    .thenReturn(myResponse);

This brings me to the second lesson learned from this failure – test all your assumptions. This doesn’t always require an explicit assert statement, sometimes you can test it just by being more specific in the parameters to your mock call. If I had a test that only worked when the first call had the request to the service had the request ID that I created in the initial mock, I could have prevented this failure even if I never added the log statement above.

This first release went out as a hotfix, so there was an emphasis on speed. As a result, I let the existing test that just confirms that we call the service with a request object be “good enough.” I didn’t add any tests that explicitly confirmed that the retry process behaved exactly the way we needed it to. That brings me to the last lesson learned from this stage of the saga – take the time to write the tests, even if you’re trying to push a hotfix out the door (probably especially then – nothing’s worse than a hotfix going out that doesn’t actually fix the problem correctly). Just because there’s a bug in production doesn’t mean it’s a good idea to skip steps designed to increase overall quality.

Fix attempt #2

So I went back, changed the unique request ID to…well…be unique on every request. I updated the logs to confirm it. I updated the unit tests to be sure the retries do what I was assuming they’d do. We’re certain it works, so we ship to production and…it didn’t work. You see, back when we wrote the fix and tested it, we were just testing our service, so we had to mock a timeout response in our unit tests, and I would simulate a timeout for QA by just flipping the success and timeout logic, i.e. giving QA a build that used

if (response.getStatus() == Response.Status.REQUEST_TIMEOUT || 
    response.getStatus() == Response.Status.GATEWAY_TIMEOUT) {
    // Do the success logic
} else if (response.getStatus() == Response.Status.OK) {
    // Do the timeout logic
}

instead of:

if (response.getStatus() === Response.Status.OK) {
    // Do the success logic
} else if (response.getStatus() == Response.Status.REQUEST_TIMEOUT || 
    response.getStatus() == Response.Status.GATEWAY_TIMEOUT) {
    // Do the timeout logic
}

That was successful in confirming that the stuff that’s supposed to happen when we received an HTTP timeout response actually happened the way we wanted, The problem was when we pulled the affected calls out of Spring Retry we stopped getting HTTP timeout responses. We got the SocketTimeoutExcpetions that Spring Retry was catching for us. So the retry logic worked when it was getting called, which was never.

Lessons learned

The biggest lesson here was that there’s nothing like a live reproduction of the issue. In this case, rather than simulate a timeout, the solution was to work with our QA team to create a timeout by using a proxy server we already use for testing to intercept the call to the other service, wait 10 seconds, then forward the call to the external service. As far as our system was concerned, the call was timing out (because it never saw a response within 10 seconds), and the logic could trigger, which we were able to verify was correct. Simulating locally is still good for a quick sanity check, but nothing beats a live failure.

Fix attempt #3

So at this point, everything’s working, we triggered a real failure, confirmed we were sending cancellation requests for each failed try, and that the retry attempts all had unique IDs. Finally, the problem is solved…from our end. It turns out the service we were calling can only process cancellations of completed requests, and the cancellations could be hitting before the original calls causing the timeouts finished processing (the external service was still within our company, but they were calling a 3rd-party provider, and that call was where we saw delays). So we were sending cancellation requests, but they weren’t doing anything. In other words, the service endpoint we thought we could use to address our issues utterly fails in the scenario we were trying to fix. The entire premise of our whole solution was wrong!

Lessons learned

The lesson here was that when trying to use endpoints from another team’s service to try to resolve an error on your end, make sure the other team understands the problem and get confirmation that what you want to do will solve the problem. You don’t have this luxury with 3rd-party APIs but when you’re dealing with other services from within the same company, take advantage of having the ability to talk to the developers.

Overall lessons learned

The first lesson learned is that I need to change how I think about failure. A failure for me isn’t necessarily a failure for the system I was calling. In this case my request completed successfully, just very slowly. What do we do when performance is highly variable? Thinking about this over the course of working on this led me to this answer – the systems that tend to work the best with the fewest issues tend to be built around eventual consistency. Eventually consistent systems aren’t bothered by slowness because eventually consistent designs handle it naturally. If you design your end of the process well, you can treat external services being down as just a longer than normal delay. For example, you could just write the data you need processed into a queue. At that point, you can call your part “done,” (well, really “submitted,” but you get the idea), and you don’t need to wait on a response from external services.

If the other teams in your company don’t like the whole “let’s use queues” idea, you could do something similar yourself, by having the call to your back-end just write the data you’re sending to the other service in a database, and having an offline, background process pull data from the database and send it off to the external service. For the record, a queue is a better solution, because this is the problem queues were built to solve, but using your database as a “queue” works in a pinch when you have to do the whole thing on your side.

Another important lesson is to make sure you understand the SLAs of the services you’re using. If it’s a long period of time, you shouldn’t commit yourself to synchronously waiting for a response. Performance is a feature, and if other services can’t enable you to guarantee performance, it’s best to know that at the design phase, where you can figure out a solution that hides that slowness from users. While you’re at it, if you’re making a call to another service that’s going to do something for a user (place an order, charge money, etc.), understand what happens to that call when the SLA is broken, and what your options are for correcting it.

These next points aren’t really “lessons learned,” from this particular process, but this whole ordeal is a good reminder that when code is broken out into multiple services, a failure in 1 service, does, in fact, impact other services. I know that’s heretical to some of the “microservices or bust” evangelist crowd, but it’s a reality. This was a problem that users noticed, and contacted our product’s support team to get fixed, even though the “problem” was something we had little control over. In fact, it was something we couldn’t really do anything about other than extending timeouts on 1 particular service call, which still left the the root problem (slowness) visible, even if it at least hid the error messaging.

Lastly, this whole episode illustrates the downside of multi-service projects within a company, namely that issues occurring at the “borders” of services are particularly hard to address. Did we own it, because our application was where users were impacted? Or is it the other team’s problem because their API calls were so slow? Do we both try to come up with our own independent fixes on each of our ends? Who owns what the final experience looks like for the user? Do we, because it was in our application? This other service has an API used by multiple teams, should we really be able to dictate a solution on their end? This is a real challenge in service-oriented companies, and it’s not 1 with a good answer. Services have boundaries, and the most likely place to have issues is when these services are actually talking to each other. I’m not sure what other places do to resolve conflicts like these, but it seems worth learning more about.

Debugging bugs that aren’t caused by actual errors are hard, and trying to debug issues that resolve around calling another service can be particularly tricky to reproduce exactly. Mocked calls in unit tests are a good start, but nothing beats an integration test that completely reproduces the issue. But the truth is debugging problems still require understanding all the pieces involved, including what the services you’re calling can and can not do. Just be careful that by the time you get that understanding, you don’t realize too late that your real problem was a design problem.