Jun 30, 2023
 

OK, so the phrase “caching was a bad choice” isn’t exactly something you expect to hear about software development. Generally, caching is a good thing – it improves performance by reducing calls to your database or external services, saving time and resources by not re-querying or re-computing. What’s not to love? Well, as with everything else in life, the devil’s in the details, and if you don’t get those right then anything can become a failure.

The background

We have a page whose content is a combination of data from multiple sources, each piece of which is acquired via a series of sequential, synchronous REST API calls. Each call is reasonably quick on its own, and the processing involved in collating the data from the various sources isn’t bad, but all in all it still takes a few seconds to load the page. In the interest of improving performance, we did the first thing anyone does – throw a cache in front of it. To do this, we used Ehcache’s heap cache. Before I got too far along implementing this, I had a meeting with the team that ran our application servers to check how much memory was available. No problem, there’s 10’s of gigs of memory on those servers. Great! I started slow, with just a 1GB heap cache. We deploy, the application keeps working, “mission accomplished.”
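For reference, standing up an on-heap cache like this with Ehcache 3’s builder API looks roughly like the sketch below. The cache alias, key type, and PageData value type are placeholders rather than our actual code; the 1GB figure matches what we started with.

```java
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;

public class PageCacheSetup {

    // Placeholder value type standing in for our real page model.
    public static class PageData {
        public String content;
    }

    public static void main(String[] args) {
        // Build a cache manager with a single on-heap cache capped at 1GB of heap.
        CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
                .withCache("pageCache",
                        CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                String.class, PageData.class,
                                ResourcePoolsBuilder.newResourcePoolsBuilder()
                                        .heap(1, MemoryUnit.GB)))
                .build(true);

        Cache<String, PageData> pageCache =
                cacheManager.getCache("pageCache", String.class, PageData.class);

        // The page-building code then does pageCache.get(...) / pageCache.put(...)
        // around the expensive REST-and-collate work.
    }
}
```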

More caching!

Over time, I had noticed from the debug logging in our test environments that a lot of entries weren’t getting added to the cache due to space constraints. In addition, we needed data from these individual REST calls in other places. While implementing that latter change, I decided to cache the individual components of these REST calls and have the aforementioned page build its contents from the individually cached components. The thought process was that at least some of the data used to build this page would be cached, reducing how often we had to wait on fetching everything, and the data would already be sitting in the cache for the other places where I’d end up needing it. Win-win. While I was in there, I updated our application settings to bump the cache up to 4GB; after all, there were “10’s of gigs of memory” on those servers, and we weren’t using them for running the application.
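Conceptually, the restructured flow looked something like the sketch below: each component gets its own cache entry, the page is assembled from whatever mix of cached and freshly fetched pieces we end up with, and other callers can reuse the same entries. This is a simplified stand-in (a plain map instead of Ehcache, hypothetical type names), not our actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Simplified sketch of per-component caching; ComponentData, PageModel, and
// the fetcher are hypothetical stand-ins for our real types and REST client.
public class ComponentCachingSketch {

    record ComponentData(String key, String payload) {}
    record PageModel(List<ComponentData> components) {}

    private final Map<String, ComponentData> componentCache = new ConcurrentHashMap<>();
    private final Function<String, ComponentData> restFetcher; // stands in for the synchronous REST call

    public ComponentCachingSketch(Function<String, ComponentData> restFetcher) {
        this.restFetcher = restFetcher;
    }

    public ComponentData getComponent(String componentKey) {
        // Serve from the cache when possible; fall back to the REST call on a miss.
        return componentCache.computeIfAbsent(componentKey, restFetcher);
    }

    public PageModel buildPage(List<String> componentKeys) {
        // The page is collated from individually cached (or freshly fetched) components.
        return new PageModel(componentKeys.stream().map(this::getComponent).toList());
    }
}
```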

Deployments to our various testing environments went fine, the application ran fine, everything looked good. Time to go to production, a deployment that ran without issue. Moving on!

A (random?) server crash

A few days later, another service was experiencing intermittent connection issues, and the ticket happened across our team as support tried to figure out who exactly owned the service in question. Eventually the service went down altogether, causing an “all hands on deck” situation to work out exactly what was going on and what the root cause was. It turns out that while we didn’t own the service in question, it was running on the same hardware as our application, and our application had run out of heap space, causing the CPU on the server to spike. One reboot later, and everything appears to be working. The team running the application servers starts working on getting a heap dump to us for analysis, and we all move on…for about 3 days.

3 days later, the exact same thing starts happening again: same service showing issues, same root cause. We still don’t have a heap dump, thanks to unrelated issues in getting the files to us, so at this point we’re guessing; but some examination of the app server leads the team administering it to note that the word “cache” keeps coming up. At least we now know the pattern for this issue, and the team administering the server is able to put together an alert so they can reboot the server as soon as the symptoms appear while we investigate and resolve the underlying issue.

Remember those gigs of memory?

Around this point I’m able to get a heap dump, and during a call discussing the recurring issue (which happens every 3 days or so, surprisingly regularly), I mention that there doesn’t appear to be a single major culprit, as the largest individual item in the heap was only 500-something megabytes. That’s when someone from the team managing our app servers mentioned that applications on that server only get 2GB of heap space.

Wait a minute – what happened to “10’s of gigs of memory?” Well, the memory was there, but not all of it was available as heap. Multiple teams had services running on that hardware, and each application was limited to 2GB of heap. You can see how that might cause an issue when the cache is suddenly expanded to 4GB. OK, so the memory available for the cache wasn’t as big as expected; we’d just revert back to the 1GB we initially had, which had seemed to work just fine. One hotfix later, and we’ve deployed to production. Problem solved…for about 3 and a half days.

The issue was still happening. Apparently our application hadn’t been running just fine on 1GB of caching after all; we just never noticed, because the individual entries were so large that very few of them actually fit, so the cache never filled up. In other words, our excessively-sized cache had never caused an issue before because we weren’t really using it. At this point the writing was on the wall – we didn’t have enough heap space to make an in-memory cache worthwhile, so we decided to pull the cache entirely and consider ways of improving performance elsewhere. I make the change, and it actually stays fixed this time.

So, what did we learn from this?

Well, first of all, not all memory is created equal. We should have confirmed how much of the memory on those servers was actually available as heap for our application. That would have told us that space was limited, and likely prompted us to drop caching and look at other options for building this page faster.
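A cheap sanity check along these lines is to compare the configured cache size against what the JVM actually reports as its maximum heap at startup. Runtime.maxMemory() is a standard JDK call; the 4GB figure and the “more than half the heap” threshold below are just illustrative.

```java
// Startup sanity check: does the configured cache even fit in the heap this
// JVM was actually given? Cache size and threshold here are illustrative.
public class HeapSanityCheck {
    public static void main(String[] args) {
        long configuredCacheBytes = 4L * 1024 * 1024 * 1024;   // the 4GB we configured
        long maxHeapBytes = Runtime.getRuntime().maxMemory();  // what -Xmx actually gives us

        System.out.printf("Max heap: %d MB, configured cache: %d MB%n",
                maxHeapBytes / (1024 * 1024), configuredCacheBytes / (1024 * 1024));

        if (configuredCacheBytes > maxHeapBytes / 2) {
            // On our servers this would have fired immediately: a 4GB cache vs. a 2GB heap.
            System.err.println("WARNING: cache is configured to use more than half of the available heap");
        }
    }
}
```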

The next lesson is that we had insufficient observability into our cache. I had seen some messages in a debug log indicating that items failed to be put into the cache, but we don’t log at the DEBUG level beyond QA. I should have been capturing enough data around the cache and its usage to track hit rate, utilization, and the number of times items never got put into the cache at all due to size constraints. That would have told us up front that the original on-heap caching wasn’t doing anything.
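Even a thin wrapper with a few counters would have surfaced most of those numbers. The sketch below is one way to do it, with names of my own choosing; precisely counting items rejected for size would take extra work (e.g., an eviction listener), so this only covers hits, misses, and puts.

```java
import java.util.concurrent.atomic.AtomicLong;

import org.ehcache.Cache;

// Hand-rolled cache metrics: wrap the cache and count hits, misses, and puts
// so they can be logged or exposed via a metrics endpoint. Illustrative only.
public class InstrumentedCache<K, V> {
    private final Cache<K, V> delegate;
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();
    private final AtomicLong puts = new AtomicLong();

    public InstrumentedCache(Cache<K, V> delegate) {
        this.delegate = delegate;
    }

    public V get(K key) {
        V value = delegate.get(key);
        if (value != null) {
            hits.incrementAndGet();
        } else {
            misses.incrementAndGet();
        }
        return value;
    }

    public void put(K key, V value) {
        delegate.put(key, value);
        puts.incrementAndGet();
    }

    public double hitRate() {
        long h = hits.get();
        long total = h + misses.get();
        return total == 0 ? 0.0 : (double) h / total;
    }

    public long putCount() {
        return puts.get();
    }
}
```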

Asking the right questions up front is hard, but we could have mitigated that failure by making sure we were asking (and, more importantly, answering) good questions during our implementation. Capturing usage data would have answered the question “does this caching even work?”, and that could have led to more testing that would likely have revealed there just isn’t enough heap space to cache data in-memory. That last point especially burns, but it’s a good (albeit painful) reminder that “checking your work” means more than just “Does it run?” It also means verifying “Does this actually solve the problem?” – especially in production.
