Sep 302016

Recently, we had a Kinesis consumer back up due to a deployment problem. Sadly, this problem went on for 2 days without us noticing, so we were pretty backed up. Now in theory, we should have been able to catch back up to real-time data sometime later that calendar day. In reality, we fell 2 days behind and it took us 2 days to catch up. That’s not acceptable, and I wanted to document the list of things we tried and how well they worked and to document our process for looking for the bottleneck. Obviously, there’s tons of room for improvement in the code, but there’s also a lot of room for improvement in how we were doing things before, and probably a lot of room for in how we went about trying to fix the issue.

The inciting incident

I had recently taken over an application that’s running on AWS OpsWorks. I hadn’t worked with it before, but I had a couple of changes ready to deploy that affected 2 layers of the project. I thought there was an option to deploy the whole application, all layers at once, so I ran it. When nothing seemed to happen, I shrugged it off and deployed the 2 layers individually.

What I didn’t realize at the time was that I had stopped all layers of the application, although the 2 most visible ones were re-started when I deployed them individually. The issue was that these layers handled all our event processing, emailing, and stats calculations for the application. The absolute worst part is that this was down for the better part of 2 days before anyone realized there was a problem and I re-started everything. I know post-mortems are supposed to be blameless, but this issue was totally my fault.

The good news here is, the system started recovering as soon as the other layers were restarted. The problem here was it was recovering slowly – very slowly. We had 1 service that consumed data from the same event stream as this application that fell behind a day during bad upgrade that recovered in about 1 hour. Our consumer was processing old events slightly faster than new ones were coming in – meaning it would be several days before we made up our 2-day gap.

Debugging the bottleneck

One of the first things we checked was for different Kinesis consumer settings between our consumer and the other services, but they were the same. Specifically, they were the Kinesis consumer defaults. So from this, we could conclude that the issue wasn’t with the settings or the connection to Kinesis. That was working fine on a host known to catch up quickly. Still, we did reduce the minimum time between calls from 1 second to 250 milliseconds to try to speed things up (by attempting to pull records from the stream 4 times per second instead of once per second). That didn’t seem to have any impact. Our issue was definitely not in the configuration, so now it was time to start looking somewhere else.

Next up we tried splitting the shards (2 consumers had to be better than 1), but we still had to wait for the data from the single shard to process through it’s 1 consumer before we could take advantage of the split shards, so while we had some benefits coming, we still had to slog through our existing backlog.

By the way, I did see consumption seem to catch up faster between about 5:30 and 8:30 PM most evenings, but I don’t have any definitive reason why. This was the time period when I was leaving work, going home, having dinner, playing with my son, getting him into bed, and getting ready for the next day. If other people were as busy during this time window as I was, then I’m guessing there were just fewer events being generated. Since I’m basing this on personal anecdotes though, there’s no guarantee I’m onto anything, but it was an interesting observation.

In the initial implementation of our Kinesis consumer, we processed each event individually. Our next step was to try to batch the writes to see if they’d help. Any impact we saw on the performance of our consumer was negligible. Batching up writes was probably the right thing to do, but also not our biggest performance roadblock.

Around this point we starting looking more closely at our database usage. We were seeing throughput in the neighborhood of 130 records/second. Since this database was hosted on Amazon RDS, that’s not even close to stressing the database. We may not have been the most efficient with how we were handling the database, but it was more than powerful enough to handle even our inefficiencies.

Around this point, we had ruled out hardware that couldn’t keep up, Kinesis consumer settings, and not batching our writes to the database, but we were pretty certain that our consumer code needed some touching up as the source of our woes. Now that we were circling the issue, we could start making tweaks that may just move the needle on our performance.

1 of the last-ditch efforts we made was to implement a quick, “good-enough” cache of reads from our local database. Basically, a hash map of IDs to objects that encapsulated was contained solely in the method that processed a batch of Kinesis events. I’m not sure how much that helped, but we were doing at least 2 database operations for every event, and cutting down on that and keeping things in memory could only help.

Around this point, the single stream I sharded had been processed and now we were also up to 2 consumers reading the backlogged events. This was where we started seeing the consumer really catch back up. Given how close this came to the caching in the event consumer, I’m not sure which had the most impact, but the good news is we were catching up at a reasonable rate. After 2 days, we were finally caught up and it was time to switch focus from firefighting to long-term fixes.

Least Valuable Performer

At this point, I’d like to award the Least Valuable Performer for this whole ordeal – and that was our deployment pipeline. We had a deployment job on a Bamboo server to push the various components of this application to the OpsWorks layer. Somewhere in in this process our release was getting screwed up badly enough that sometimes what was actually deployed was in fact an earlier jar, and we had to perform subsequent deployments to try to get it to somehow “find” the latest update and make sure that was live.

After the catch-up

Now that everything is back to normal, we’ve merged the Kinesis shards back together and started focusing on the underlying issues that plagued us during the catch-up period. First up, real caching in our Kinesis event consumer. A HashMap was a simple, “good enough for now” solution, but there was the potential of out-of-memory errors with an unbounded data structure. Luckily, 1 of my co-workers implemented a basic LRU cache service for another 1 of our libraries, so it was just a simple matter of generalizing that and bringing it in.

While we were at it, we needed some logging around our cache so we could monitor our cache hit and miss ratios. This way, we could be certain whether or not our caching was working, as opposed to just hoping for the best. From a code standpoint, the lack of caching was our major bottleneck. Not only were we able to solve it, but we could see the performance improvements and monitor them going forward. Given the fact that we had little to no visibility in our catch up progress, every performance metric we can watch during a catchup period is a big win.

Next up was fixing whatever glitch was in our deployment pipeline that was preventing us from reliably deploying the intended jar to OpsWorks. This turned out to be an issue in our build configuration – we weren’t forcing clean builds (why is that not the default option?) and that was resulting in the packager packaging all the versions of the jar on our artifactory during release – the jar that was running was whatever jar happened to be packaged last. 1 simple config change later and we were back on track on that front.

The biggest issue is that we needed better alerting. We found out about this issue days after the fact, and only because customers asked us what was going on. We had events failing to process for 2 days – 5 minutes of that should have resulted in a hobbit running over from the Shire with his head on fire screaming that we had a problem that needed dealing with right freaking now. That’s since been taken care of, so now we’ll know if there’s any issue on that front. After all, step 1 in dealing with falling well-behind in processing data – don’t fall so far behind on processing data.

Another, longer-term project would be to move our logs into to something more searchable and most easily monitored. Right now we’re using Cloudwatch for all of our logs, and it’s good enough for the basics, and offers some simple dashboarding, but it falls apart pretty immediately for everything past that. Searching Cloudwatch is rudimentary at best. The dashboards are OK, but it’s hard to view a dashboard for the same window as the logs I’m looking at. Several other engineering projects use the Elastic stack, and it’s something we’ve thought about migrating our logs to, but there’s been some concern about how much we’d write to the logs given how memory-intensive the whole stack is. But, that’s a concern for another day. Today, our logging and notification systems work well enough for now.

This backlog and the subsequent recovery exposed a lot of issues which we’ve since resolved. Now we’re processing events from Kinesis at a rate of 2-3 times that new events are coming in. For a database-intensive event processor, that’s not too bad. We had a few human errors deploying our cache, so our consumer update took 30 minutes instead of closer to 5. Once we got everything up and running, we were caught up with the event stream in 30 minutes.

Ultimately, we got a few good lessons out of this. The first, and most important, is that we can’t afford to skimp on notifications. As soon as we stood up the application, we should have made sure we had notifications for any catastrophic situation that would theoretically come up. We would have been on this issue minutes after it first appeared, which would have made catching up infinitely less painful.

The second is that we can’t rely on hardware to make up for inefficient design. Our original ingestion code was good enough to keep up with events in real-time, but not much else. We should have been planning for batches since the beginning. That way, catching up would have been a lot faster from the start. There will eventually come a day where we need to shard our input stream to keep up with the volume, but now when that day comes we’ll have already done everything we can to the code.

This whole backup then catch-up thing was frustrating, and painful. But we learned a good bit from the experience, and our software is considerably better for us having gone through and dealt with these problems now instead a particularly busy time of the year (like Black Friday and Cyber Monday). Hopefully, some of the issues we ran into will get you thinking about your code, and fixing some of these things (like insufficient monitoring not batching writes to disk, or not caching your database reads) before they become problems for you too.

 Posted by at 10:00 PM