September 2016 – Software and Other Shenanigans

Recently, we had a Kinesis consumer back up due to a deployment problem. Sadly, this problem went on for 2 days without us noticing, so we were pretty backed up. Now in theory, we should have been able to catch back up to real-time data sometime later that calendar day. In reality, we fell 2 days behind and it took us 2 days to catch up. That’s not acceptable, and I wanted to document the list of things we tried and how well they worked and to document our process for looking for the bottleneck. Obviously, there’s tons of room for improvement in the code, but there’s also a lot of room for improvement in how we were doing things before, and probably a lot of room for in how we went about trying to fix the issue.