My experiences with OpsWorks

Nov 27, 2017

Following what’s probably a very predictable course of operational maturity, my team at work started out manually uploading jars onto VMs, maybe with a few simple services, scripts, and setup tricks to keep the manual steps for deploying software to a minimum. As the amount of code we wrote and maintained grew, we started to focus on automating more and more of our code deployments, with an emphasis both on rolling deployments (so there’s no visible downtime to users), and on increasing reliability by reducing the number of steps we could possibly mis-type or forget. Since our code was heavily deployed on AWS anyways, OpsWorks seemed like the perfect setup for us. So far, while it isn’t actually perfect, it’s been a good tool for getting our app deployments and instance configuration more automated, which is what we really needed.

A brief primer on OpsWorks

If you’re not familiar with OpsWorks, it’s basically Chef-enabled EC2 instances. In this case though, AWS has already taken care of setting up the Chef server and the Chef clients on each of the EC2 instances you spin up. Given that the first thing you’d be taught to do when learning Chef is to set up a Chef server and client, not having to worry about automation infrastructure is an immediate win. Of course, you’ll want your own custom scripts, but including those consists of nothing more than packaging them up into a file on S3 and giving OpsWorks the link.
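To make the “give OpsWorks the link” step concrete, here’s a minimal sketch of the arguments you might pass to boto3’s `update_stack` call to point a stack at a cookbook archive on S3. The stack ID and bucket URL are made-up placeholders, and the helper name is just something for illustration, not anything OpsWorks provides.

```python
def custom_cookbook_params(stack_id: str, archive_url: str) -> dict:
    """Build keyword arguments for the OpsWorks update_stack call.

    Both arguments are placeholders supplied by the caller; nothing here
    talks to AWS.
    """
    return {
        "StackId": stack_id,
        # Tell OpsWorks to use our cookbooks in addition to the built-in ones.
        "UseCustomCookbooks": True,
        "CustomCookbooksSource": {
            "Type": "s3",        # the cookbooks are packaged as an archive on S3
            "Url": archive_url,
        },
    }


# Hypothetical values, for illustration only.
params = custom_cookbook_params(
    "my-stack-id",
    "https://s3.amazonaws.com/my-bucket/cookbooks.zip",
)
```

With real credentials and a real stack, you would hand these to the client with something like `boto3.client("opsworks").update_stack(**params)`. The same settings are available in the stack settings page of the console, which is the easier route for a one-off setup.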

OpsWorks is a collection of stacks – you may want to think of each of these as an individual service (or, in the context of my job, an individual application). Each stack has 1 or more apps – each of which is basically just a deployable unit. Your actual EC2 instances are organized into layers – which are basically sets of instances that have the same sets of Chef scripts run on them. For simplicity’s sake, I generally keep a 1-to-1 relationship between apps and layers, and that’s worked out pretty well for me, but your needs may vary.
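If it helps to see the stack/app/layer/instance relationship written down, here’s a toy model of it in Python. This is purely a mental-model sketch: the class and field names are my own and don’t correspond to any OpsWorks SDK types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    hostname: str
    status: str = "stopped"


@dataclass
class Layer:
    """A set of instances that all get the same Chef recipes run on them."""
    name: str
    recipes: List[str] = field(default_factory=list)
    instances: List[Instance] = field(default_factory=list)


@dataclass
class App:
    """A deployable unit."""
    name: str


@dataclass
class Stack:
    """Roughly one service: its apps plus the layers that run them."""
    name: str
    apps: List[App] = field(default_factory=list)
    layers: List[Layer] = field(default_factory=list)

    def instance_count(self) -> int:
        return sum(len(layer.instances) for layer in self.layers)


# A 1-to-1 app/layer setup with two instances (all names are invented).
web_layer = Layer(
    name="web",
    recipes=["myapp::deploy"],
    instances=[Instance("web1"), Instance("web2")],
)
stack = Stack(name="my-service", apps=[App("web")], layers=[web_layer])
```

In this toy setup, `stack.instance_count()` comes out to 2, one count per instance across every layer in the stack.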

The good

First off, like I mentioned earlier, OpsWorks is a good introduction to deployment automation. AWS has successfully abstracted about 90% of all the Chef stuff, and the 10% left is the specific stuff you need to make your particular app work. I know I say this a lot, but not having to manage infrastructure directly is always a huge win in my book. Easy-to-manage and set up infrastructure like this is probably the biggest reason to use AWS in my opinion. Need to scale part of your service out? No problem, just click the handy “Add Instance” button, pick your size, and start it up. Voila, a new machine is up and running, no additional set-up work required.

OpsWorks’s paradigm of putting instances into layers makes it really easy to visualize your deployed services and the resources allotted to them. Between organizing each service into stacks, and each component of those services into layers, you’re left with an intuitive and organized view of your entire infrastructure that’s far superior to just looking at the EC2 console.

The bad

Most of my complaints with OpsWorks read like a watered-down version of this blog post. I don’t think there’s anything “wrong” with OpsWorks per se, but OpsWorks does seem to behave fundamentally differently from how you would typically think AWS is used. As Fabrizio Branca noted, AWS is typically thought of as having a highly mutable, varying number of instances that spin up and wind down automatically based on configuration values. Manually creating instances, even the load-based ones, seems very “anti” AWS. Coming from a position where some portions of our apps were deployed on AWS’s Elastic Beanstalk, having to manually create load-based instances certainly has the psychological feeling of being a step backwards. That feeling is exacerbated whenever AWS talks about their newer services like Batch and Lambda, where AWS is abstracting away the idea of having instances in the first place.

Another issue I’ve run into is that OpsWorks instances can be slow to boot up. By “slow”, I mean it takes more than 5 minutes for an instance to go from “time to start” to “online.” When manually starting each instance, that’s really just an annoyance, but OpsWorks does offer “load-based” instances, and those need to be able to come online quickly. Slow boots generally don’t happen, but they feel particularly annoying the 5% or less of the time I encounter them. I recommend setting your scaling triggers to be slightly more sensitive than you would in a normal auto-scaling group, and load-testing any layer you need to dynamically scale to make sure those triggers are tuned correctly.
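As a rough illustration of “slightly more sensitive” triggers, here’s a sketch that builds the arguments for boto3’s `set_load_based_auto_scaling` call, with the scale-up CPU threshold lowered by a margin. Every number in it (the 80% baseline, the 15-point margin, the wait times) is an assumption for the sake of example, not a recommendation; load-test to find your own values.

```python
def load_scaling_params(layer_id: str,
                        baseline_cpu_up: float = 80.0,
                        sensitivity_margin: float = 15.0) -> dict:
    """Build kwargs for OpsWorks set_load_based_auto_scaling.

    baseline_cpu_up is the CPU threshold you might use in a regular
    auto-scaling group; the margin lowers it so scale-up starts earlier,
    leaving room for the occasional slow boot. All defaults are hypothetical.
    """
    up_cpu = max(1.0, baseline_cpu_up - sensitivity_margin)
    return {
        "LayerId": layer_id,
        "Enable": True,
        "UpScaling": {
            "InstanceCount": 1,        # instances to start per scaling event
            "ThresholdsWaitTime": 1,   # minutes over threshold before acting
            "IgnoreMetricsTime": 10,   # minutes to ignore metrics afterwards
            "CpuThreshold": up_cpu,
        },
        "DownScaling": {
            "InstanceCount": 1,
            "ThresholdsWaitTime": 10,
            "IgnoreMetricsTime": 10,
            "CpuThreshold": 30.0,
        },
    }


# Hypothetical layer ID, for illustration only.
params = load_scaling_params("my-layer-id")
```

With a real layer, these would be applied with something like `boto3.client("opsworks").set_load_based_auto_scaling(**params)`. Keep in mind that this only controls when the load-based instances you’ve already created get started and stopped; it doesn’t create new ones.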

One of the best things about AWS is that they have an API for everything (read Steve Yegge’s epic Google+ post on just why that’s a big strategic win). As a result, you don’t have to do everything from an AWS console (although it’s really easy to do so), but can instead manage your AWS resources directly from your application code if needed. One of the most basic examples of this would be spinning up EC2 instances, which we happen to do. This lets us define our own specific scaling policies without having to set up and pay for custom CloudWatch metrics. According to OpsWorks’ documentation, you can create instances under a layer through the API as well, but I haven’t actually tried that yet and can’t speak to how well it works. That said, I did have some experience trying to integrate auto-scaling groups into an OpsWorks layer, and that did not go well (more on that later).
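For reference, here’s a minimal, untested-against-a-live-stack sketch of creating and starting an instance under a layer with boto3’s OpsWorks client. The `create_instance` and `start_instance` calls are real client methods; the helper name, IDs, and instance type are placeholders. The client is passed in as an argument so the function can be exercised with a stub.

```python
def add_instance_to_layer(client, stack_id: str, layer_id: str,
                          instance_type: str) -> str:
    """Create an instance under one layer, start it, and return its ID.

    `client` is expected to behave like boto3.client("opsworks").
    """
    created = client.create_instance(
        StackId=stack_id,
        LayerIds=[layer_id],
        InstanceType=instance_type,
    )
    instance_id = created["InstanceId"]
    # Creating an instance only registers it with the stack;
    # it still has to be started before it boots and runs its recipes.
    client.start_instance(InstanceId=instance_id)
    return instance_id
```

Usage would look something like `add_instance_to_layer(boto3.client("opsworks"), "my-stack-id", "my-layer-id", "t2.micro")`, with the IDs swapped for real ones. Note that the call returns as soon as the start is requested, so anything that depends on the instance being online still has to wait for the boot.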

General notes from experience

OpsWorks comes with some monitoring built into it, but really you’re going to want to go with whatever monitoring/dashboard setup you’re already using. It’s going to end up being more granular and relevant than the basic graphs OpsWorks offers. The graphs in OpsWorks aren’t terrible, but they’re so extremely generic that anything else you’re using will offer better insight into your application’s performance.

The whole app/layer thing can be a little confusing at first, especially if you’re using them in a 1-to-1 manner (is there anyone out there not using them like that?). This seems to go away after a while, since once the app is initially set up I only really deal with creating instances under a layer and never really think twice about the app. Still, it seems like the type of thing that could have been combined into 1 thing, unless mine is just a really simplified use case.

I did try to integrate an OpsWorks stack with an autoscaling group at one point, but I found that it took too long for the instances to start up (by the way, instances running Amazon’s flavor of Linux tended to start up the fastest, but even then, it was too slow). Specifically, the autoscaling group had to create an instance, then associate the instance with the stack, and then, once that was done, associate it with the OpsWorks layer. I had the autoscaling group set to ignore the trigger metrics and the new instance for 5 minutes after an autoscaling event was kicked off, and that still wasn’t enough time to get the new instance booted, associated with the stack, and associated with the layer – let alone get the Chef scripts run. There’s probably a better way to do this using Lambdas and OpsWorks’s API, but I didn’t really have time to explore that option.

My OpsWorks wishlist

I think OpsWorks is a good service, and I’m sure AWS is working on some marquee improvements to it, but I have a couple of things I’m hoping they announce soon. First, I’d like to see the deployment automation supported by OpsWorks open up to more than just Chef (oh hey, look what AWS announced recently!). There are only 2 teams in my company that use Chef, and we’re having to use it because that’s what OpsWorks supports. Every other team at my office deploys using Puppet. Google’s a useful tool, but it’s also nice to be able to ping someone on chat for assistance, so the added Puppet support is a nice boon for us.

Next up, I’d like to see the average time from an OpsWorks instance starting to being online speed up. I’m sure that’s the type of thing AWS is working on all the time, and it isn’t likely to be accompanied by a big keynote announcement or major blog post. In general OpsWorks instances are up and running fast enough, but there’s still that edge case of a slow boot time that, when you hit it, makes for an extremely frustrating experience. That edge case may never hit 0, but the rarer it is, the happier I’ll be.

Lastly, I want to see autoscaling added as a true first-class citizen in OpsWorks, similar to how it is in Elastic Beanstalk. “Pure” AWS would mean that I don’t really know what instances are running, just that there are some online and serving my application. In an ideal world, I wouldn’t need to think about a specific instance unless I was trying to find an instance that I need to kill because it’s acting up, and to be honest, that sounds like the type of thing that can be outsourced to a script that can look that sort of thing up for me. This alone would solve the biggest drawback I’ve found with OpsWorks.
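The “script that can look that sort of thing up” could be as small as the sketch below, which lists the instances in a layer whose status isn’t `online`. It assumes a client that behaves like boto3’s OpsWorks client (`describe_instances` is a real method that accepts a `LayerId`); the helper name and the choice to treat only `online` as healthy are my own assumptions. Instances that are still booting will show up in the list too, so treat the output as a starting point rather than a kill list.

```python
from typing import Iterable, List, Tuple


def find_suspect_instances(client, layer_id: str,
                           healthy: Iterable[str] = ("online",)
                           ) -> List[Tuple[str, str, str]]:
    """Return (instance_id, hostname, status) for every instance in the
    layer whose status is not in `healthy`.

    `client` is expected to behave like boto3.client("opsworks").
    """
    healthy_set = set(healthy)
    response = client.describe_instances(LayerId=layer_id)
    suspects = []
    for inst in response.get("Instances", []):
        status = inst.get("Status", "unknown")
        if status not in healthy_set:
            suspects.append(
                (inst.get("InstanceId", ""), inst.get("Hostname", ""), status)
            )
    return suspects
```

Against a real layer this would be called as `find_suspect_instances(boto3.client("opsworks"), "my-layer-id")`, where the layer ID is a placeholder for your own.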

Overall, OpsWorks is a good managed deployment automation service. It does a good job of letting you focus on just the scripts/configurations that get your app going. It’s not perfect, but it does make getting started with automating your deployment a lot easier by taking care of the server/client setup for you. The documentation around OpsWorks is pretty solid, as one would expect from AWS services, which makes getting up and running on it pretty straightforward. The biggest drawback to it is just how poorly it incorporates automatic scaling, something that’s supposed to be an area where AWS really shines. That said, I am publishing this going into AWS re:Invent, so it’s entirely possible this could be getting addressed in the very near future. In the meantime, I still recommend using it unless you really want to get into automating the process of building your own AMIs and updating Elastic Beanstalk apps or auto-scaling groups yourself.