When is running in the cloud worth it?

Nov 30, 2023

Running software on cloud providers certainly is convenient, but it’s also really easy to run up the associated bill if you’re not consciously thinking about costs (something most of us, myself included, don’t do often enough). In the vein of “DevOps,” this has led some companies to continue the trend of “taking a term meant to emphasize having actual cross-functional teams, and slapping it on something utterly unrelated,” leading to the rise of something called “FinOps.” Basically, FinOps, or “Financial Operations,” is about incorporating business concerns (namely, cost) into the development process. The official site may say that it’s a “cultural practice,” but we all saw how well that worked out with “DevOps.” This is the type of thing that gets people to start arguing that you should get out of the cloud because you can save tons of money. Is that worth it, though?

The financial side of the coin

The financial argument over “to run in the cloud or not run in the cloud” comes down to something called capital expenses vs. operational expenses (also referred to as CapEx vs. OpEx). In short, capital expenses are a huge amount of money up-front, but then you get value from whatever you spent that money on for years and years, so you make the money back from savings over time. Operational expenses are the recurring bills you pay as you go, like a monthly cloud invoice. In fact, when you read about DHH talking about how much money they’re saving, he bases a lot of his calculations on a 5-year period, even though they’ve already bought their physical servers at the start of their cloud exit.
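To make the CapEx vs. OpEx trade-off concrete, here’s a rough sketch of the break-even arithmetic. Every number in it is made up for illustration (it is not DHH’s math or anyone’s real bill), and the function names are just ones I picked. It also ignores things like hardware refreshes, staffing, and discounting.

```python
def breakeven_months(hardware_cost, onprem_monthly_cost, cloud_monthly_cost):
    """Months until buying hardware beats renting, or None if it never does."""
    monthly_savings = cloud_monthly_cost - onprem_monthly_cost
    if monthly_savings <= 0:
        return None  # owning never pays for itself at these rates
    return hardware_cost / monthly_savings


def cumulative_costs(hardware_cost, onprem_monthly_cost, cloud_monthly_cost, months):
    """Total spend after `months`, returned as (on_premises, cloud)."""
    onprem = hardware_cost + onprem_monthly_cost * months
    cloud = cloud_monthly_cost * months
    return onprem, cloud


# Hypothetical figures, chosen only to keep the arithmetic easy to follow.
HARDWARE = 600_000        # up-front server purchase (CapEx)
ONPREM_MONTHLY = 10_000   # power, space, bandwidth (OpEx you still pay)
CLOUD_MONTHLY = 30_000    # equivalent cloud bill (all OpEx)

print(breakeven_months(HARDWARE, ONPREM_MONTHLY, CLOUD_MONTHLY))      # 30.0
print(cumulative_costs(HARDWARE, ONPREM_MONTHLY, CLOUD_MONTHLY, 60))  # (1200000, 1800000)
```

With these invented numbers the hardware pays for itself after 30 months, which is why the length of the window you evaluate over (5 years in this case) matters so much to the conclusion.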

But wait, isn’t the whole selling point of running in the cloud that you can save money over running on-premises? How does that work if people are saying they’re actually saving so much money by running on their own hardware? Well, there’s a variety of factors that go into what’s cheaper. The first is how consistent your resource needs are. Running in the cloud excels when the resources you need vary, the more the better. You can still save some money if you reserve instances, but those savings aren’t as big as the ones you get from just not paying for hardware you’re not actively using. But the driving force of that savings is the fact that you’re not spending hundreds of thousands to millions of dollars a year updating hardware (presumably, you’re regularly replacing at least some of your infrastructure every year). Instead, you just pay significantly less than an amortized server payment to run a VM or container when you need it.
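Here’s a small sketch of why variability matters so much. It compares two hypothetical days with the same total demand, one steady and one spiky. The rates are invented, and I’m assuming the cloud charges more per server-hour than an owned server costs when amortized, while owned hardware has to be sized for the peak and paid for around the clock.

```python
def cloud_cost(hourly_demand, price_per_server_hour):
    """Pay only for the server-hours actually used."""
    return sum(hourly_demand) * price_per_server_hour


def owned_cost(hourly_demand, amortized_cost_per_server_hour):
    """Own enough servers for the peak hour and pay for them every hour."""
    servers_needed = max(hourly_demand)
    return servers_needed * len(hourly_demand) * amortized_cost_per_server_hour


# Hypothetical demand, in servers needed per hour; both days total 240 server-hours.
steady = [10] * 24
spiky = [4] * 20 + [40] * 4

CLOUD_RATE = 0.50   # assumed on-demand price per server-hour
OWNED_RATE = 0.25   # assumed amortized cost per owned server-hour

for name, demand in (("steady", steady), ("spiky", spiky)):
    print(name, cloud_cost(demand, CLOUD_RATE), owned_cost(demand, OWNED_RATE))
# steady 120.0 60.0
# spiky 120.0 240.0
```

Same total work, same rates, opposite answers: the steady day is cheaper on owned hardware, while the spiky day is cheaper in the cloud because most of the owned servers sit idle for 20 hours.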

Capacity planning is another important factor in the “where should our servers live” debate. The better you’re able to predict the amount of work your system is going to need to do, the better you’re able to decide who should own the servers doing that work. If you choose to run in the cloud, a good forecast lets you accurately budget your cloud bill. The same forecast lets you accurately plan what server resources you’ll need on-premises. Where running in the cloud does have an advantage over your own physical servers is that it’s much more forgiving when your capacity planning comes in too low. You pay extra for that correction, but the compute resources will at least be there. Running your own hardware, you have to err high, which means it’ll take longer to make your money back from the investment.
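As a sketch of what “forgiving” means here, consider a forecast of 10 servers that turns out to be too low. In the cloud, the overflow spills onto pricier on-demand capacity. On your own hardware, the overflow just doesn’t get served. The rates and demand figures below are hypothetical, and the pricing model (a flat reserved rate plus a flat on-demand rate) is a simplification, not any provider’s actual scheme.

```python
def cloud_bill(actual_demand, reserved_servers, reserved_rate, on_demand_rate):
    """Reserved capacity is paid for every hour; overflow is billed on demand."""
    hours = len(actual_demand)
    reserved_cost = reserved_servers * hours * reserved_rate
    overflow_server_hours = sum(max(0, d - reserved_servers) for d in actual_demand)
    return reserved_cost + overflow_server_hours * on_demand_rate


def unmet_server_hours(actual_demand, owned_servers):
    """Server-hours of work that owned hardware could not absorb."""
    return sum(max(0, d - owned_servers) for d in actual_demand)


# Hypothetical: we planned for 10 servers, but demand grew past that.
actual = [8, 10, 12, 16]

print(cloud_bill(actual, reserved_servers=10, reserved_rate=0.25, on_demand_rate=0.50))  # 14.0
print(unmet_server_hours(actual, owned_servers=10))                                     # 8
```

The cloud bill goes up by the cost of 8 extra server-hours, but the work gets done. On-premises, those same 8 server-hours are the shortfall you would have had to buy extra hardware in advance to cover.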

The advantage of running on-premises is that once you have the servers, scaling up using that hardware is free – you’re just adding VMs/containers to servers you already bought. In the cloud, every act of scaling up your infrastructure also scales up your bill. This is the heart of the whole financial debate behind running your own servers or using servers in the cloud. This is also why capacity planning and workload requirements are so important. The more accurately you can plan your workload, and the smaller the range between your low and high ends, the better owning your own hardware is for you. The harder a time you have forecasting your needs and the more wildly they vary, the safer it’ll be to bet on the cloud.

How efficiently your code runs influences this too – running in the cloud makes it very easy to throw hardware at your problems, especially with the increased availability of managed services, where you don’t need to specify or reserve instances, and just effectively pay per invocation or request. While this can make it easy to run up your cloud bill, it can also give you the flexibility to handle sudden growth long enough to refactor your code for your new scale. On the other hand, the more you work at making your code efficient and lean, the better you’ll do on-premises, since you’re already very good at getting the most out of existing computing resources. You can easily convert this sort of efficiency into cloud savings as well, but doing more without having to buy new hardware generally yields more value from on-premises infrastructure.

The technical side of the coin

Of course, not every decision is driven by money. There are a lot of technical considerations that go into the decision to use cloud infrastructure instead of physical servers. First and foremost, having your own servers means running your own servers, which means operations. Do you have operations people on your team? At the very least, do you have an operations group that’s responsive to your development team’s needs?

Tooling around running applications on servers has improved dramatically since people started deploying to the cloud (ElasticBeanstalk used to be a major improvement over anything you could do on-premises). This applies not just to cloud infrastructure but also physical infrastructure. The big advantage to running in the cloud, particularly with VMs, is that cloud infrastructure encourages thinking of your infrastructure with a cattle mentality instead of a pet mentality, but with containers you can now bring that mindset back to your data center.

If you don’t have operations professionals on hand, the good news for you is that running in the cloud greatly favors development-heavy teams. IaaS lends itself to infrastructure as code, with your cloud provider handling the actual running of the hardware. Add in managed services, and you don’t even need to worry about actually running a lot of this stuff. You’re paying more over the long run, but that money effectively contracts an operations team for you if you don’t have anyone on hand to fill that role. The “we handle operations for you” benefits also apply if you do have an operations team, but it’s siloed off and working with them becomes a bottleneck. Write up some Terraform, run it on a cloud provider, and you’re “doing DevOps” (narrator: “Not really.”).

Going back to the point of containers bringing a lot of the cloud benefits back to your data center – that fact has really upended what running software looks like, and driven changes that have brought a lot of the technical benefits of running in the cloud to your data centers. Specifically, container orchestration software has really helped on-premises catch up to the cloud. We’re just as capable of managing containers as we are managing VMs. The benefit here is that the software for running containers is open source and available to install on your own servers. Orchestration tools like Kubernetes and ECS (and their non-AWS equivalents) offer a lot of the benefits that VMs had with tools like ElasticBeanstalk. Because you can install software to manage those containers in-house, you can embrace running containers and get a lot of the technical benefits of running in the cloud, without the cloud bills. I don’t seem to be alone in this belief either.

Neither making use of container orchestration nor running in the cloud absolves organizations from needing operations expertise; it just changes the focus of that expertise. In the cloud, that operational expertise needs to focus on keeping up with all the various components that cloud providers offer, how they fit together, keeping track of all that in your infrastructure code, and using all that information to help manage your cloud bill. On-premises, that operational expertise focuses on maintaining the container orchestration cluster, helping developers configure their projects to be deployed into your cluster (have you ever tried to wrap your head around Helm charts?), and ensuring that applications and services have the tooling developers need to run their software in production.

Making the choice between on-premises and in the cloud

Before you can make an intelligent decision about running in the cloud, you need to understand your workload and be able to accurately predict it far enough in advance that you could buy new hardware, get it delivered, and get it installed in your data center by the time you need it. This also helps with running in the cloud: not only is there the obvious savings of reserved instances, but provisioned services also tend to be cheaper than on-demand ones. So first things first, you need to understand your current load, and how it fluctuates, so you can predict it.
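What “predict it far enough in advance” might look like, in its simplest possible form, is fitting a straight line to your usage history and reading off the value at the end of your hardware lead time. This is only a sketch: a least-squares trend line is my stand-in for whatever forecasting you actually do, it ignores seasonality entirely, and the usage history and 3-month lead time are invented.

```python
def linear_forecast(history, periods_ahead):
    """Fit a least-squares line to `history` and extrapolate `periods_ahead` past the last point."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical peak cores used in each of the last four months.
monthly_peak_cores = [100, 110, 120, 130]
LEAD_TIME_MONTHS = 3  # assumed time to order, receive, and rack new servers

print(linear_forecast(monthly_peak_cores, LEAD_TIME_MONTHS))  # 160.0
```

If hardware takes three months to arrive, the number you need to buy for is the 160, not the 130 you’re seeing today.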

Once you’re able to gauge your needs, the next thing to consider is whether you have dedicated operations people that can manage your infrastructure, wherever it’s running. Without them, you’re better off running in the cloud, even if it’s more expensive. Ideally, the operations engineers responsible for a team’s infrastructure should be on the team itself, but that’s a whole other set of rants. But the gist is this – if you don’t have people who can manage the servers your applications and services run on, you shouldn’t be in the “running on your own servers” business.

The next question is how consistent your workload is. The more your infrastructure needs fluctuate, the better the value you can get from just running in the cloud. The emphasis here is on total workload, meaning total cores and memory – individual components can fluctuate as long as the sum stays steady. All your code is sharing the same infrastructure, so you’re focusing on bottom-line needs. Service A scaling up while service B is running normally, and vice versa, really enables you to get the benefit of shared hardware regardless of where it’s racked. Again, if the total requirements don’t vary much, that racking cost is cheaper on-premises.
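A quick way to check this for your own systems is to add up demand across services for each time slot and look at the peak-to-average ratio of the total, rather than of any one service. In this sketch the two services and their numbers are hypothetical, picked so that their peaks offset each other perfectly, which real workloads rarely do.

```python
def total_demand(per_service):
    """Sum demand across all services for each time slot."""
    return [sum(slot) for slot in zip(*per_service.values())]


def peak_to_average(series):
    """1.0 means perfectly flat; higher means spikier."""
    return max(series) / (sum(series) / len(series))


# Hypothetical cores needed per time slot by two services sharing hardware.
demand = {
    "service_a": [8, 2, 8, 2],
    "service_b": [2, 8, 2, 8],
}

total = total_demand(demand)
print(total)                                   # [10, 10, 10, 10]
print(peak_to_average(demand["service_a"]))    # 1.6
print(peak_to_average(total))                  # 1.0
```

Each service looks spiky on its own, but the total is flat, so 10 cores of shared hardware covers both. Sizing each service separately for its own peak would have meant buying 16.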

However you’re leaning in the “cloud vs on-premises” debate, you’ll want your code packaged and run in a format that lends itself to running anywhere. That means avoiding cloud-specific stuff like building for “functions as a service,” or going all-in on managed services. These are useful tools to outsource operations expertise, and handy if you’re committed to going all-in on the cloud, but are also big pieces of lock-in should you ever realize that you could be saving money on-premises.

It’s easy to view the debate about running in the cloud or your own data centers as being a purely economic argument, and there are significant points to be made in terms of the money involved. But there are also still technical benefits to running on IaaS providers that shouldn’t be ignored. The best thing you can do is get good at monitoring your compute usage, and develop the ability to predict resource needs. The second best thing you can do is eschew the managed services in favor of deployable components that can be run in the cloud or out. This gives you the flexibility to run anywhere, but relies on having operations specialists actively working with you – either as a platform engineering team or on your team as part of a DevOps culture. What’s important is that both options are actually viable now, and there isn’t really a “wrong” choice these days. Like everything else in development, it comes down to which variables you’re going to try to optimize for, and which ones you’re willing to sacrifice to do so.