I got 99 problems, and environments are all of them

May 31, 2021

We’re all used to dealing with different environments in our code. On the surface, this just means that we need to make sure our code runs fine in all of our environments. But what about the other services that your code is consuming? If your reaction is that it’s the responsibility of the team writing and running that code to make sure that it’s up and running, you’re living in a fantasy. At some point, you have to write (and maintain) your side of the connection to those services, and if something goes wrong, you have to be able to show the issue isn’t on your end. Oh, this is assuming an exact 1-1 match between your environment and their environment (for example, your dev to their dev, your QA to their QA). Even if the external environment doesn’t use the same naming conventions as you, having each of your environments connect to 1 (and only 1) environment of an external service simplifies things exponentially. If people are asking that you update your code to point to other environments of your dependencies, your life gets complicated fast.

The problem

Like I mentioned, environment management isn’t that bad if everything is consistent across the same environment. In other words, service A in the QA environment only consumes data from service B’s QA environment. In that case, because each of your environments only ever talks to 1 possible environment for external services you consume, and that environment you consume from rarely changes, your biggest concern is just making sure your code works as intended. Where the problems start is when service A in the QA environment used to call service B’s QA environment, but needs to start consuming data from service B’s development environment for the next week, and then switch back.

In my case, service A is a Java Spring application, with properties files for each environment it runs in. There’s also additional information outside our control (things like IDs/secrets for external APIs) stored in environment variables that I don’t see or control. Those change for each environment of each external service, so I can’t re-use those environment variables, and instead have to bother another team to get those changed.

While we’re at it, there’s a UI for service A that also needs to call service B, because we get an iframe from them. Oh, actually, there’s 2 UIs, because we’re in the middle of a major overhaul, but we’re loading iframe data from service B in both. Those iframe URLs are environment specific. So, that means pointing service A to environment 2 of service B involves changing and making a build of 3 different projects, and coordinating with external teams to modify environment variables on the server.

It’s worth noting that service A is running on-premises, but we do have other, ancillary services running on AWS and are interested in migrating our existing code to the cloud. That’s not customer-facing feature work though, so it’s not a priority for anyone involved in deciding what stories need to go into sprints. And us being in the middle of a huge migration of our front-end isn’t any other team’s fault, but it is something we have to contend with. All-in-all, 1 of these environment switches represents most of a day, once I got used to the various environmental issues that pop up. It’s really a whole day once you factor in the fact that every time we do something like this there’s a proper smoke-test by QA to independently confirm everything’s set up correctly before whatever team asked for the change can actually use it.

Quite frankly, there has to be a better way to do this, and I intend to use this to organize my thoughts around the acceptance criteria and general architecture of what that better way should look like.

The constraints

First and foremost, developers have very limited direct access to the environments themselves. For cloud deployments, we can modify the Terraform script that builds the environments, but that’s the closest we come. That means any environment change needed to support switching the environments we consume data from has to go through the deployment process or be done for us by someone on another team.

The second major constraint is I don’t get to make decisions for other teams. I may be able to make a case for something on our team, but given that these problems revolve around using other people’s services, I need to focus on my end of the metaphorical chain.

Another constraint is that there’s a lot more flexibility with cloud deployments over on-premises deployments. Obviously, physical hardware is limited to what we have in a data center, so the environments we have are a) fixed and b) limited.

The last constraint is that we’re moving toward the GitHub flow model internally, so anything we do to manage environments will have to be consistent with that. I found this constraint has had a bigger impact on how I approached this problem than any other concern.

The problems

Thinking about what could be done to make this easier, there’s a few statements I’d like any solution to remove from circulation:

1. Is anyone using environment {X}?
2. Team/Group {Y} needs environment {X} from {A} to {B}, don’t touch it until after they’re done
3. Can you make a build for environment {X} that points to service {Y} in {other environment that isn’t X}?

I hate problems 1 and 2 in a cloud context because the whole point of running in the cloud is that it’s easy to spin up resources and spin them back down when you’re done. Given that we manage cloud resources with Terraform, it’s even less excusable to be unable to run something due to an environment being unavailable – spin 1 up and do what you need to do.

Problem number 3 isn’t specific to any code deployment system per se. In fact, it’s mostly in how code configuration works in Java. Generally speaking, we include configuration options in configuration files in our codebase (which is nice because we have version control and history on them). The downside is, changing a configuration option requires a pull request and a full rebuild before you can push it. The problem here is that this change really shouldn’t take a full “build” – nothing about the code or the logic is changing, just a configuration option. We can get most of the way to solving this problem by making sure our configuration files always check an environment variable for the value, and only hold default values for when the environment variable is left empty. If your code only checks the environment variables in question at start up, you’ll still have to restart the service, but at least you don’t have to make a branch and run a full build first. Configuration changes should really only require a re-run of the deployment, not editing the code.
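To make the "environment variable first, checked-in default second" idea concrete, here's a minimal sketch in plain Java. The class name `EnvConfig` and the variable name `SERVICE_B_BASE_URL` are made up for illustration, not anything from our codebase. In a Spring properties file the same effect is usually available through a placeholder with a default, something like `${SERVICE_B_BASE_URL:https://qa.service-b.example}`.

```java
import java.util.Map;

/** Hypothetical helper: prefer an environment variable, fall back to a checked-in default. */
final class EnvConfig {
    private final Map<String, String> env;

    EnvConfig(Map<String, String> env) {
        this.env = env;
    }

    /** Convenience factory that reads the real process environment. */
    static EnvConfig fromSystem() {
        return new EnvConfig(System.getenv());
    }

    /** Returns the variable's value, or defaultValue when the variable is unset or blank. */
    String get(String name, String defaultValue) {
        String value = env.get(name);
        return (value == null || value.trim().isEmpty()) ? defaultValue : value;
    }
}
```

Something like `EnvConfig.fromSystem().get("SERVICE_B_BASE_URL", "https://qa.service-b.example")` would then pick up an override on the next restart, with no branch and no rebuild. Passing the map in through the constructor is only there so the lookup can be tested without touching the real environment.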

Possible solution – service registry

Before getting into solutions, let’s define a working example that’s analogous to what’s going on. I’m going to focus on our cloud deployments because that’s where we’d have the greatest flexibility. Right now we have a fairly standard environment “list:”

Development (unstable)
Development (stable)
QA
Various pre-production environments (for things like staging, load testing, etc.)
Production.

It’s worth noting that these environments are spread out across multiple cloud accounts. Development (unstable) in 1, all other non-production accounts in another, and then there’s production. From a network standpoint, the basic rule is that non-production accounts and environments can all talk to each other (so any version of my code outside of production can interact with any version of a service outside of its production environment). This is the part that’s caused every single problem I’m trying to figure out how to solve.

The obvious (and most likely best) solution here would be using a service registry of some sort. All the examples I saw looking for a good explanation mentioned needing to know the IP address and port of the service, but the reality is we’re all calling load balancers via hostnames these days. Where a service registry would help us would be mapping accessible versions of the services we could call with the details needed to make calls against them. The trick then becomes engineering how we get details from this service registry so that we can change which version of a service we call without having to do an entire rebuild and redeploy, but instead just (in the worst-case scenario) change an environment variable and re-start. We wouldn’t be rebuilding the code, just re-running a deployment with an updated environment variable value added.
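As a rough sketch of that mapping, here's an in-memory registry keyed by service name and environment. It isn't based on any particular registry product. The class name, the `<SERVICE>_ENV` naming convention for the override variable, and the hostnames are all assumptions I'm making for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory registry: (service, environment) -> base URL of its load balancer. */
final class ServiceRegistry {
    private final Map<String, String> entries = new HashMap<>();

    private static String key(String service, String environment) {
        return service + "@" + environment;
    }

    void register(String service, String environment, String baseUrl) {
        entries.put(key(service, environment), baseUrl);
    }

    Optional<String> lookup(String service, String environment) {
        return Optional.ofNullable(entries.get(key(service, environment)));
    }

    /**
     * Picks which environment of a service to call. An override variable such as
     * SERVICE_B_ENV (assumed naming convention) wins; otherwise we use the
     * environment we are running in ourselves.
     */
    Optional<String> resolve(String service, String ownEnvironment, Map<String, String> env) {
        String overrideVar = service.toUpperCase(Locale.ROOT).replace('-', '_') + "_ENV";
        String target = env.getOrDefault(overrideVar, ownEnvironment);
        return lookup(service, target);
    }
}
```

With this shape, pointing QA at service B's development environment for a week would mean setting `SERVICE_B_ENV=dev` on the deployment and restarting, then removing the variable to switch back. A real registry would sit behind a network call rather than a `HashMap`, but the lookup contract would look much the same.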

The downside to this is that I don’t make decisions for other teams (I don’t even make many decisions for my team), so I can’t compel anyone to set up a service registry or make other teams register their running services with it. At most this is something we could set up and manage ourselves. That’s better than nothing and, once we get environment variables we look for to use the various production services we interact with, at least makes switching which versions of services we consume something we can do without having to interrupt other people, but that’s not much.

Most importantly, this idea does nothing whatsoever to solve problems #1 and #2. Environments can still be tied up and block work. It’s a partial solution that would require other work on our part. There’s also the issue of sensitive values we usually use environment variables to hold. In theory, these values can be stored alongside the rest of the environment-related data, but there are likely some security concerns related to keeping that data with other, less sensitive data. We could likely use a secrets manager to handle that part for us – storing only the keys we need to look up and pass in as environment variables to the running process.

Possible solution – configurations in the database

We’ve had some luck putting some custom configuration overrides in the database, so in theory we could make that the standard practice for all our configuration data. Pulling all configuration data from the database would, in theory, enable us to make changes to the project setup without having to redeploy, or even restart, anything to pick up the changes. We have the write access to non-production databases needed to change things on the fly for various environments, so again, we can handle all of this ourselves. Not being able to change these values in production isn’t nearly as big of a deal, as they don’t generally change. There’s the issue of sensitive data that various operations-related teams may not want visible to developers with read access to production data. Like before, that can be solved by using a secrets manager as a datastore, although for both visible and sensitive configuration values we’ll need to think about the best caching policies to minimize database lookups for values.
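One simple caching policy would be a time-to-live cache in front of whatever does the actual lookup. The sketch below is only that: the `loader` function stands in for a database query or secrets manager call, and the class name and TTL value are assumptions for the example, not something we run today.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical read-through cache in front of a slow configuration store. */
final class CachedConfig {
    private static final class Entry {
        final String value;
        final long loadedAtMillis;

        Entry(String value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Function<String, String> loader; // stand-in for a database or secrets lookup
    private final long ttlMillis;
    private final LongSupplier clock;               // injectable so expiry can be tested
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    CachedConfig(Function<String, String> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value while it is younger than the TTL, otherwise reloads it. */
    String get(String key) {
        long now = clock.getAsLong();
        Entry cached = cache.get(key);
        if (cached != null && now - cached.loadedAtMillis < ttlMillis) {
            return cached.value;
        }
        String fresh = loader.apply(key);
        cache.put(key, new Entry(fresh, now));
        return fresh;
    }
}
```

In production code the clock would be `System::currentTimeMillis`. The trade-off is that a changed value can take up to one TTL to show up, which is probably acceptable for non-production environment switches but is worth deciding deliberately for secrets.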

The issue here is, once again, we don’t solve problems #1 and #2. Also, we load a lot of this data once, on startup, for bean creation. To be able to pull from the database whenever we need property values would take a significant re-write of the application logic, and I’m not completely sure we can even do it for everything we have in properties files or environment variables now.

Possible solution – deployment tool customization

The last option is customizing our existing deployment tools and/or writing some custom tooling for deployments. Even assuming “custom tooling” can be as simple as some scripts, that idea just reeks of “not invented here.” There are lots of companies that write lots of custom tooling for their services, but we’re not big enough to make a blanket declaration that the only way to resolve our needs is to write our own custom infrastructure. The reality is the deployment tooling is managed by another team, so we’d either have to get a lot of changes focused around our non-production needs deployed, or write our own tooling to completely own this part of the process ourselves.

The downside to the “just own it ourselves” philosophy is that we’re still responsible for shipping features for our actual product, so anything we write needs to be essentially maintenance free, otherwise we’re going to constantly struggle to find the time to keep important infrastructure running while simultaneously trying to ship features and customer-facing bug fixes.

This idea isn’t all downside – this is the only line of thought that addresses problems #1 and #2 – customizing (or re-writing) our deployment tooling at least provides us with something that could address limited environments and the bottlenecks presented. Remember when I said trying to think about what would work with the GitHub flow wound up having a big impact on how I thought about deployments? The natural solution to problems #1 and #2 is a solution that lets our team spin up and tear down environments on-demand. Specifically, it makes more sense to get rid of the traditional environment structure (development, QA, etc.) and instead define our environments based on the top-level feature branches that are going to be merged back into main.
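If environments are named after branches, something has to turn a branch name into a name that's safe to use for cloud resources. Here's one way that could look. The `env-` prefix, the 40-character limit, and the allowed character set are assumptions I picked for the example, since real limits depend on which resources get named.

```java
import java.util.Locale;

/** Hypothetical helper that derives an environment name from a Git branch name. */
final class EnvironmentNames {
    private static final int MAX_SLUG_LENGTH = 40; // assumed limit, not a real provider constraint

    /** Lowercases the branch, collapses anything that is not a-z or 0-9 into single hyphens. */
    static String fromBranch(String branch) {
        String slug = branch.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
        if (slug.isEmpty()) {
            throw new IllegalArgumentException("Branch name has no usable characters: " + branch);
        }
        if (slug.length() > MAX_SLUG_LENGTH) {
            // Truncate, then drop any hyphen left dangling at the cut point.
            slug = slug.substring(0, MAX_SLUG_LENGTH).replaceAll("-+$", "");
        }
        return "env-" + slug;
    }
}
```

A deterministic name also makes teardown easier, because the script that removes an environment can derive the same name from the branch being deleted. Two long branch names that share their first 40 characters would collide after truncation, so a real version would probably need a short hash suffix.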

“But QA needs a separate environment to test that isn’t the same environment developers are working in!” Not really, QA is generally done on whatever branch you’re merging peer-reviewed code into. Active development (as in, “I’m writing this code now”) is still done in a feature-level branch and not where QA is testing. “But, what about a pre-production environment for final testing and pre-release activities? Those can’t happen in an environment that’s still being actively developed and tested!” No, but the defining feature of that stage in the development process is that you already stopped actively working on that branch anyways. At this point, the only code that should be going in is a fix to a showstopper bug. The feature branch should already represent what’s going into production, so why not use an environment that’s already built from it?

The whole problem is that at this part of the process, there are so many different teams that needed our software but with data from different versions of different services, and to do that I had to make actual, separate builds of the same code. In addition, if someone in the process needed our code but using data from a different service, that environment was locked to everyone else until that testing was done. For most use cases, the “main” environment should be fine. For anything non-standard, a custom environment can be created on an as-needed basis.

The biggest concerns here are data and updating potentially multiple environments if the code changes. With data, I’m using the phrase different environments literally – each of these environments will have their own database. That means we’ll need a mechanism for seeding these environments with data. The easiest solution would likely be taking backups of databases in the main feature branches and letting developers pick which branch to load a backup database from (or even start with an empty database). If we don’t want to wait for data to load, we can just set a configuration option to point to a preexisting table, but that would cause problems with our deployment flow in non-production environments.

The other potential issue can best be described with this scenario – let’s say we have a feature branch we’re getting ready for a release, that’s deployed in a non-production cloud account, but also has some one-off builds for custom data sources used by external teams for their testing. Now suppose we find a bug that needs to be fixed before the release. No problem, some quick coding and now we’ve peer reviewed and pushed to the feature branch. We also need a way of showing that the one-off deployments need to be updated too. Any system we write is going to have to be integrated with our build systems, just like the deployment tools we’re using now, that are already being actively maintained by people whose job it is to support and run tools that make developers’ lives easier.
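The bookkeeping for that scenario doesn't have to be complicated. As a sketch, assuming we record which branch and commit each deployment was built from (all names below are invented for the example), finding the one-offs that need a refresh is a simple comparison against the branch head:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of which commit each environment was deployed from. */
final class DeploymentTracker {
    private static final class Deployment {
        final String name;
        final String branch;
        final String deployedCommit;

        Deployment(String name, String branch, String deployedCommit) {
            this.name = name;
            this.branch = branch;
            this.deployedCommit = deployedCommit;
        }
    }

    private final List<Deployment> deployments = new ArrayList<>();

    void record(String name, String branch, String deployedCommit) {
        deployments.add(new Deployment(name, branch, deployedCommit));
    }

    /** Names of deployments built from the branch whose commit differs from the branch head. */
    List<String> staleDeployments(String branch, String headCommit) {
        List<String> stale = new ArrayList<>();
        for (Deployment deployment : deployments) {
            if (deployment.branch.equals(branch) && !deployment.deployedCommit.equals(headCommit)) {
                stale.add(deployment.name);
            }
        }
        return stale;
    }
}
```

The hard part isn't this logic. It's getting the build system to call `record` on every deployment and to check `staleDeployments` after every push, which is exactly the integration work described above.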

What to do

The reality is, thinking about this situation as 1 problem is going to lead me down a rabbit hole of reinventing a wheel that already (mostly) works. Breaking this up into 2 major problems, scripted environment creation/destruction and service management/dynamic configuration, is likely the best path forward. My original problems #1 and #2 can be solved with some additional scripting that lets us create environments as we need them, with some added support to automatically remove the resources when we’re done. The part that’s going to take the greatest amount of work is going to be enabling environment switching.

The “just run our own service registry” idea seems to be the best starting place here, but it needs heavy modification. I doubt we’d have the ability to run this in production (and wouldn’t need to, production will always point to production environments), but a service registry could easily work as a dependency injector to overwrite values used in our application. We could default to whatever’s in the existing configuration file, with a flag to overwrite the values with data from the registry if certain variables are present during deployment. These would never be set in production, but could be configured during a deployment to insert other values instead. We get the flexibility we need in non-production environments, and the current setup continues to work without us having to change production.
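Here's roughly what that overlay could look like, as a sketch rather than a design. The flag name `USE_REGISTRY_OVERRIDES` is made up, and the choice to only overwrite keys that already exist in the configuration file is my assumption. It keeps the registry from introducing properties the application has never heard of.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical overlay: file defaults, optionally overwritten by registry values. */
final class ConfigOverlay {
    static final String FLAG = "USE_REGISTRY_OVERRIDES"; // assumed flag name, never set in production

    static Map<String, String> effective(Map<String, String> fileDefaults,
                                         Map<String, String> registryValues,
                                         Map<String, String> env) {
        Map<String, String> result = new HashMap<>(fileDefaults);
        if ("true".equalsIgnoreCase(env.get(FLAG))) {
            for (Map.Entry<String, String> entry : registryValues.entrySet()) {
                // Only overwrite properties the configuration file already defines.
                if (result.containsKey(entry.getKey())) {
                    result.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return result;
    }
}
```

Without the flag, the result is exactly what the configuration file says, which is the property that lets production stay untouched.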

At least, this seems like the “right” solution in a blog post rambling about the problem and brainstorming ideas. The truth is it’s probably, at best, the right starting point for a longer-term solution. Ideally we’d use service registries throughout the organization, but setting up something internally to make our lives easier is a good start. For any of this to work our services would have to be refactored heavily, and we’d need a lot of assistance from tooling teams, so it’s hard to say if this route is even feasible, but that’s the downside to one team trying to build something that really deals with multiple teams. Software doesn’t run in a vacuum, and we have surprisingly little room to act completely autonomously. The truth is the hardest part about software development isn’t the actual coding problems, it’s all the other people and their concerns that you need to make sure you’re incorporating.