Aug 31, 2022

I’ve been thinking a decent bit about architecting services lately, and kept coming back to the question of how useful it would be to make “general infrastructure” services (like an offline job processor, or a gateway for capturing client events from your web application) shared resources, versus making teams deploy and manage their own instances of those services in the broader context of their own work. Re-using existing services has a lot of appeal, but it’s really something that needs specific conditions to succeed. That said, advances in cloud provider functionality have made implementing a lot of these services easier, and my general approach to building applications has changed as a result.

What are “general infrastructure” services?

For the sake of this discussion, I’m defining “general infrastructure services” as “services that solve a non-domain-specific problem and are used as components of larger services.” So these would be services that are useful in multiple applications or as a component in multiple services. One example from my experience would be a simple service that ran offline data processing jobs. This service basically used Quartz to run instances of scheduled jobs on EC2 instances. It kept track of which jobs were actively running and on what instance, made sure no instance had too many jobs trying to run on it, scaled up the number of instances when needed, and capped the total number of instances to protect our AWS bill. Most of the logic was in sending general Quartz status changes to our development admin console so we could see the state of jobs. Another example, one that wasn’t duplicated over multiple services but easily could have been, was a simple set of endpoints that took in some JSON and promptly wrote the data to an event stream. Event streams are another good example of general infrastructure services that are ripe for data sharing, come to think of it.
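To make the bookkeeping in that job service a little more concrete, here is a minimal sketch of the placement logic: a per-instance job limit, scale-up when every instance is full, and a hard cap on instance count. This is not the original Quartz-based code. The names `JobPlacer`, `Instance`, `max_jobs_per_instance`, and `max_instances` are all made up for illustration, and "launching" an instance here just creates an in-memory record instead of calling AWS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """Stand-in for one worker machine (hypothetical; no real EC2 call is made)."""
    name: str
    jobs: List[str] = field(default_factory=list)


class JobPlacer:
    """Assigns jobs to instances, scaling up until a fleet-wide cap is hit."""

    def __init__(self, max_jobs_per_instance: int, max_instances: int):
        self.max_jobs_per_instance = max_jobs_per_instance
        self.max_instances = max_instances
        self.instances: List[Instance] = []

    def place(self, job_id: str) -> str:
        # Only instances with spare capacity are candidates.
        candidates = [
            inst for inst in self.instances
            if len(inst.jobs) < self.max_jobs_per_instance
        ]
        if candidates:
            # Prefer the least-loaded instance that still has room.
            target = min(candidates, key=lambda inst: len(inst.jobs))
        elif len(self.instances) < self.max_instances:
            # Scale up: this is where a real service would launch a machine.
            target = Instance(name=f"instance-{len(self.instances) + 1}")
            self.instances.append(target)
        else:
            # The budget cap wins; the caller has to queue or retry the job.
            raise RuntimeError("instance cap reached; job must wait")
        target.jobs.append(job_id)
        return target.name

    def finish(self, job_id: str) -> None:
        # Free the slot so later jobs can reuse it.
        for inst in self.instances:
            if job_id in inst.jobs:
                inst.jobs.remove(job_id)
                return
        raise KeyError(job_id)
```

With `JobPlacer(max_jobs_per_instance=2, max_instances=2)`, the first four jobs fill two instances and a fifth is rejected until one of the running jobs finishes. Nothing in this logic knows or cares what the jobs actually do, which is exactly what makes a service like this look like a candidate for sharing.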

What I’m not talking about is running multiple, unrelated services on the same hardware, e.g. running 2 different apps on Kubernetes. Running multiple services on the same hardware is fine (or multiple databases for that matter). I’m focusing solely on re-using smaller services that you’re using to compose other services.

How would using shared general infrastructure work?

Because these services have no service-specific or domain-specific aspect to them, they could pretty easily be made into generic services that everyone simply plugs into and uses without having to manage their own deployment. In theory, you can re-use infrastructure (specifically hardware) and save on expenses. It would also reduce the operational load of running a service, since you would just be responsible for the parts that are highly specific to your team’s particular domain, and the shared infrastructure is already up, running, monitored (I’m assuming), and supported. It’s something that likely sounds familiar to people running things on-premises, where adding servers and capacity has very real, and significant, costs.

What are the problems with shared general infrastructure?

The biggest problem, and primary deal breaker, with shared general infrastructure is that it increases the coupling between teams. Even if the shared general infrastructure is self-serve, if your service is likely to add any significant amount of traffic you’re going to need to meet with the team that owns it to make sure that they’re able to scale up for it. Your service could push them to the caps they had on resources to control their budget (it’s not like you’re a paying customer they can use to afford more instances or a bigger cluster). Since the team that owns the shared service in question is ultimately responsible for what’s running on it, they’ll likely have some questions about your code and what it does. Oh, and you need to make sure something running in their service has permissions to modify your data, which blurs the logical boundaries between the services and increases the coupling between them. Those boundaries blur even more when you add in the fact that you need to monitor your code running in the other service.

At this point it’s tempting to say “We’ll just slap a (simple CRUD) API on it and make it self-service,” but that’s ultimately a bad practice that leaves you open to problems later, or, at best, a lie. Even with pretty strict guardrails on the shared service to help protect it (and every other service that’s using it), there’s still just too much interaction between your resources and the shared general infrastructure to make shared general infrastructure operate like a normal utility.

Another problem with shared services is that they’re an anti-pattern for service-oriented architecture: an operational issue in a shared service can cause operational issues across the board. Shared infrastructure leads to shared problems once things go wrong. That’s a huge part of why we started breaking applications up into services and paying more attention to resilience against other services going down.

So why even consider shared general infrastructure?

Like I mentioned earlier, combining load onto a shared set of infrastructure can save costs. The key to working that out is tracking the actual usage of your services, confirming they don’t peak at the same time, and confirming that you can support the actual combined maximum on significantly less hardware than you would need to deploy these services multiple times.
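The arithmetic behind that check is simple enough to sketch. Separate deployments each have to be sized for their own peak, so the total is the sum of the individual peaks, while a shared deployment only has to be sized for the peak of the combined load. The function below assumes you already have usage samples for each service taken at the same points in time. The function name, the service names, and the numbers are invented for illustration.

```python
from typing import Dict, List


def peak_savings(usage_by_service: Dict[str, List[float]]) -> Dict[str, float]:
    """Compare capacity for separate deployments against one shared deployment.

    Each list holds one service's load (say, busy workers) sampled at the
    same timestamps as every other service's list.
    """
    series = list(usage_by_service.values())
    if not series or len({len(s) for s in series}) != 1 or not series[0]:
        raise ValueError("need at least one service and equal-length, non-empty samples")

    # Separate deployments: every service is sized for its own peak.
    separate = sum(max(s) for s in series)
    # Shared deployment: sized for the worst combined moment.
    shared = max(sum(point) for point in zip(*series))

    saved_fraction = 0.0 if separate == 0 else 1 - shared / separate
    return {"separate": separate, "shared": shared, "saved_fraction": saved_fraction}


# Hypothetical samples: two services whose peaks land at different times.
example = peak_savings({
    "reports": [2, 8, 3, 1],
    "events": [7, 2, 6, 2],
})
```

In this made-up example the separate peaks add up to 15 units while the combined load never exceeds 10, so sharing would need about a third less capacity. If both services peaked in the same sample, the two numbers would be equal and the cost argument for sharing would disappear.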

Generally speaking, a shared general infrastructure service is going to be written and maintained by its own centralized team. That way you’re not trying to run your own services solving your team’s business problems and utility services for others. It also means that you don’t have to run that shared service, which reduces the operational load you have on your code.

Is shared general infrastructure even worth it?

In my opinion, not as a general rule of thumb. Due to the lack of ownership over non-trivial infrastructure related to your service, and the associated lack of control, there’s not enough upside to re-using infrastructure on similar, but unrelated items. Given the proliferation of managed services on all the major (and smaller) cloud providers, there’s even less reason to write the sort of generic, undifferentiated heavy lifting-style services that you could even consider making into shared general infrastructure.

The exception to this particular rule is scenarios where you need a service to be a single source of truth, like an event stream or a service registry. In that case, because all other services need access to the exact same data, having one common piece of infrastructure that everyone depends on and re-uses adds value to an organization.

It is possible to go overboard with the “deploy the same thing for different services” route. For a lot of teams looking to build microservices, there’s a common push to emphasize the “micro” instead of focusing on good logical boundaries. For example, a team might build different services to process different versions of what is fundamentally the same data (say, an online order versus an order placed over the phone). What’s the difference? Well, other than a different point of origin and a few minor details, nothing really. The correct way to go would be to set up an alternate route into the main data pipeline and make any necessary changes to the code to accommodate the fact that some fields are optional. Any field that doesn’t appear in all records is optional to you, regardless of whether external systems would require it or not, and your code should be fine either way.
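As a rough sketch of what “some fields are optional” looks like in code, here is one record type that both entry points feed into. Every field name here (`order_id`, `items`, `email`, `agent_id`) is an assumption made up for the example, not a real order schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Order:
    """One record shape for every point of origin (hypothetical fields)."""
    order_id: str
    items: List[str]
    source: str                           # "online" or "phone"
    email: Optional[str] = None           # usually present for online orders
    phone_agent_id: Optional[str] = None  # only present for phone orders


def parse_order(raw: dict, source: str) -> Order:
    """Normalize a raw payload from either route into the shared pipeline."""
    return Order(
        # Required everywhere, so a missing key should fail loudly.
        order_id=raw["order_id"],
        items=list(raw["items"]),
        source=source,
        # Optional to us: absent fields become None instead of an error.
        email=raw.get("email"),
        phone_agent_id=raw.get("agent_id"),
    )


online = parse_order(
    {"order_id": "A1", "items": ["book"], "email": "x@example.com"}, "online"
)
phone = parse_order(
    {"order_id": "A2", "items": ["pen"], "agent_id": "agent-7"}, "phone"
)
```

Both routes produce the same `Order` type, so everything downstream of `parse_order` is one pipeline with a couple of nullable fields, not two services to deploy and monitor.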

As tempting as it is to combine some generic tools into simple workhorse services, the benefits of having a single deployment of a common utility quite frankly aren’t worth the tighter coupling between teams. There are times when having a universally available service used by everyone makes sense, but that only happens when having a second instance of the service defeats the purpose of having the service at all. This really isn’t a new insight to anybody. Generally, the only time you see multiple services using a shared piece of infrastructure for general-purpose computing is in an on-premises data center, where there’s a lot more value in re-using server instances.

Posted at 11:45 AM