Whenever we write applications or services, we generally include some form of health check that we can easily (and regularly) poll to make sure everything is still up and running. That health check likely confirms that the code has access to everything it needs to function correctly. Well, generally that's "everything it needs that its developers can control." Rarely does a health check include external dependencies, even though almost all released software depends on systems outside its authors' control. So how do you tell when the external systems you rely on stop working?
It's a best practice for health checks to cover all the dependencies that are part of your service, ensuring not only that your service is reachable, but that its dependencies are reachable (by it): confirming your database connection is still open, or that you can still read from any event streams you're listening to, to name a couple of examples. You get the idea – check all the parts you deploy as part of your health check. It's earned its reputation as a best practice – for small, self-contained services. But if your software does anything useful for customers, it's going to have to talk to external services outside your control. The problem comes when those services go down: you're not monitoring them, but their errors still propagate to you.
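To make that concrete, here's a minimal sketch of a "deep" health check: instead of just reporting that the process is alive, it probes each internal dependency and reports per-check detail. The probe functions are hypothetical stand-ins for real checks against your own infrastructure.

```python
# A minimal sketch of a health check that covers internal dependencies.
# The probe functions are stand-ins, not a real database or stream API.

def check_database() -> bool:
    # Real version: run "SELECT 1" against the connection pool.
    return True  # stand-in result

def check_event_stream() -> bool:
    # Real version: confirm the consumer received a message or
    # heartbeat within the last few minutes.
    return True  # stand-in result

def health() -> dict:
    checks = {
        "database": check_database(),
        "event_stream": check_event_stream(),
    }
    # Overall status is "healthy" only if every dependency passes,
    # and the per-check detail tells you which one failed.
    return {"healthy": all(checks.values()), "checks": checks}
```

Returning the per-check detail, not just a boolean, is what makes the health endpoint useful for diagnosing *which* part went down.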
OK, really the problem comes when your code is impacted by that other service being down. Historically, the solution has been to practice defensive programming and "handle" external dependencies being, well, undependable. You still should – it's a good way to react to external dependencies failing. The problem is that reacting is all you're doing. Wouldn't it be nice to know that an external dependency was having problems before your users exposed it by hitting an error message caused by a problem somewhere else? Of course, to do that, you'd have to be monitoring external applications and services that you don't even control.
The issue with monitoring services you can't control is that you can't do anything about the root cause of your problem – namely, that the service you don't control is having a problem. But what you can do is be proactive in dealing with the issue. OK, thinking about that, "be proactive" may not be the best phrase; it's more like "react assertively," which is still pretty good. What does "react assertively" mean anyway? That one's really context-specific. Is it a service that helps auto-fill some form data (like filling in city and state when a user provides a zip code)? Then you can just skip the auto-fill. Is it a service you use for processing user-submitted data? You can probably save the data to your database and process it offline once the service in question comes back up. Is it something you'd block on, waiting for a response to forward to a user? The best thing there may be to disable that feature (if you can) until you see that it's come back up.
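The three strategies above can be sketched as code, one per kind of dependency. The helper functions (`lookup`, `process`, `enrich`) and the flag/queue names are illustrative assumptions, not a real API.

```python
# Sketch of three "react assertively" strategies when an external
# dependency is known to be down. All service calls are stand-ins.
import queue

offline_queue = queue.Queue()          # work saved for later replay
feature_flags = {"enrichment": True}   # kill switch for optional features

def lookup(zipcode):   # stand-in for the real zip-code service call
    return {"city": "Springfield", "state": "IL"}

def process(payload):  # stand-in for the real processing service call
    return "processed"

def enrich(data):      # stand-in for the real enrichment service call
    return {**data, "enriched": True}

def autofill_city_state(zipcode, lookup_ok):
    # Strategy 1: the dependency is optional, so just skip it and let
    # the user fill the fields in manually.
    if not lookup_ok:
        return None
    return lookup(zipcode)

def submit_user_data(payload, processor_ok):
    # Strategy 2: persist the work now and process it offline once the
    # service comes back up.
    if not processor_ok:
        offline_queue.put(payload)
        return "accepted-for-later"
    return process(payload)

def render_response(data, enricher_ok):
    # Strategy 3: flip the feature off entirely until the dependency
    # recovers, rather than blocking on it.
    feature_flags["enrichment"] = enricher_ok
    if not feature_flags["enrichment"]:
        return data
    return enrich(data)
```

The point in each case is the same: the caller decides up front what "good enough without the dependency" looks like, instead of discovering it through an exception.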
Architecting a way to monitor external dependencies is tricky. The first problem is figuring out how to health check a service you don't control. That's likely the easiest part – worst case, you can always just call the service in question. But if you have multiple applications or services that use this dependency, you don't want them all checking on its health: between all the services health-checking the external dependency and actually using it, you'd effectively DDoS the service you're trying to monitor, putting it into an unhealthy state yourself. Doing this right means writing a whole new service to do the job (so I hope you're a fan of microservices, because you're about to have a new one).
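The core of that new service is a single shared checker: one component probes the external dependency on a fixed interval, and everything else reads the cached result, so N internal services don't turn into N health-checkers. A minimal sketch, with the class name and interval as assumptions:

```python
# Sketch of a shared external-dependency checker. One probe per
# interval reaches the external service, no matter how many internal
# callers ask about its health.
import time

class ExternalDependencyMonitor:
    def __init__(self, probe, interval_seconds=30):
        self._probe = probe            # callable that actually hits the service
        self._interval = interval_seconds
        self._last_checked = 0.0
        self._last_result = None

    def is_healthy(self):
        now = time.monotonic()
        if self._last_result is None or now - self._last_checked >= self._interval:
            self._last_checked = now
            try:
                self._last_result = bool(self._probe())
            except Exception:
                # Any failure to reach the service counts as unhealthy.
                self._last_result = False
        return self._last_result
```

Callers that share one instance never cause more than one real probe per interval – which is exactly the property that keeps you from accidentally DDoSing the dependency.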
Once you're checking the health of all these external dependencies, you should pipe the data into your organization's monitoring and alerting infrastructure. That way your applications and services can be notified of issues with external services the same way you (presumably) are for all of your organization's internal services. That keeps your monitoring infrastructure in one place and reduces the code you'd have to write to get a historical record of how healthy these services are over time. If you don't have reliable internal monitoring, that's a much bigger priority for you than monitoring services you can't actually fix when they break.
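One low-friction way to feed that data in is to emit a statsd-style gauge (`name:value|g` over UDP), which most monitoring stacks can ingest. The metric name, host, and port below are assumptions for illustration:

```python
# Sketch of reporting external-dependency health as a statsd gauge.
# 1 = healthy, 0 = unhealthy; alerting rules can then page on 0.
import socket

def emit_health_gauge(dependency: str, healthy: bool,
                      host: str = "localhost", port: int = 8125) -> bytes:
    payload = f"external_dependency.{dependency}.healthy:{int(healthy)}|g".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Fire-and-forget: UDP won't block your checker if the metrics
    # collector itself is down.
    sock.sendto(payload, (host, port))
    sock.close()
    return payload
```

Because it's just a gauge over time, your existing dashboards and history retention handle the "how healthy has this been lately" question for free.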
The principle that you should write your code assuming any external service is unreliable is good advice – for reacting to problems in other services. What would be more useful is detecting those problems before users do something that forces the error, allowing you to preemptively react to the issue in a way that's less annoying to them. Remember, users don't care that you have more than one service on the backend (or that some of those services weren't written by your company) – they just care that they tried to use your application and it didn't work. It's a better experience to put up a maintenance message for features that depend on a struggling service, so users can't interact with them, than to let them hit an error because a feature on your application didn't work. The real trick is designing a good external solution that can ping services you don't control (and that are absolutely not going to let you put your monitoring agent on their machines). There are a few good ways to do this, but pick something that works for you, so you can be sure you're handling unreliability in all your dependencies and not just the ones you deployed.