The Abstracted Unreliability Anti-pattern

I don’t know if it’s genuinely on the uptick, or just coincidence, or merely runaway pattern recognition on my part, but I keep seeing the same logical fallacy applied over and over again: if I add more layers to a system it will become more reliable.

A recent example: this set of services has poor uptime, so we need to put another service in front of it, and this will – apparently, magically – give it better uptime.
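A back-of-envelope check shows why the extra service can't help on its own. This is only a sketch, and it assumes that the new front service and the existing backend fail independently and that every request has to pass through both. Under those assumptions the combined availability is simply the product of the two, which can never exceed the backend's own figure. The numbers are made up for illustration, not measurements from any real system.

```python
def serial_availability(*availabilities: float) -> float:
    """Availability of components chained in series.

    Assumes independent failures: a request succeeds only if every
    component in the chain is up at the same time.
    """
    total = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        total *= a
    return total


# Hypothetical figures, chosen purely for illustration.
backend = 0.99   # the flaky service we are trying to "fix"
front = 0.999    # the shiny new layer in front of it

combined = serial_availability(front, backend)

print(f"backend alone:    {backend:.4%}")
print(f"behind new layer: {combined:.4%}")
```

Even a front layer that is ten times more reliable than the backend still drags the total down slightly, because it is one more thing that has to be up.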

This might be true in specific cases. If the abstraction layer utilises caching, for example, it’s conceivable it’ll be able to continue operating in at least some capacity while an underlying component is (briefly) unavailable. Or maybe the reduced load on the downstream service(s) will alleviate pressure on racy code and dodgy GC. But this is practically never a factor in the real-world examples I see. It really does seem to be “step 1, collect underpants … step 3, profit”.
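To put rough numbers on the caching case, here is a minimal sketch. It assumes a deliberately simple model of my own: a fraction of requests (`hit_ratio`) can be answered from the front layer's cache without touching the backend at all, the rest need both to be up, and failures are independent. The function names and figures are hypothetical.

```python
def availability_with_cache(front: float, backend: float, hit_ratio: float) -> float:
    """Availability seen by callers of a caching front layer.

    Simplified model: cache hits need only the front layer, cache
    misses need the front layer and the backend. Failures are assumed
    to be independent.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return front * (hit_ratio + (1.0 - hit_ratio) * backend)


def break_even_hit_ratio(front: float, backend: float) -> float:
    """Hit ratio at which the layered system merely matches the backend alone.

    Found by solving front * (h + (1 - h) * backend) == backend for h.
    Requires backend < 1 and front > 0.
    """
    if backend >= 1.0 or front <= 0.0:
        raise ValueError("need backend < 1 and front > 0")
    return backend * (1.0 - front) / (front * (1.0 - backend))


# Hypothetical figures, for illustration only.
front, backend = 0.999, 0.99

for h in (0.0, 0.05, 0.5, 0.9):
    print(f"hit ratio {h:.0%}: {availability_with_cache(front, backend, h):.4%}")

print(f"break-even hit ratio: {break_even_hit_ratio(front, backend):.2%}")
```

With no cache hits the layer is a pure loss, and it only starts paying for itself once the hit ratio clears the break-even point. In this model, any gain comes from the caching, not from the layer as such.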

The best you could say is that it accidentally arrives at a practical benefit, despite itself.

It’s important to separate the actual causes of improvement from the red herrings. I can only assume that confusion between the two is what has allowed this blatantly illogical practice to gain ground. The story goes “We implemented better input validation and an extra layer, and things got better”, and bafflingly nobody ever seems to question how “and an extra layer” contributed.

Now, adding an abstraction layer might have other direct benefits – e.g. the opportunity for use-case-specific APIs, better alignment with organisational structure, etc. – but reliability is unlikely to be one of them, particularly if by ‘reliability’ one largely means availability. Again, it’s crucial to understand the actual causality involved, and not to mistake coincidence for cause.

The logical thing to do in the face of an unreliable component is to simply make it reliable. Anything else is just making the situation worse.