One In A Million Events
One-in-a-million events are rare. Getting hit by lightning is an example. Getting tails 20 times in a row on consecutive coin flips is another (the odds are 1 in 2^20, or roughly 1 in 1,048,576). Winning the lottery is rarer still, with odds far worse than one in a million.
However, one-in-a-million events occur eight times a day in New York City. We see people winning the lottery in the news despite the astronomical odds. We read about all kinds of miraculous feats, like surviving stage IV cancer or walking away from a nuclear explosion 10 miles away.
If these events are so rare, why do they happen?
Miracles like these are explained by the Law of Truly Large Numbers: in a large enough sample, any outrageous event becomes likely. New York City has a population of about 8 million, so if each resident experiences just one event a day, the math says roughly eight one-in-a-million events will happen there daily.
Now consider the same law in the domain of software engineering. Here we're not concerned with miracles but with failures. Any failure that is even remotely possible will occur. No exceptions. The Law of Truly Large Numbers assures it.
Software is unusual in that the number of events (an event here being a request) is massive. FAANG companies process millions of requests every second. At that volume, the expected number of rare failures is high.
But it's not only the big players that get bitten by the law. If you're working on anything that sees a modest amount of traffic, say 10 requests per second on average, then in a day you get 864,000 requests. If the probability of a rare event causing total system failure is one in a million, your system experiences such a failure almost once a day, around 26 times a month.
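Here's that back-of-the-envelope arithmetic as a quick Python check (the request rate and failure probability are the same illustrative numbers as above):

```python
# Expected failures per day for a given request rate and a
# per-request failure probability. Illustrative numbers only.

requests_per_second = 10
failure_probability = 1 / 1_000_000  # one in a million

requests_per_day = requests_per_second * 60 * 60 * 24
expected_failures_per_day = requests_per_day * failure_probability

print(f"{requests_per_day:,} requests/day")                    # 864,000
print(f"{expected_failures_per_day:.2f} failures/day")         # 0.86
print(f"{expected_failures_per_day * 30:.0f} failures/month")  # 26
```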
Is it acceptable for your business to have a total failure nearly every day? For most businesses, the answer is no.
This is why you must consider all possible edge cases, no matter how unlikely, when designing a solution. With distributed systems it matters even more, because a single user request crosses multiple services internally, and every hop is another chance for something to go wrong.
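To see how quickly reliability erodes with each hop, multiply the per-service success probabilities together. A short illustration (the 99.9% per-service figure is an assumption for the example):

```python
# End-to-end success probability decays multiplicatively with
# the number of services a request touches.

per_service_success = 0.999  # assume each service succeeds 99.9% of the time

for hops in (1, 5, 10, 20):
    end_to_end = per_service_success ** hops
    print(f"{hops:2d} hops: {end_to_end:.3%} end-to-end success")

# Output (approximate):
#  1 hops: 99.900% end-to-end success
#  5 hops: 99.501% end-to-end success
# 10 hops: 99.004% end-to-end success
# 20 hops: 98.019% end-to-end success
```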
Don't assume messages won't be lost in your message broker. Don't assume your data will always be saved in the database. Don't assume the input to your system will always be valid. Don't assume your servers will have 100 percent uptime. Don't assume the network connection won't terminate abruptly. Don't assume you won't encounter deadlocks. Don't assume third-party services will be up all the time. Don't assume your cache is always accessible.
Design for resiliency. Don't get lazy with the implementation by telling yourself a failure is so rare it won't happen. Because it will.
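As a concrete example of what that looks like, one common resiliency building block is retrying transient failures with exponential backoff instead of assuming the first call succeeds. A minimal sketch, where `call_third_party` and `TransientError` are hypothetical stand-ins for your own client code:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure: a timeout, dropped connection, 503, etc."""

def call_with_retries(fn, max_attempts=4, base_delay=0.1):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff with jitter so retries don't stampede.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Usage: wrap any flaky call instead of assuming it succeeds.
# result = call_with_retries(lambda: call_third_party(payload))
```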
Add redundancy to your system. Replicate your data. Ensure there's no single point of failure. Secure your application, because the law applies here as well: given enough attempts, someone will find the one-in-a-million exploit. Build your system to operate in a state of partial failure so your users aren't negatively impacted. Consider all possible errors, because your system can and will fail.
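Operating in a state of partial failure usually means returning a degraded but useful answer when a dependency is down rather than failing the whole request. A sketch of the idea, where the database, cache, and default results are all hypothetical:

```python
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # hypothetical generic fallback

def get_recommendations(user_id, db, cache):
    """Return personalized results when possible, degrading gracefully
    instead of failing the whole request when the database is down."""
    try:
        results = db.fetch_recommendations(user_id)  # primary path
        cache.set(user_id, results)                  # keep the fallback fresh
        return results
    except Exception:                                # broad catch for illustration
        stale = cache.get(user_id)                   # fallback 1: stale cached data
        if stale is not None:
            return stale
        return POPULAR_ITEMS                         # fallback 2: non-personalized default
```

The user still gets an answer; it's just a less personalized one than the happy path would have produced.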
Thanks for reading the article. If you haven't already, please subscribe. It's free, and I won't ever spam your inbox. Also, consider sharing it with a friend.