System Design - Designing Reliable Systems That Never Quit
In our last post, we learned that to scale a system, we have to break it apart into many smaller, independent services: a microservices architecture.
This horizontal scaling is great for handling massive traffic, but it introduces a critical challenge: failure is now guaranteed. When you have 100 independent components running on 100 different servers, something is always going to fail: a network cable breaks, a server runs out of memory, or a database connection times out.
A senior system designer does not try to prevent failure. They design a system that expects failure and recovers gracefully. This is the core of reliability.
This guide is about building systems that do not fail, or, more accurately, systems that are so resilient they appear not to fail to the end user. We will explore the essential patterns, like redundancy, heartbeats, and circuit breakers, that allow complex applications like Netflix and Google to stay available 99.999% of the time.
The Foundation of Reliability: Redundancy
The most basic and most powerful tool for ensuring reliability is redundancy. It is the strategy of having backups ready to take over instantly when the primary component fails.
No commercial aircraft flies with just one engine. It is designed to fly with two or four, so that if one fails, the aircraft remains airborne. Your software system must be designed the same way.
Component Redundancy
This is simply duplicating your components.
Web Servers
Instead of running one web server, you run three identical ones behind a Load Balancer. If Server A crashes, the Load Balancer immediately redirects all traffic to Server B and Server C. The user never sees an error. This is essential for High Availability.
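As a rough illustration, here is a minimal Python sketch of that failover behavior. The server addresses and the try-the-next-copy loop are assumptions for the example, not a real load balancer, which would also rely on the health checks described later in this post.

```python
import random
import urllib.error
import urllib.request

# Hypothetical pool of identical web servers sitting behind the load balancer.
SERVERS = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def handle_request(path: str) -> bytes:
    """Try the copies one by one; the user only sees an error if every copy is down."""
    candidates = SERVERS[:]
    random.shuffle(candidates)  # spread load roughly evenly across the copies
    last_error: Exception | None = None
    for base_url in candidates:
        try:
            with urllib.request.urlopen(base_url + path, timeout=2) as response:
                return response.read()  # first healthy copy wins
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this copy is unreachable; try the next one
    raise RuntimeError("all replicas failed") from last_error
```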
Database Replication
You never run a system with just one database copy. The primary database handles all the write requests (updates, saves) and constantly copies, or replicates, that data to one or more replica databases. If the primary database fails, a replica can be immediately promoted to take over.
This duplication costs more money (a budget constraint we discussed in Blog 2), but it is the non-negotiable price of high availability.
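A toy sketch of how an application might route around this setup follows. The connection objects and their execute method are hypothetical placeholders; real replication and failover are handled by the database and its tooling, not by a few lines of application code.

```python
import random

class ReplicatedDatabase:
    """Toy primary/replica routing: writes go to the primary, reads are
    spread across replicas, and a replica can be promoted if the primary dies."""

    def __init__(self, primary, replicas):
        self.primary = primary            # handles all write requests
        self.replicas = list(replicas)    # receive copies of the primary's data

    def write(self, statement):
        return self.primary.execute(statement)   # hypothetical connection API

    def read(self, query):
        return random.choice(self.replicas).execute(query)

    def promote_replica(self):
        """Called by failure-detection tooling when the primary is lost."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)       # a replica becomes the new primary
```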
Zone Redundancy
It is not enough to have a backup server in the same physical building. If the entire building loses power due to a disaster like a flood or a fire, your entire system is down.
This is why designers use multiple Availability Zones (AZs). An AZ is a completely separate data center, usually many miles away from any other AZ, with its own power sources, networking, and cooling systems.
By deploying identical copies of your entire system across three different Availability Zones, you ensure that the failure of an entire data center causes only a brief delay, not a full system outage.
Detecting System Failures
Redundancy only works if your system knows exactly when a component has failed. A server that has stopped responding is useless until the Load Balancer recognizes it is dead and removes it from service.
Health Checks
A Health Check is a simple API endpoint on a component (like a web server or microservice) that an external system calls to confirm it is alive and well.
A good health check does more than just say “I am here.” It checks critical internal functions. For a payment service, the health check might verify “I can talk to the database,” “I can connect to the third-party payment gateway,” and “My memory is not full.”
If the component returns a 200 OK status, it is healthy. If it returns an error status or fails to respond at all, it is considered unhealthy.
The Load Balancer constantly runs these health checks on all the servers behind it. If a server fails the check, it is instantly taken out of rotation.
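Here is what such an endpoint might look like as a minimal Flask sketch. The two dependency checks are stubbed assumptions; a real service would actually query its database and payment gateway.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def database_reachable() -> bool:
    """Stub: a real check would run a cheap query (e.g. SELECT 1) on the connection pool."""
    return True

def payment_gateway_reachable() -> bool:
    """Stub: a real check would ping the third-party payment gateway."""
    return True

@app.route("/health")
def health():
    checks = {
        "database": database_reachable(),
        "payment_gateway": payment_gateway_reachable(),
    }
    healthy = all(checks.values())
    status_code = 200 if healthy else 503  # 503 tells the load balancer to pull this server
    return jsonify(status="ok" if healthy else "unhealthy", checks=checks), status_code
```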
Heartbeats
A Heartbeat is a slightly different pattern, often used for distributed background processes. Instead of an external system checking the component, the component itself actively sends a periodic signal to a central monitoring service.
Imagine you have a cluster of 50 servers processing images in the background. Every 5 seconds, each server sends a heartbeat message to a central registry saying “I am alive and working.”
If the central registry does not receive a heartbeat from Server 42 for 15 seconds, it concludes that Server 42 has failed and sends an alert to replace it.
Health checks are usually used to load balance API traffic; heartbeats are typically used for background-processing worker pools.
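A stripped-down version of that heartbeat registry might look like the following sketch. In a real cluster the heartbeat would be a network call rather than a shared dictionary; the worker and registry names here are illustrative.

```python
import threading
import time

HEARTBEAT_INTERVAL = 5    # seconds between "I am alive and working" messages
FAILURE_THRESHOLD = 15    # silence longer than this marks a worker as failed

last_seen: dict[str, float] = {}   # worker id -> time of its last heartbeat
lock = threading.Lock()

def record_heartbeat(worker_id: str) -> None:
    """Registry side: note that this worker just checked in."""
    with lock:
        last_seen[worker_id] = time.monotonic()

def find_failed_workers() -> list[str]:
    """Registry side: workers that have been silent past the threshold."""
    now = time.monotonic()
    with lock:
        return [w for w, t in last_seen.items() if now - t > FAILURE_THRESHOLD]

def worker_loop(worker_id: str) -> None:
    """Worker side: periodically announce that this process is alive."""
    while True:
        record_heartbeat(worker_id)   # in practice, an HTTP or message-queue call
        time.sleep(HEARTBEAT_INTERVAL)
```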
The Defensive Patterns
Once one part of a system fails, that failure can quickly spread. This is known as a cascading failure, where a small problem brings down the entire application. These defensive patterns are designed to stop that spread.
The Circuit Breaker Pattern
Imagine your application, Service A, needs to call Service B to fulfill a request. Service B is slow because its database is struggling. Service A keeps calling Service B, which slows it down even more until Service B completely crashes. Service A then crashes because it is waiting on Service B, which in turn causes Service C to crash, and so on. This is a cascading failure.
A Circuit Breaker acts like a real electrical circuit breaker.
Closed
The circuit is normal. Service A calls Service B.
Open
If Service A sees that 50% of its calls to Service B are failing or timing out within a 10-second window, the Circuit Breaker flips to Open.
Short Circuit
For the next 60 seconds, Service A stops sending any requests to Service B. It either fails the request instantly on the client side with an error or sends a fallback response. This gives Service B time to recover, because it is no longer being hammered with requests.
Half Open
After the timeout period, Service A sends a single “test” request. If it succeeds, the Circuit Breaker closes and normal traffic resumes. If it fails, the breaker stays open.
This is critical because it forces the failure to be contained and allows the struggling component time to heal itself.
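To make the state machine concrete, here is a minimal, illustrative circuit breaker in Python. The thresholds mirror the numbers used above (50% failures in a 10-second window, a 60-second cooldown), and both the downstream call and the fallback are supplied by the caller. Production systems usually rely on a battle-tested library rather than hand-rolled code like this.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: trip after too many recent failures,
    fail fast while open, then allow a single test call (half-open)."""

    def __init__(self, failure_threshold=0.5, window=10.0, min_calls=10, cooldown=60.0):
        self.failure_threshold = failure_threshold  # trip at 50% failures...
        self.window = window                        # ...within a 10-second window
        self.min_calls = min_calls                  # ignore very small samples
        self.cooldown = cooldown                    # stay open for 60 seconds
        self.calls: list[tuple[float, bool]] = []   # (timestamp, succeeded)
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, fallback):
        now = time.monotonic()
        if self.state == "open":
            if now - self.opened_at < self.cooldown:
                return fallback()            # short circuit: no request is sent at all
            self.state = "half_open"         # cooldown elapsed: allow one test request
        try:
            result = func()
        except Exception:
            self._record(now, ok=False)
            if self.state == "half_open" or self._failure_rate(now) >= self.failure_threshold:
                self.state, self.opened_at = "open", now   # trip (or keep) the breaker open
            return fallback()
        self._record(now, ok=True)
        self.state = "closed"                # a success closes the circuit again
        return result

    def _record(self, now, ok):
        self.calls = [(t, s) for t, s in self.calls if now - t < self.window]
        self.calls.append((now, ok))

    def _failure_rate(self, now):
        recent = [ok for t, ok in self.calls if now - t < self.window]
        if len(recent) < self.min_calls:
            return 0.0
        return recent.count(False) / len(recent)
```

Service A would then wrap every outbound call in something like breaker.call(call_service_b, fallback=cached_response), where both functions are whatever the caller provides.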
Retries with Jitter
When a network request fails, the failure is often temporary. Maybe the server was briefly busy, or the network hiccuped. A simple solution is to retry the request.
However, if 10,000 users all retry at the exact same millisecond after the first failure, you will simply overload the server again.
The solution is to retry with Exponential Backoff and Jitter.
Exponential Backoff
If the first attempt fails, you wait 2 seconds before retrying. If that fails, you wait 4 seconds. If that fails, you wait 8 seconds, and so on. You increase the delay exponentially.
Jitter
You add a small random amount of time (the jitter) to each delay. So instead of waiting exactly 4 seconds, you wait 4 seconds plus a random extra delay of up to, say, 0.5 seconds. This ensures that the thousands of failing requests spread out their retries, preventing a coordinated wave that overwhelms the recovering server.
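Put together, exponential backoff with jitter can be sketched in a few lines of Python. The operation argument stands in for whatever network call is being retried, and the delays simply follow the 2-4-8-second example above.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=2.0, max_jitter=0.5):
    """Retry a flaky call, doubling the wait each time (2s, 4s, 8s, ...) and
    adding a small random delay so thousands of clients do not retry in sync."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts: give up
            delay = base_delay * (2 ** attempt)        # exponential backoff: 2, 4, 8, ...
            delay += random.uniform(0, max_jitter)     # jitter spreads the retries out
            time.sleep(delay)
```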
Graceful Degradation and Fallbacks
A truly reliable system prioritizes the user experience even during a partial failure. This is called Graceful Degradation.
The system should continue to operate in a limited but meaningful capacity when key services are unavailable.
Full Service
On an e-commerce site, you can browse products, see recommendations, and check current inventory levels.
Degraded Service
The Recommendation Service crashes. Instead of showing an error, the system uses a Fallback Pattern. It simply stops displaying the recommendations section, or shows a static list like “Popular Items,” but the core functionality of browsing and buying remains fully operational.
Prioritization
A designer must prioritize core functions over auxiliary functions. If the payment gateway fails, that is a critical failure. If the “You might also like” service fails, that is a non-critical failure that can be gracefully managed with a fallback.
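A fallback like this usually amounts to a simple wrapper around the auxiliary call. In the sketch below, the recommendation client and its personalized_for method are hypothetical; the point is that the catch-and-fall-back logic keeps the page rendering no matter what the Recommendation Service is doing.

```python
def popular_items_fallback() -> list[str]:
    """Static, precomputed list shown whenever personalization is unavailable."""
    return ["Popular Item 1", "Popular Item 2", "Popular Item 3"]

def get_recommendations(user_id: str, recommendation_client) -> list[str]:
    """Auxiliary feature: never let its failure block browsing or buying."""
    try:
        # Hypothetical client call with a tight timeout so a slow service
        # cannot drag down page rendering.
        return recommendation_client.personalized_for(user_id, timeout=0.3)
    except Exception:
        # Non-critical failure: log it and degrade gracefully to a generic list.
        return popular_items_fallback()
```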
Netflix’s Resilience (Case Study)
Netflix is the best real world example of these principles in action.
Redundancy
They run their entire service across multiple Availability Zones in Amazon Web Services (AWS).
Health Checks
The load balancers constantly check the health of every server in the viewing service.
Circuit Breakers
Netflix created and open-sourced a library called Hystrix (now deprecated, but the concept lives on) to implement the Circuit Breaker pattern across all their microservices, ensuring that slow or failed services do not take down the entire streaming experience.
Graceful Degradation
If the service that manages personalized artwork (the thumbnail images) is slow, they will show a generic default image instead of making the user wait. The stream starts immediately even if a secondary service is lagging.
Netflix’s approach is a testament to designing for failure. They even run a tool called Chaos Monkey that randomly terminates servers and services in production to test their own resilience every day.
The Next Step
Designing a reliable system means shifting your mindset from prevention to recovery.
You achieve resilience through Redundancy (having backups), Detection (health checks and heartbeats), and Defense (circuit breakers, retries, and graceful degradation). A highly available system is simply a collection of components that have been carefully separated and protected from each other.
With a strong foundation in reliability established, we can now turn our attention to the heart of the system: the data.
In the next post, “Data Design and Storage Strategies,” we will dive into the permanent brain of the system: the database. We will discuss the trade-offs between SQL and NoSQL and learn how to manage data at massive scale using techniques like indexing, sharding, and partitioning.


