Why Designing for the SLA Matters

When we design systems, it’s easy to get absorbed in technology stacks, architecture patterns, or frameworks. But there’s a principle that too often gets overlooked — designing for the SLA (Service Level Agreement).

An SLA is not just an uptime percentage on a dashboard. It’s about understanding:

Who the stakeholders are.
What business processes they depend on.
What happens if the system fails them.

Let’s explore this through two real-world-inspired examples.

Example 1: Vaccination System During COVID-19

Think back to 2020. The world was under strict restrictions, and vaccination campaigns were critical. Governments had to build digital systems — fast — to support this effort.

Take Poland as our case study:

Population: 37 million people
Nurses administering vaccines: ~200,000
Seconds in a day: 86,400

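If every citizen needs to register for a vaccine, the request rate from patients is large. As a rough estimate, assuming one request per person and everyone registering within a single day:

Patient RPS ≈ 37,000,000 / 86,400 ≈ 428

Meanwhile, for nurses verifying patient details and recording doses, under the same one-request-per-person-per-day assumption:

Nurse RPS ≈ 200,000 / 86,400 ≈ 2.3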
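The arithmetic behind these two load figures can be reproduced with a short Python sketch. It is only a back-of-the-envelope estimate: it assumes one request per person, spread evenly across a single day, which real traffic would not follow. The function name `average_rps` is made up for this illustration.

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests: int, window_seconds: int = SECONDS_PER_DAY) -> float:
    """Average requests per second if `requests` are spread evenly over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return requests / window_seconds


# Population and nurse counts come from the case study above.
# One request per person per day is an assumption made for this sketch.
patient_rps = average_rps(37_000_000)
nurse_rps = average_rps(200_000)

print(f"patients: ~{patient_rps:.0f} RPS")  # ~428
print(f"nurses:   ~{nurse_rps:.1f} RPS")    # ~2.3
```

Under these assumptions the patient side carries roughly 185 times the load of the nurse side.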

Clearly, the patient side generates much more load. But SLA is not about traffic volume; it’s about business impact. Consider what a service disruption means for each stakeholder:

Patients: If their app doesn’t work today, they simply try again tomorrow or the day after. The vaccination still happens eventually, so the business risk is minimal.
Nurses: If their app goes down, it’s a disaster. Vaccines removed from the fridge expire within an hour. If nurses cannot record doses or verify patient details, thousands of vials could be wasted in a single day. The system directly impacts the effectiveness of the entire vaccination campaign.

👉 That’s why nurses require a much stricter SLA than patients, despite generating fewer requests.
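To make "stricter SLA" concrete, the sketch below converts an availability target into the downtime it allows over a 30-day period. The two targets (99.9% for nurses, 99% for patients) are hypothetical values chosen for illustration; the article does not prescribe specific numbers.

```python
from datetime import timedelta


def downtime_budget(
    availability_percent: float,
    period: timedelta = timedelta(days=30),
) -> timedelta:
    """Maximum downtime allowed in `period` while still meeting the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return period * (1.0 - availability_percent / 100.0)


# Hypothetical targets, used only to show the gap between the two tiers.
targets = {"nurse": 99.9, "patient": 99.0}

for stakeholder, target in targets.items():
    minutes = downtime_budget(target).total_seconds() / 60
    print(f"{stakeholder}: {target}% -> {minutes:.1f} minutes per 30 days")
```

With these example targets the nurse application could be down for about 43 minutes a month, while the patient application could be down for about 7.2 hours. For vaccines that expire shortly after leaving the fridge, even the tighter budget may be too generous if it is spent in one outage, which is why the business process, not the percentage, should drive the target.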

Implication for System Design

If resources are limited, the safer and smarter approach is to separate the systems:

One application dedicated to patients.
One application dedicated to nurses.

This way:

Critical nurse workflows remain stable and insulated from patient-side surges.
Complexity and cascading failures are reduced.
Each application can evolve independently: the nurse system with a strong focus on reliability, the patient system with more flexibility to adapt.

In short, not all stakeholders are equal — and your architecture should reflect that.
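The isolation idea can be illustrated in miniature. The sketch below is a simplified, single-process stand-in for two separate applications: each stakeholder group gets its own capacity pool, so a surge on one side cannot use up the other side's capacity. The class name `CapacityPool`, the pool sizes, and the `admit` helper are all assumptions made for this example, and the code is not thread-safe.

```python
from dataclasses import dataclass


@dataclass
class CapacityPool:
    """A fixed number of request slots reserved for one stakeholder group."""
    name: str
    max_in_flight: int
    in_flight: int = 0

    def try_acquire(self) -> bool:
        # Reject instead of queueing once this pool is full.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        if self.in_flight > 0:
            self.in_flight -= 1


# Hypothetical sizes: each group is limited only by its own pool.
pools = {
    "patient": CapacityPool("patient", max_in_flight=100),
    "nurse": CapacityPool("nurse", max_in_flight=20),
}


def admit(stakeholder: str) -> bool:
    """Try to take a slot from the pool that belongs to this stakeholder."""
    return pools[stakeholder].try_acquire()


# Simulate a patient-side surge of 500 simultaneous requests.
rejected = sum(not admit("patient") for _ in range(500))
nurse_ok = admit("nurse")

print(f"patient requests rejected: {rejected}")   # 400
print(f"nurse request admitted: {nurse_ok}")      # True
```

The patient pool saturates and sheds load, yet the nurse request is still admitted. Separate applications give the same property at a larger scale, along with independent deployments and failure domains.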

Example 2: Car Rental – Rental vs. Return

Let’s shift to another industry: car rentals.

Two main processes matter:

Rental – customer picks up a car.
Return – customer brings the car back.

Which one needs the higher SLA?

If rental fails, the company simply loses revenue for the day. It’s not great, but survivable.
If return fails, it’s catastrophic:
- Customers get frustrated.
- They flood call centers.
- They leave negative reviews.
- Worst of all, they escalate complaints to UOKiK (the Polish consumer protection authority).

Repeated violations can result in millions in fines (the most famous case being around 10 million PLN).

👉 In this case, return services require a higher SLA than rental services.

SLA and Frequency of Change

Processes with high SLA requirements must change less often (e.g., “return car”).
Processes with low SLA requirements may change frequently (e.g., “rent car” with new promotions, pricing, features).

The more often you change something, the more likely you are to introduce bugs. That’s inevitable; see the Microsoft whitepaper on this effect.

So the smart design choice is:

Keep high-SLA, low-change processes isolated.
Allow low-SLA, high-change processes to evolve more quickly.

This reduces risk. You don’t want a small bug in the fast-moving “rental” module to suddenly take down your critical “return” system. Avoid any situation where a service has a high SLA and is also expected to change frequently: the two requirements pull against each other, and that combination does not hold up in real life.
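One way to keep this rule visible is a small check over a catalogue of processes that flags anything combining a high SLA with frequent change. The sketch below is an illustration only: the `Process` fields, the thresholds, and the sample figures are hypothetical, not measurements from a real rental company.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    availability_target: float  # percent; hypothetical values below
    deploys_per_month: int      # rough proxy for frequency of change


def risky_processes(
    processes: list[Process],
    high_sla_threshold: float = 99.9,
    max_deploys_for_high_sla: int = 2,
) -> list[str]:
    """Names of processes that have a high SLA and also change frequently."""
    return [
        p.name
        for p in processes
        if p.availability_target >= high_sla_threshold
        and p.deploys_per_month > max_deploys_for_high_sla
    ]


catalogue = [
    Process("return car", availability_target=99.95, deploys_per_month=1),
    Process("rent car", availability_target=99.0, deploys_per_month=12),
    # A deliberately bad design: return logic bundled with fast-moving promotions.
    Process("return car + promotions", availability_target=99.95, deploys_per_month=12),
]

print(risky_processes(catalogue))  # ['return car + promotions']
```

Only the bundled process is flagged, which is the signal to split it: keep the return flow on its own slow release cadence and move the promotion logic to the fast-moving side.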

Closing Thoughts

Designing for SLA is about asking the hard questions upfront:

Who are the stakeholders?
What happens if the process fails?
Which parts of the system must never go down, and which ones can tolerate some disruption?

In both examples — vaccination systems and car rentals — the answer was clear. Not all stakeholders are equal, not all processes are equal, and your architecture should reflect that.

When resources are limited, separate applications are often the right answer. This approach minimizes risk, keeps critical processes reliable, and allows for safe evolution of the rest.