
Essential Guide to SLIs, SLAs, SLOs and Error Budget Concepts


Welcome

Wondering what Service Level Objectives (SLOs) are? This article explains the concept of service level objectives and how they relate to SLAs, SLIs, and Error Budgets (EBs).

Objectives: What You Will Learn

  1. ✅ What is an SLO, and what’s the difference between SLO, SLI, and SLA?
  2. ✅ Why do we need a Service Level Objective (SLO)?
  3. ✅ What is the purpose of an SLO?
  4. ✅ What are the challenges of creating an SLO?
  5. ✅ What is an Error Budget, and how does it relate to the SLO?
  6. ✅ How do you monitor your SLOs and Error Budgets?
  7. ✅ How do Error Budgets drive priorities in your engineering organisation?

What is a Service Level Objective (SLO)?

A Service Level Objective (SLO) is a reliability target, measured by a Service Level Indicator (SLI).

SLOs represent customer happiness and guide the development team’s velocity.

SLOs quantify customers’ expectations for reliability and start conversations between product and engineering about reliability goals and action plans when the objective is at risk. An example SLO for a service: 99% availability over a rolling 28-day window.
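
For illustration only (the field names here are assumptions, not from the article), an SLO like that can be captured as a small structure combining an SLI, a target, and a time window:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str          # what is measured, e.g. "availability"
    target: float     # reliability target, e.g. 0.99 for 99%
    window_days: int  # rolling evaluation window in days

# The example SLO from the text: 99% availability over a rolling 28-day window.
homepage_slo = SLO(sli="availability", target=0.99, window_days=28)
print(homepage_slo)
```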

Teams and engineers might feel tempted to set the objective at 100%, but that’s too good to be true!

100% is the wrong reliability target for basically everything

— Benjamin Treynor Sloss, VP 24×7, Google; Site Reliability Engineering, Introduction — 

Change brings instability, which will inevitably lead to failure. Not only is 100% reliability an impossible target, but it would also mean that you can’t make any changes to the service in production. So expecting perfect reliability is the same as choosing to stop any new features from reaching customers and choosing to stop competing in the market. 

The rule of thumb for setting an SLO is to find the point where the customer is happy with the service's reliability. 

Increasing the reliability target to no end is not a great business decision. The goal is not perfection, but making customers happy with just the right level of reliability. Once the customers are satisfied with the service, extra reliability offers little value. 

Why? Because if customers are happy with your reliability, they’ll want new features instead of more reliability!

Why have an SLO at all?

Suppose your engineering organisation decides that running its online services against a formally defined SLO is too rigid for its taste, throws the SLO culture out of the window, and adopts a “make the service as available as is reasonable” operational culture instead. Does this make things easier? Definitely not. You simply don’t mind if the system goes down for an hour now and then; indeed, perhaps some downtime is normal during a new release and the attendant stop-and-restart.

Unfortunately for you, customers don’t know that. All they see is that requests that were previously succeeding have suddenly started to return errors. They raise a high-priority ticket with support, who confirm that they see the elevated error rate and escalate to you. Your on-call engineer investigates, confirms this is a known issue, and responds to the customer with “this happens now and again, you don’t have to escalate.”

Without an SLO, your team has no principled way of saying what level of downtime is acceptable; there's no way to measure whether or not this is a significant issue with the service.

Therefore, you cannot terminate the escalation early with “Service is currently operating within SLO.”

As Perry Lorier, Google CRE, likes to say, if you have no SLOs, toil is your job.

What is the Purpose of an SLO?

The purpose of an SLO is to measure customer happiness, protect the company from SLA violations (where applicable), and create a shared understanding of reliability across product, engineering, and business leadership.

What customers want is just the right balance between reliability and innovation

They want new features, and they want to be able to use those features at any time. Businesses must innovate and release new features to drive revenue and growth, while maintaining reliability to retain customers. Setting an SLO helps teams discuss and agree on how to balance reliability with feature velocity using a data-driven approach.

What is the Difference between SLA, SLI, and SLO? 

SLI vs. SLO vs. SLA:

  • SLI – Definition: a quantifiable measure of the current service reliability. Example: the proportion of home page requests served in less than 100 ms, as measured from the latency column of the server log. Owners: the SRE team.
  • SLO – Definition: a target for the service’s reliability, measured by an SLI. Example: 99% of home page requests in the past 28 days served in less than 100 ms. Owners: the product, SRE, ops, and DevOps teams.
  • SLA – Definition: a legal contract that carries financial penalties if breached. Example: a monthly average round-trip transmission time of 200 ms or less for 98% of requests. Owners: the legal, SRE, ops, and DevOps teams.

The difference between the three terms is simple: the SLI is the indicator used to define and measure the SLO. An SLA does not exist for every business, but when one does, it serves as the outer bound within which the SLO sits, so the internal SLO is set at least as strict as the SLA.

SLO and SLA: the Service Level Agreement as an outer bound for the Service Level Objective

For example, suppose the SLA is your credit card limit; then the SLO would be your budget, and the SLI would be your actual spending. Every person has expenses and should ideally monitor them against a set budget. However, not everyone owns a credit card, and those who do should keep their budget below the credit limit, because exceeding the limit has repercussions.

Let’s take a look at the SLI and the SLA in more detail.

SLI (Service Level Indicator)

An SLI is used to measure a service’s reliability. It’s a quantifiable metric built from monitoring data about your service. The key to selecting the right indicator is to find out what your customers expect from your service. Additionally, you shouldn’t choose too many indicators, as that will pull your attention away from the one or two most indicative ones.

Traditionally, for request-driven services, SLIs are calculated in terms of latency or availability, but you can also use freshness, durability, quality, correctness, or coverage. Different types of systems call for different SLI metrics.

  1. Request-driven services: SLIs are usually calculated in terms of availability and latency. In simpler terms: can we respond to the request, and if so, how long does it take? (A small sketch of these SLIs follows this list.)
  2. Data processing systems or pipelines: usually emphasize throughput or latency. In simpler terms: how much data is processed, and how long does it take the system to go from data ingestion to completion?
  3. Storage systems: focus on latency, availability, and durability. In simpler terms: how long does it take the system to read or write data, and can the user access the data on demand?
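
As an illustration (not from the original article), here is a minimal Python sketch of how a request-driven service’s availability and latency SLIs could be computed from request records; the record fields and the 100 ms threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    status_code: int   # HTTP status returned to the caller
    latency_ms: float  # time taken to serve the request

def availability_sli(requests: List[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)

def latency_sli(requests: List[Request], threshold_ms: float = 100.0) -> float:
    """Fraction of requests served faster than the latency threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)

# Toy data: three fast successes, one slow success, one server error.
sample = [Request(200, 42), Request(200, 61), Request(200, 87),
          Request(200, 140), Request(500, 30)]
print(f"Availability SLI: {availability_sli(sample):.2%}")   # 80.00%
print(f"Latency SLI (< 100 ms): {latency_sli(sample):.2%}")  # 60.00%
```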

You can use the following SLI Menu to pick the right indicator for your service or system:

SLI Menu (source: The Art of SLOs, Google)

SLA (Service Level Agreement)

An SLA is a legal agreement between the service provider and the customer. It includes the minimum reliability target for the service and the financial consequences of not meeting it. The consequences may include a partial refund, discounts, or extra credits. An SLO is an internal objective for your team and is not usually a part of the client contract. 

SLOs and SLAs are often confused, but they’re two distinct concepts. Because an SLO is an internal objective, it does not carry a financial penalty when breached. When there is an SLA, the corresponding SLO is generally tighter. For example, if the SLA defines 99.5% uptime, then the internal objective might be 99.8%. By setting a more stringent internal objective, the SRE team gets a chance to take proactive action before the SLA is breached and the contractual agreement broken.

SLAs, unlike SLOs, are set by the business development and legal teams. That can be a challenge, because neither team is directly involved in building or running the technology. Involving the engineering teams (SREs, developers, QA, DevOps, data, etc.) therefore increases the probability of creating a functional SLA.

Who Defines the SLO? 

Defining an SLO is a collaborative process driven by the SRE team, but it requires input from multiple stakeholders across the organization. 

The key stakeholders involved are:

  • Product owners: they anticipate customer needs and communicate them to the development and SRE teams. Ideally, they contribute to the definition of the SLO so that it reflects customer needs.
  • SRE and ops teams: they help ensure that the SLO is realistic and sustainable, without excessive toil that causes burnout.
  • Development teams: the software teams building the product. They can pitch in on the SLOs and negotiate a relaxation if reliability work is slowing down release velocity.
  • Customers: both internal and external users and stakeholders fall under this umbrella. They contribute to the SLO definition via feedback meetings, customer complaints, tweets, the SLA, and so on.

Once the SLO is determined, it’s documented by its authors in an SLO Worksheet, then passed to reviewers (who check for technical accuracy) and approvers (who weigh in based on business considerations).

Here is an example of an SLO Worksheet:

SLO Worksheet (source: Adopting SLOs, Google)

To make service level objectives work across the organization, each team would ideally agree that the SLO is a reasonable approximation of the user experience and use it as the principal driver for decision making. Not meeting the SLOs usually has well-documented consequences that redirect engineering effort towards improving reliability. To enforce those consequences, the operations team needs executive support.

At the end of the day, SLOs align incentives, but they’re not enough on their own. In a heavily siloed organization, it’s much harder to reach an agreement. The best chances of success come when there’s a shared sense of responsibility: developers feel responsible for making the service reliable, and the SRE team feels responsible for helping developers release new features.

What are Some Characteristics of a Well-thought-out SLO? 

A good service level objective must align with the company’s specific business needs.

For example, if all your customers are in the same time zone and use the service from 10 to 5, then availability outside those active hours wouldn’t matter to them. Since customers won’t try to access the service then, they won’t be unhappy if it breaks during their inactive hours.

Secondly, according to Google’s paper on meaningful reliability:

A good service availability metric should be meaningful, proportional, and actionable

Meaningful, in this context, means it captures user experiences. Proportional means that any change in the metric must be proportional to a variation in user-perceived availability. Finally, actionable means it provides system owners an insight into why availability was low over a specific period of time. 

Lastly,

A solid SLO must be realistic

The objective shouldn’t be too far off from how the services have been performing so far, and it’s best decided with the team’s resource constraints kept in mind. You don’t want to aim for the stars and demoralize your team with an unrealistic objective. 

What are Some Challenges and Pitfalls of Creating SLOs?

Creating service level objectives can be quite challenging especially in the beginning. Everyone wants 100% reliability, which is unrealistic. It means that the service has zero error budget (no tolerance towards failure), which is a drawback in itself. 

Another common pitfall is starting with far too many SLOs at an early stage. Given the complexity of most systems, starting small and iterating over time is the best course of action. You don’t want to make the system more complex than it needs to be. Only the most critical services need to be measured, and each would ideally have only two to six SLIs.

Not spelling out SLOs in plain and simple language is another common pitfall. Since they are mostly internal objectives meant to help the development and SRE teams balance feature development with reliability work, they should ideally be simple enough that anyone on those teams can understand them.

Finally, creating an SLO is a collaborative process that requires input and buy-in from everyone, including leadership, the development team, the SRE team, and product owners. To make SLOs work, all relevant teams and individuals must agree that they’re reasonable and can be used as the basis for decision-making. Also, the consequences of not meeting SLOs can hardly be enforced without executive support.

What is an Error Budget and How Does it Relate to the SLO?

Service level objectives are used to calculate the error budget – a tool used to balance innovation with reliability.

Error budget defines the acceptable level of unreliability that a service can afford without impacting customer happiness. 

As long as the service remains within the error budget, developers can take more risks. On the other hand, when the error budget starts to dry up, the developers would likely need to make safer choices.

Here’s how you can calculate the error budget using SLOs:

Error Budget = 1 − Availability SLO

For example, if the SLO is 97%:

Error Budget = 1 − 97% = 3%

For example, if you set a three nines SLO target (99.9%), that means you can serve one error in every 1000 requests to your users. 

Or, in terms of complete downtime, your service can be unavailable for a little over 40 minutes in a four-week period. Rather than just passively measuring the SLO, and potentially exceeding it substantially, we can treat the acceptable unreliability as a budget that we can spend on various development and operational activities.
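
To make the arithmetic concrete, here is a small Python sketch (an illustration added for this guide, not tooling from the original source) that turns an availability SLO into its error budget and the downtime that budget allows over a 28-day window:

```python
def error_budget(slo: float) -> float:
    """Error budget: the fraction of requests/time allowed to fail."""
    return 1.0 - slo

def allowed_downtime_minutes(slo: float, window_days: int = 28) -> float:
    """Total downtime the error budget permits over the rolling window."""
    return error_budget(slo) * window_days * 24 * 60

for slo in (0.97, 0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: error budget {error_budget(slo):.2%}, "
          f"~{allowed_downtime_minutes(slo):.1f} minutes of downtime per 28 days")
# e.g. SLO 99.90%: error budget 0.10%, ~40.3 minutes of downtime per 28 days
```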

In the following availability SLO/Error Budget (EB) table, we will list the availability vs. downtime per year and month:

Availability SLO vs. allowed downtime per year and month (source: Adopting SLOs, Google)

SLOs Monitoring – Who Does It and How? 

The best way to monitor SLOs is through error budget alerts and policies.

The SRE team sets the monitoring system to send an alert if a particular percentage of the error budget is consumed. For example, send an alert if 75% (or 50%) of the error budget is consumed over a 7-day period.
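
As a rough sketch of that kind of check (an illustration with assumed threshold values, not the article’s actual tooling), the logic boils down to comparing how much of the error budget has been consumed in the window against the agreed thresholds:

```python
def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget consumed so far in the window."""
    allowed_bad = (1.0 - slo) * total_events  # failures the budget allows
    return bad_events / allowed_bad if allowed_bad else float("inf")

def crossed_thresholds(bad_events: int, total_events: int, slo: float,
                       thresholds=(0.50, 0.75, 1.00)):
    """Return the alert thresholds that the current consumption has crossed."""
    consumed = budget_consumed(bad_events, total_events, slo)
    return [t for t in thresholds if consumed >= t]

# 800 failed requests out of 1,000,000 against a 99.9% SLO: the budget allows
# 1,000 failures, so roughly 80% of the error budget is already gone.
print(crossed_thresholds(800, 1_000_000, 0.999))  # [0.5, 0.75] -> notify per policy
```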

Monitoring SLOs is mainly the job of the SRE team. They collect the SLI metrics and work with other teams to define the SLO. If the SLO is at risk, the teams then decide if any action is needed and figure out how to meet the SLO. The SRE team collaborates with development and product teams to ensure the targets and policies are agreed upon.

The process begins with monitoring and measuring the service’s SLIs over time. The SLIs are then translated into SLOs by attaching a reliability target over a window of time (e.g., 28 days).

The SRE team will then turn these SLO definitions into tangible dashboards and alerts against the error budget consumption. If an action is required, then the owner team with the help of the SRE team figures out what steps must be taken to meet the target. Without SLOs and error budget policies, the SRE team will have no way to decide whether and when they should take an action. 

How Often is the SLO Evaluated?

Running a service with an SLO is an adaptive and iterative process. In a 12-month period, a lot will change: new features might not be covered, customer expectations might change, or the company’s risk-reward profile might shift.

It’s important to reevaluate your SLOs regularly, because meeting your SLO is no good if your customers are still unhappy and complaining on Twitter or Zendesk. Review and reevaluate your SLOs every few months at first, then follow up with a similar review every six to twelve months.

There are no hard and fast rules established regarding the SLO evaluation. Depending on the product, expected usage, and managing team, SLOs can be different for each team. You can consider all user groups such as mobile users, desktop users, and people from various geographic locations and modify the SLOs accordingly.

Evaluate and refine the targets until you find the optimal point. For example, if your team is consistently performing well above the SLO, then:

  • You can tighten up the SLO and increase service reliability (and maybe tell your customers about your superior reliability as a competitive advantage), or
  • Capitalize on the unused error budget by investing in product development or experiments. 

Whereas, if your team is consistently struggling to keep up with the SLO, then:

  • You can bring the SLO down to a more manageable level, or
  • Invest in stabilizing the product before rolling out new features.

SLOs are continuously evaluated; it’s all about learning, innovating, and starting over!

What are the Consequences of Not Meeting an SLO? 

The consequences of not meeting the SLOs usually involve code freezes, slowing down development, and shifting more resources towards bug fixes. 

What’s important here is that the consequences of not meeting the SLOs are agreed upon by the product team, developers, and SREs.

SREs will also make sure to use the error budget policies to alert relevant leaders and teams as soon as the SLO is at risk.

How can Developers “Spend” their Error Budget (EB)?

An error budget is just like your household budget. It’s the amount of spending (unreliability) that your system can afford without making customers unhappy. And just like a household budget, you’re allowed to spend your EB within a given period as long as you don’t overspend.

Developers can spend the EB any way they see fit. Teams new to SLOs often release new features as frequently as they want, only to suddenly realize they’ve spent all of their EB and it’s time to stop shipping. With better alerting, teams learn to slow down development once they’ve spent a significant percentage of the EB. As teams mature in SRE practice and gain better control over their EB, they begin to spend it strategically, taking calculated risks on innovative or experimental features. The EB prevents companies from chasing extra reliability that doesn’t improve customer happiness at the expense of these innovation opportunities.

This is how the EB can eventually speed up innovation and velocity. Increasing development velocity gives your product an advantage over the alternatives: by outpacing your competitors, you give customers a reason to buy your product first. Being first to market means less competition and a better chance of success. By the time your competitor’s product even hits the market, you’re already hitting your business goals!

What Actions Should a Team Take if their Error Budget is Spent or Close to Spent? 

The SRE team works with the development team to implement alerts and policies that minimize customer impact when different amounts of the error budget have been burned (50%, 75%, and 100%, for example).

A team may choose to alert higher levels of management as the EB burndown gets closer to 100% and the manager would determine the best course of action accordingly. This alerting/EB policy is what makes EBs and SLOs actionable. In fact, Twitter did not successfully implement SLOs until they instituted EB policies as well. 

If a team has burned their entire error budget, previously agreed-upon policies can come into effect to prevent further customer impact. For example, the manager may go into code red and freeze all new releases until they’ve brought the number of errors down to a reasonable point. If there are way too many errors, then the SRE team may have to do a system rollback. That gives developers enough time to deal with the errors gradually and release the changes over time. 
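
As a sketch of what such a previously agreed policy might look like (the thresholds and actions below are illustrative assumptions, not the article’s prescriptions):

```python
# Map the fraction of error budget burned to the agreed response.
# Thresholds and actions are example values; real policies are negotiated
# between the product, development, and SRE teams.
ERROR_BUDGET_POLICY = {
    0.50: "Notify the owning team; review recent risky changes.",
    0.75: "Alert engineering leadership; prioritize reliability fixes.",
    1.00: "Code red: freeze feature releases, consider rollback.",
}

def actions_for(consumed: float) -> list[str]:
    """Return every policy action whose threshold has been crossed."""
    return [action for threshold, action in sorted(ERROR_BUDGET_POLICY.items())
            if consumed >= threshold]

print(actions_for(0.8))  # budget 80% burned -> first two actions apply
```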

Here are some ways that the development team can focus on improving reliability instead of shipping new features when the error budget is spent or nearly spent:

  • Fixing bugs in the program code or resolving procedural errors.  
  • Soften hard dependencies that were identified in previous incident retrospectives. Removing dependencies will make the code less complex and easier to manage.
  • If the EB was consumed by miscategorized errors that would have caused the service to miss its SLO, categorize the errors correctly to avoid further confusion.

What Actions can the Development Team take if they are well Above the Target Uptime? 

If the development team is well above the target uptime, they have an advantage: they can increase their push velocity and take more risks without endangering customer happiness.

Here are a few things that the development team can do if they’re well above the target uptime:

  • Introduce bigger changes
  • Increase release velocity 
  • Take risks without troubling the SRE team

At the end of the day, defining service level objectives can be challenging for any organization at the beginning, but it’s an important milestone on the journey to reliable application delivery.
