Universal Truths
I firmly believe there are three universal truths about any service:
- Reliability is paramount. If a service isn’t reliable, it fails to serve its purpose. Prioritizing reliability should always be the primary focus.
- You do not get to decide what your reliability is. No matter what your measurements, logs, and metrics say, and even if a million healthy instances are all reporting up, if your customers or users think you are unreliable, then you are. If you are not meeting your users’ expectations, you are not being reliable, so you need to take their perspective into account.
- Nothing is ever perfect. Striving for a 100% success rate is unrealistic and can lead to unnecessary stress and disappointment. Outside of pure mathematical constructs, perfection is unattainable. Failures and disruptions are inevitable, but they don’t have to be catastrophic. Customers generally accept that occasional failures happen; people are fine with failures as long as they are infrequent and handled effectively. Instead of aiming for the impossible, set realistic and achievable targets for your service reliability. This approach fosters a healthier work environment, ensures sustainable operations, and ultimately leads to greater customer satisfaction. Aim for excellence, but accept and plan for imperfection.
Embrace the Journey Towards Reliability: Implementing SLOs
Just Do It: Choose a service, select an SLO, and begin measuring it.
When it comes to Service Level Objectives (SLOs), the key is to start somewhere. Even if your initial targets and indicators aren’t perfect, you can always adjust them. The important part is to begin gathering data and use it to improve.
SLOs often start as a bottom-up initiative and grow organically from within a team. The philosophical benefits can be understood in theory, but the real buy-in happens when people see hard data.
Adjust and Adapt
If your initial measurements and targets don’t yield the desired results, don’t be disheartened. Reassess and pick different targets or measurements. Sometimes, it may just mean you’re not ready for that specific SLO.
Every industry understands that failure is part of the process. The key is to embrace and manage it, ensuring you’re not failing too often. This mindset leads to happier engineers, better businesses, and more satisfied users and customers.
SLIs, SLOs, and Error Budgets
Service Level Indicators (SLIs)
SLIs are your system’s measurements or telemetry. They tell you whether your system is performing as expected. Ideally, these indicators should reflect your users’ perspective as closely as possible.
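As a rough illustration of an SLI that tries to mirror what a user experiences, the sketch below labels each request as good or bad. The `Request` type, the 300 ms latency threshold, and the rule that any 5xx response counts as bad are all assumptions made for this example; your own indicator should come from what your users actually expect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one user-facing request."""
    status_code: int
    latency_ms: float


# Assumed threshold for illustration only; pick one that matches your users.
LATENCY_THRESHOLD_MS = 300.0


def is_good(request: Request, latency_threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """Return True if a user would likely consider this request successful."""
    served = request.status_code < 500            # server errors count as bad
    fast_enough = request.latency_ms <= latency_threshold_ms
    return served and fast_enough


requests = [
    Request(200, 120.0),   # fast and successful
    Request(200, 950.0),   # successful but too slow for the user
    Request(503, 40.0),    # fast but failed
    Request(404, 80.0),    # client error, the service behaved correctly
]
good_flags = [is_good(r) for r in requests]
print(good_flags)
```

Note that the slow 200 response counts as bad here: the server reported success, but the user waited too long, which is the kind of gap between internal metrics and user experience this indicator is meant to catch.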
Service Level Objectives (SLOs)
SLOs set the targets for how often your SLIs should meet user expectations. They express this as a ratio of good events to total events, which yields a percentage. Remember, aiming for 100% reliability is unrealistic. Instead, choose a pragmatic target that allows for some failure.
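The good-over-total ratio can be sketched in a few lines. The function names, the event counts, and the 99.9% target below are illustrative assumptions, not recommendations.

```python
def sli_percentage(good_events: int, total_events: int) -> float:
    """Good events over total events, expressed as a percentage."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return 100.0 * good_events / total_events


def meets_slo(good_events: int, total_events: int, target_percent: float) -> bool:
    """Compare the observed percentage with the SLO target."""
    return sli_percentage(good_events, total_events) >= target_percent


# Hypothetical numbers: 99,950 good requests out of 100,000.
observed = sli_percentage(99_950, 100_000)
print(f"observed: {observed:.2f}%, meets 99.9% target: {meets_slo(99_950, 100_000, 99.9)}")
```

Rejecting a window with no events is one possible design choice; you might instead decide that a period with no traffic neither helps nor hurts the SLO.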
Error Budgets
At the top of the reliability metrics are error budgets. An error budget is the amount of failure your SLO target allows: a 99.9% target leaves a budget of 0.1% of events. Error budgets track how much of that allowance you have used over time, usually within a window ranging from a week to a quarter, and help you assess whether you’re failing the right amount to meet your SLO targets.
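To make the arithmetic concrete, here is a minimal sketch of an event-based error budget for a single window. The function name, the returned field names, and the sample numbers are assumptions for illustration. The target is passed as a string so that `Fraction` can do exact arithmetic and avoid floating-point rounding on values like 99.9.

```python
from fractions import Fraction


def error_budget(total_events: int, bad_events: int, target_percent: str) -> dict:
    """Summarize error budget usage for one window of events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # Share of events the SLO allows to fail, e.g. 1/1000 for a 99.9% target.
    allowed_failure_ratio = 1 - Fraction(target_percent) / 100
    allowed_bad = allowed_failure_ratio * total_events
    if allowed_bad <= 0:
        raise ValueError("a target of 100% or more leaves no error budget")
    return {
        "allowed_bad_events": float(allowed_bad),
        "consumed_fraction": float(Fraction(bad_events) / allowed_bad),
        "remaining_bad_events": float(allowed_bad - bad_events),
    }


# Hypothetical window: one million requests, 400 of them bad, 99.9% target.
report = error_budget(total_events=1_000_000, bad_events=400, target_percent="99.9")
print(report)
```

In this made-up window the budget allows 1,000 bad events and 400 have occurred, so 40% of the budget is spent. A consumed fraction above 1.0 means the budget is exhausted and the SLO has been missed for that window.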
Communicating Through Error Budgets
Error budgets are powerful communication tools. They enable you to:
- Report Reliability: They provide a historical record of how reliable your system has been.
- Influence SLOs: They help others establish their SLOs based on your performance data.
- Make Informed Decisions: They guide resource allocation, whether you need more resources to meet targets or can shift focus based on performance.
For example, if you consistently exceed your error budget, it could signal the need for more resources. Conversely, if you’re consistently under budget, you might shift some focus or resources to other projects, accelerate feature releases, or experiment with new initiatives.
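One way to turn that reasoning into a repeatable check is sketched below. It looks at the fraction of the error budget consumed in several recent windows, where 1.0 means the budget was fully spent. The function name, the 0.5 "low usage" cutoff, and the wording of the suggestions are hypothetical; treat the output as a conversation starter, not a policy.

```python
from typing import Sequence


def suggest_action(consumed_fractions: Sequence[float], low_usage: float = 0.5) -> str:
    """Suggest a next step from budget usage across recent windows."""
    if not consumed_fractions:
        raise ValueError("need at least one window of data")
    if all(c > 1.0 for c in consumed_fractions):
        # Budget exceeded in every window: reliability likely needs more resources.
        return "invest in reliability"
    if all(c < low_usage for c in consumed_fractions):
        # Consistently well under budget: room for features or experiments.
        return "shift focus"
    return "hold steady"


print(suggest_action([1.3, 1.1, 1.6]))    # over budget in every window
print(suggest_action([0.2, 0.1, 0.35]))   # well under budget in every window
print(suggest_action([0.4, 1.2, 0.8]))    # mixed picture
```

Requiring the same signal in every window reflects the word "consistently" in the example above: a single bad week on its own does not trigger a change.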
In summary, error budgets not only help in managing reliability but also in conveying past performance and future needs to stakeholders.
Final Thoughts
Reliability is paramount, and your users define what that looks like. Start small, using even a spreadsheet if necessary, and build your way up. Measure, adapt, and communicate. Your journey towards reliability begins with that first step.
Embrace these truths, and you’ll find yourself on a path toward more reliable services, empowered teams, and, ultimately, a more successful organization.