The Need for Site Reliability Engineering: A Podcast with Blameless Founders Ashar Rizqi and Lyon Wong

In this episode of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), EAR’s Dan Woods covers the topic of site reliability engineering. Site reliability engineering is a practice that emerged at the internet giants to help them create highly reliable, high-performing, but also fast-moving systems. It is essentially an extension of the DevOps movement. To delve into the topic, we spoke with Ashar Rizqi and Lyon Wong, the founders of a company called Blameless. Blameless takes site reliability engineering practices and productizes them. Their conversation covers:

* 2:00 – The story of the creation of Blameless
* 13:00 – The history of site reliability engineering
* 22:30 – What a blameless culture looks like
* 36:45 – Escaping a culture of blame

Q&A

Woods: Ashar and Lyon, could you both please introduce yourselves, and then describe Blameless.

Rizqi: My name is Ashar Rizqi. I’m one of the cofounders and the CEO of Blameless. I’ve been in the traditional IT operations, DevOps, sys admin, SRE domain for almost a decade now. I started my career in financial services doing traditional enterprise IT type of work, following ITIL principles. After working at Box.com, I ran the platform engineering organization at a company called MuleSoft, which would later go on to be acquired by Salesforce. That was another opportunity to bring SRE best practices into a company that was doing extremely well as a business but running into a lot of cultural, technological, and operational challenges.

In a sense, you were present at the creation of site reliability engineering and you lived through its expansion and growth?

Rizqi: I wouldn’t say I was present at the creation of SRE. I would say I was present at the creation of SRE at some of these fast-growing companies. But as a movement, you know, SRE has been around since 2003. It was pioneered by Google. It took a while for that set of practices to actually catch on, and it was the fastest growing, fastest moving technology giants of today that were adopting those practices.

Wong: I’m one of the cofounders as well and the COO of Blameless. I started my career at Microsoft, pre-SRE. But the idea of reliability is core to an OS. There were initiatives to reduce the number of hangs and crashes that I was a part of. And reliability of the operating system was paramount. It was interesting to see how this evolved as we moved into the cloud, as we moved into microservices, and that really was from the perspective of a VC (I was a partner at Lightspeed for the past six years). People are going to depend on their apps as much as they would depend on something like Windows because you get attached to having something on demand and available. Well, how are you possibly going to deliver that, because it’s no easy feat to try to do that within an OS.

What is the mission of Blameless?

Rizqi: We started Blameless because we felt the pain of our customers and also the pain that our teams were experiencing in trying to clean up the messes, essentially technical debt, that are created over time by companies that are doing really well and growing fast and making poor decisions, not out of any ill intent, but just doing what’s needed at the time to move fast. That pain has to be dealt with by somebody. Think of using the Uber application on your phone: you pull it up, you put in the destination that you’re trying to get to, and within a matter of seconds you get a driver and pricing, and you’re on your way. Now, if that process takes anything longer than a few minutes, you’re just going to swipe your phone up, open up the Lyft app and repeat the process, and Uber has just lost a customer to Lyft, or vice versa.

So the idea here is that site reliability engineering sits at the end of the line, where everything that’s built is right in front of the customer and being used. It’s the problems that you find there and how you solve them—that’s really what you’re trying to figure out. And then how do you create a process where the awareness of all of this gets expanded so that the person writing the line of code is aware of the problems you have at the end for the user.

Rizqi: That’s right. I’ll only add one more thing to that and it’s how do you do that without assigning blame and pointing fingers and creating a culture of self-perpetuating negativity inside the environment.

So Blameless, the name of your company is highlighting the fact that in order for this process to work, we can’t be trying to shoot the person who made the mistake?

Rizqi: That’s exactly right.

Now let’s talk about the whole idea of site reliability engineering. What is site reliability engineering and why is it ready for a wider audience?

Wong: I can speak to why SRE can be broadly applicable because it’s really instrumental to the software development process. As you’re making decisions of how to architect the system, choosing pieces for the infrastructure, nothing is perfect. Otherwise, you’re going to take too long to ship. So you’re making decisions, where you know there are issues, you’re going to come back and fix that, but as things move quicker, as you’re growing successfully, you rarely have time to come back and fix everything. That’s what we call technical debt. It accumulates.

What happens when these issues start impacting the user? That’s when you have to have a system, and SRE is one such system, where you can prioritize and assess when you should fix those issues before they become customer impacting. One core concept of SRE in practice is the idea of setting SLOs, service level objectives: how do I set an internal objective that I can use to raise the performance or reliability of a part of a system before the customer or end user is affected? Being proactive is key. SRE helps address the many issues that build up and discover what you should fix first and what you should prioritize.

Mainstream IT isn’t necessarily building everything the same way that these Internet giants are. How does the site reliability engineering process play with product-based platforms, and usually multi-product platforms?

Wong: It actually becomes even more important, because within the context of an OS, you own all the lines of code. There’s a lot of it, but you own everything, and you can eventually go and figure it out. But here, you have a lot of dependencies. Each service, each other product you’re subscribing to, is in itself a black box, and it would be wrong to assume each of those little boxes is flawless. You’ve also multiplied the software engineering teams involved, each with their own technical debt. You don’t know how that combination comes together. At the end of the day, it’s going to be a complex mix of someone’s technical debt colliding with someone’s update and someone’s bug, and you need to look past all those choices and say we’re going to promise, at the end, the highest quality or a certain level of reliability for our end user.

The idea is that mainstream IT will be able to look at what’s actually impacting the end users and then raise those issues up and have the whole team figure out how to address them?

Wong: Exactly. They can sandbox it, they can have more redundancy and backups. You need to have a failover plan.

Ashar, can you talk about the history of SRE?

Rizqi: SRE in its modern form has only existed since 2003, when Google started putting this into practice. But the core principles of reliability and placing value on the experience of the person that’s using a particular product or service or trying to achieve a mission have been around for decades. The theme that we see as we’re talking to folks is that there are references back to earlier decades and mature practices that allowed us as a civilization to do amazing things. In the older days of traditional product development, we had a waterfall methodology and we had a notion of silos within any organization. You’ve got a development team that is very different from an operations team. That’s where the sys admin mindset comes in. As a sys admin or operator, the thing that keeps me up at night is that when the product or feature is broken, I’m being paged at 3:00 a.m. to solve this problem. So my incentive as a sys admin is to prevent change from going out into production as much as possible. The incentives of the two teams have been historically misaligned.

That creates this culture of conflict between the development team and the operations team, which would then create a culture of blame, which would then lead to things slowing down. That’s how the IT industry has been operating for decades.

The next phase that came into play was the DevOps movement, which was a response to the tension that existed between the operations team and the development team. It was a set of cultural norms and practices whose goal was to reduce the gaps between software development and operations. The pattern that we saw is that it was meant to be a cultural movement and it turned into “how do we DevOps continuous delivery?”

Is it fair to say that the DevOps movement was being done before it was actually codified?

Rizqi: Precisely right. The example I love to use here is that when I joined Box.com, I like to consider us, the initial team that I worked on, as one of the best practitioners of DevOps. The fundamental reason why we succeeded, and this sounds facetious, is that we never called it DevOps. Only now, when we reflect back and think about the principles we used to operate, do we realize that, hey, we were actually practicing DevOps.

SRE came out of Google around 2003, when Google was going through massive growth, doing extremely well as a company, and about to go public, and it needed a new way to operate to reach the scale that Google is at now. That required a different mindset. SRE emerged independently of DevOps, but it has the same set of principles. There are differences in specific implementations that you can classify as an SRE practice, but the core of it is very much the same. SRE has since spread, particularly to the small upstarts, which are just not going to be able to throw thousands of headcount at a particular reliability problem. They have to get creative. But the principles have stayed intact.

A fair way to describe SRE is that it’s a focus on the last part of the DevOps process, where you’re observing the interactions of the systems and the users, you’re looking at what’s going on, and then you’re trying to figure out a way to accelerate the whole DevOps communication. And SRE is essentially a name for the departments that were at that user-facing operations end.

Rizqi: That’s right. The best way to talk about it is to compare the principles within DevOps and SRE to see where the similarities come in.

The principles that we talked about were no more silos, elimination of toil, measuring user happiness, use of service level objectives, use of error budgets, and the general mindset of blamelessness. Are those what you consider SRE principles?

Rizqi: Yes. No more silos means that now we’re breaking this proverbial wall between developers and operations. That means there is a shared sense of ownership and we’re all using the same tools and techniques and we’re all using the same set of practices to get the job done. We’re all going to be aware of how to write code well and the best practices when it comes to testing and quality and continuous delivery. The incentives are always aligned. That’s what we mean by reducing organizational silos and shared ownership. We’re all speaking the same language.

So it’s not “Why did you give me this crappy code?” or “Why can’t you make my beautiful code run properly?” It’s “How do we do a great job for the users?”

Rizqi: Exactly. And then the definition of toil is doing the same thing over and over again without adding any type of intellectual value. Traditional enterprise IT is rife with this, and the example I like to use is ticketing in general. The number of clicks and the amount of follow-up that folks have to do inside these systems is unacceptable these days. There’s a human being whose job it is to spend hours every day creating tickets for other people to follow up on and update. This leads to crazy practices. There’s often an outsourcing of ticket handling, where an outsourced company, offshore perhaps, will be handling tickets and is rewarded by how many tickets it resolves in a certain amount of time. But then it turns out that the clock restarts when you reclassify a ticket. So a practice develops where they look at all the tickets that are about to expire unresolved and reclassify them all, so that no ticket ever misses the deadline unresolved. That’s what I would call toil. What you’ve done is create the wrong incentive for people to do work. In SRE, the main focus is how do we automate ourselves out of a job? That’s the core principle that drives this notion of how we move away from toil and into high value. The goal is to minimize manual systems work and bring long-term value to the system.

How do you explain the measuring user happiness principle of SRE?

Rizqi: We all need a purpose that we’re trying to go after. Typically, traditional enterprise IT is so detached from the business, and the business is so driven by users and the user experience, that there’s disillusionment. Whereas within SRE it’s the opposite: all of the work that’s being driven within the SRE domain is 100% tied to user happiness. Unreliability in the system is one of the biggest contributors to user unhappiness. They’re going to leave. They’re going to go somewhere else and guess who suffers? Everybody.

Whatever metrics you decide represent user happiness, that is what you’re going to be operating and measuring for. Measuring user happiness is an umbrella and that is represented by certain types of metrics, service level objectives.

Rizqi: Yes. Historically what we’ve had is an SLA, particularly as more and more traditional IT practices are moving into this hosted SaaS type of model. My question is how does one number represent all of the complexity for potentially thousands of people and systems all coming together to provide a particular kind of user experience? The principle behind SLOs ties to the principle we talked about before — let’s only focus on user happiness and not worry about the thousands of other metrics. Principle number two within SLOs is there’s no natural system in the world that can commit 100% of anything to you as a human being. Why do we expect that from our systems, why do we expect that from our products? So let’s accept the reality and the fact that there is a suboptimal but perfectly okay state for us to be operating in which is not 100%. It could be 99%, 99.9%. That again has to be tied back to what is acceptable to my user, and that difference between the user experience and what they’re going to tolerate is what ties into error budgets. There is a budget, regardless of how big or small it is, where it is acceptable for your users to have a certain level of degraded experience and that’s okay. But what that gives you as an organization is the sense of empowerment that you can decide I have this lever that I can use to either slow down or move fast and it’s all tied back to my user experience. This is a very objective metric. There’s no opinion here anymore.
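As a rough illustration of the arithmetic behind the targets mentioned here, the sketch below converts an availability target into an error budget expressed in minutes. The function name `error_budget_minutes` and the 30-day window are assumptions made for the example; neither comes from the conversation or from any Blameless product.

```python
from fractions import Fraction


def error_budget_minutes(slo_percent: str, window_days: int = 30) -> Fraction:
    """Minutes of degraded service that a target tolerates over the window.

    slo_percent is passed as a string (for example "99.9") so the
    arithmetic stays exact instead of picking up floating-point noise.
    The 30-day default window is an assumption for illustration.
    """
    target = Fraction(slo_percent) / 100       # "99.9" -> 999/1000
    window_minutes = window_days * 24 * 60     # 30 days -> 43,200 minutes
    return (1 - target) * window_minutes


# The two targets mentioned in the conversation.
for slo in ("99", "99.9"):
    budget = error_budget_minutes(slo)
    print(f"{slo}% over 30 days leaves {float(budget):.1f} minutes of budget")
```

Under these assumptions a 99% target leaves 432 minutes of budget in a 30-day window, while 99.9% leaves 43.2 minutes, which shows how quickly each additional nine shrinks the room a team has to make changes.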

The idea is that you have user happiness that you’re measuring at certain levels. You then have metrics, or service level objectives, for each thing that’s important in delivering user happiness. The difference between a service level objective and a service level agreement is that, for an objective, the minimum amount of performance is defined by its impact on user happiness. So for one service, operating 80% of the time could be fine because a failure doesn’t have that much impact on users. Another service might need five nines or six nines because if it fails once, it has a huge impact. Then the difference between perfection and the point where user happiness is affected is the error budget.

Rizqi: That’s right. The only thing I would add to that is an SLA is, ironically, not even a number set by the teams that are developing the product. Typically, it gets set by somebody in the sales organization or the legal team because they’ve been asked to promise a certain level of uptime or availability. The SLO is a very objective number that’s focused on user happiness metrics and the thing that’s informing those metrics are your existing monitoring systems. There’s no human being that’s coming in and informing this metric. It’s the system that’s actually doing it.
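To make the idea of a system-informed number concrete, here is a minimal sketch of how request counts from a monitoring system could feed the "slow down or move fast" lever described earlier. Everything in it is hypothetical: the function names, the request counts, and the 75% threshold are assumptions for illustration, not a description of how any particular monitoring tool or the Blameless platform works.

```python
from fractions import Fraction


def budget_consumed(total_requests: int, failed_requests: int,
                    slo_percent: str) -> Fraction:
    """Share of the error budget used so far (1 means fully spent)."""
    target = Fraction(slo_percent) / 100
    allowed_failures = (1 - target) * total_requests
    if allowed_failures == 0:
        raise ValueError("a 100% target or zero traffic leaves no error budget")
    return failed_requests / allowed_failures


def release_pace(consumed: Fraction,
                 slow_down_at: Fraction = Fraction(3, 4)) -> str:
    """Hypothetical policy: slow down once 75% of the budget is spent."""
    return "slow down" if consumed >= slow_down_at else "move fast"


# Hypothetical counters, as they might be read from a monitoring system.
consumed = budget_consumed(total_requests=1_000_000, failed_requests=400,
                           slo_percent="99.9")
print(f"{float(consumed):.0%} of the budget used: {release_pace(consumed)}")
```

The point of the sketch is that no opinion enters the decision: the counters come from the system, the target comes from what users will tolerate, and the pace follows from the two.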

It seems service level agreements are very similar to operating level agreements because you’re actually talking about the real work we’re going to do and what’s important about that work and how to measure that.

Rizqi: Yeah. In keeping with the SRE principle of removing these silos and collaborating, why can’t we collaborate with our customers and come to an agreement on what we both collectively think the inherent or implicit agreement between us is?

So finally we get to the name of the company concept, blamelessness. Obviously we don’t like to live in cultures where people are blaming, but it seems that we often, especially in IT and operations cultures, blame. If anything goes wrong, then I’m in trouble. How do you escape that culture of blame?

Rizqi: This is one thing that both Lyon and I are extremely passionate about, and something that we’ve been very intentional about practicing at our own company, Blameless. It stems from our experiences.

How would you explain simply what it means to be blameless?

Rizqi: A blameless organization means that when something bad happens, at no point is an individual or a set of individuals called out explicitly for that particular thing. The system is assuming the blame.

Wong: How I look at that is there’s also a maturity level for the organization. If you’re trying to blame the person who triggered the event, that is not a mature way to look at the situation, because you’re expecting too much from the system; it’s like blaming the user for deleting a file. But as the organization becomes more mature, you realize that the person finding the issue should be the one rewarded, if you’re taking that issue and really learning from it, following it through to all the endpoints, and getting the proper fix in place, not just to prevent that one issue, but having actually looked holistically at how to make that system better. It becomes one of your competitive advantages.

It seems like it would be hard to go from a hostile environment to one in which it was a blameless environment, especially when in both environments, your ultimate goal is high user happiness. How do you make that transition?

Wong: You can’t avoid problems when you need to change what’s currently not working; I think that’s fundamental. You have to be willing to go make the bigger change. People stick to just the surface change, to changes that will likely not cause other issues. But to actually fix something more fundamental, you may uncover other issues. Part of this is showing by example, and that’s why you need a system of accountability and metrics that show that, as you fix these issues, the system is becoming more reliable.

So blamelessness allows us to have ambition, to actually uncover all the technical debt, all of the ugliness and deal with it instead of being afraid?

Wong: Exactly. If you’ve ever heard an engineer on the team say, “You don’t want to touch that, that’s legacy,” or “The person working on that area has left, no one should go make changes in that body of code,” that is exactly the kind of thing that stays there because everyone’s afraid.

How is this going to work for mainstream IT?

Rizqi: The culture aspect of this is step one. Historically, what hasn’t existed, because of the way that things have been set up between developers and operations folks, is psychological safety. It’s the sense that it’s okay to make mistakes. Failures are inevitable, and that’s fine as long as we have accountability and we can learn from them and move on. That’s a culture of psychological safety that needs to be created, and the thing that’s going to make it happen is the leadership.

It seems like a tall order for a product to deliver all of this. What does the Blameless product actually have in terms of capabilities that help companies move in this direction?

Rizqi: There are platforms available that make it easier to implement scrum and agile methodologies within an organization. It’s a combination of methodology plus platform or tooling that is going to enable the cultural shift that needs to happen. It’s never going to be one tool or platform that’s going to do it, or just a set of words or documents that’s going to make it happen. At Blameless, our vision is to enable any modern company, regardless of how they’ve been operating historically, to level themselves out and adopt SRE best practices so that it’s easier for them to maximize reliability and innovation at the same time. With Blameless, what you’re getting is a platform that makes it easy for folks to adopt SRE best practices, with blamelessness built in or encouraged through the platform, toil being automated away through our platform, a heavy emphasis on user happiness, and the ability to set those pieces up together and then drive incident response and resolution behavior differently.

What is the breakdown of the product modules that help you do this?

Wong: There are two big forks. There’s reacting to issues that impact user happiness. The two modules there are how do you resolve incidents and how do you learn from them with this idea of postmortems and post-incident learning. Then there’s the proactive side, which is looking at the system as a whole, setting service level objectives and defining error budgets before issues arise. This is where you as a team are identifying proactively: these are the areas I want to raise the reliability in, and this is when and how precisely I’m going to allocate my budget to go make those changes. The platform brings it together, bridging between this reactive piece and this proactive piece. These are four components that we have, and underlying them, we have some components around how we think about change management, which ties into the overall SRE set of principles.

I assume that for it to really work, you have to have somebody at a relatively senior level saying this is where we need to go.

Wong: It is easier said than done. Learning and improving is the key, so as long as there’s a desire to learn and improve, even if there’s a little bit of blame in that language at first, you don’t have to worry about that. As long as everyone’s goal is how do we figure out how to do this better, and many teams have that desire, you’re able to make progress.

When somebody joins a company as a developer, often they have pretty high-level skills. They can code very well. But when they’re put in the environment of a large complex system and asked to code, they don’t know the weak parts of the system versus the strong parts. Being senior means knowing the defects and the strengths of a complex system and knowing how to work around them to get the coding job done. Is there another aspect to seniority or have I captured it reasonably well?

Rizqi: That’s a really good capture of that. Each company is its own environment with its own realities, and when you’re learning from another company or in school, you’re learning in the ideal. It’s like physics in the real world with weather: launching a real rocket into space versus launching it in a simulation. I think senior engineers, in summary, can get shit done and help their team get product out safely to the real world.

The analogy we can draw to SRE is that the entire SRE function is to be that senior mindset, to know the weaknesses at every level. So that seniority, that understanding of the defects and of the strengths, is really what SRE is about developing institutionally so that you can systematically react to it.

Wong: Yes. And it’s having that memory over the long term so that you don’t have to have experienced every single issue yourself to learn from it. How, as an organization, can everyone learn from this incident, so that collectively you become more senior? That’s precisely one of the benefits.