The More You Know: A Podcast on the Benefits of Observability with Grafana Labs’ Raj Dutt
On the latest edition of the Designing Enterprise Platforms Podcast from Early Adopter Research (EAR), EAR’s Dan Woods spoke with Raj Dutt, CEO of Grafana, about observability. It’s a concept that Woods has been studying since learning of what Etsy did to make the development process faster. Etsy created huge amounts of observability so they could actually make many changes every day and if anything went wrong, they would find out about it immediately. This practice has become popular in DevOps and in monitoring operational environments. Dutt has been working on software structure that supports observability. Their conversation covered:
- 4:30 — The origins of Grafana Labs
- 18:20 — Grafana Labs’ focus
- 39:00 — The benefits of a platform approach
- 47:00 — The challenges of this approach
- 51:00 — The positive and negative trends in open source right now
Listen to the podcast or read an edited version of their conversation below:
Woods: Why don’t you tell us what observability is and why it’s important?
Dutt: Observability is decades old and comes from the control theory world. It’s been co-opted in the software world to mean a new way of monitoring and understanding complex systems and application infrastructure. A new word was needed because there has been a fundamental shift in how to deploy, monitor, and support applications. Etsy has been a pioneer of this way of thinking; it’s all about empowering developers and operations teams to deploy their software often and safely while preserving user experience. But for Grafana Labs, observability practically means we bring telemetry data together under a seamless experience to help people troubleshoot, understand, and explore the data that is coming out at an increasingly rapid rate from all systems and applications. By telemetry data, we mean the fundamental building blocks of understanding these complex systems, which for us boils down to metrics, logs, and traces.
I agree that a new word is required because when I first visited Etsy’s office, I saw dashboards everywhere that were tracking different things. When I asked about the underlying metrics, they said there were thousands of them. This was different than anything I had seen in any operational environment in a data center. So I agree that observability is a different thing than just operational monitoring.
Whereas the old IT operations monitoring was about checks and statuses of things, now systems are so complex and there’s so much data that it’s really a data analytics problem.
When did you found Grafana and why did you decide that you needed to?
The story starts with our cofounder, Torkel Ödegaard, who created the open source project in Sweden over the Christmas of 2013. We’re Grafana Labs, the company behind the Grafana project. The company started about a year after the Grafana project was started. Our third cofounder, Anthony Woods, and I had just come off running a hosting infrastructure company and we had seen this fundamental change in terms of monitoring and observability driven by the volume, the velocity, and the variability of all this data. We discovered the Grafana project, which was just starting out as an open source project. Torkel, Anthony, and I met up in 2014 and the timing was really good. The Grafana project was starting to see some initial adoption that was really promising. Grafana Labs was formed to accelerate this project that Torkel created, and we didn’t really have a business model or anything initially other than just this idea that there was a fundamental tectonic shift happening in the way people monitored and needed to understand their systems.
How do the artifacts of observability, the dashboards and all the monitoring, affect what goes on in an organization?
Dramatically. Across practical, operational aspects and capabilities, and culturally by breaking down data silos, and empowering teams.
At Etsy, the way that I had understood it, is that they would push changes all the time, but then every once in a while they’d push a change and the dashboards would go crazy. They realized, “Oops, something we didn’t understand just happened.” So it made it safe to push those changes because you would find out right away what happened. Observability gave confidence.
You hit it on the head with the ability to deploy whatever you’re innovating on very confidently and quickly and be able to see exactly what’s happening and roll it back. It really puts a lot of demands on how you need to observe your systems. It needs to be real time, it needs to be comprehensive, but all of the supporting capabilities around that, including how you develop and deploy and package your software and run your infrastructure, have to also be in place in order to achieve that nirvana.
What are the other venues in which having these detailed metrics being monitored at scale changes something that would matter to somebody in a company?
What you described is around an outage, like fixing a problem. Reducing MTTR is the headline there, but there are also huge cost savings from doing observability. An example would be people using things like auto scaling and usage-based cloud consumption models. In order to do that cost effectively you have to monitor everything in real time and correlate different metrics and logs. The cloud providers don’t make this easy, deliberately, and I think cost savings from optimizing your infrastructure are another big benefit of having all this data together.
It becomes a nervous system. You can detect when more traffic’s coming in and you can then trigger auto scaling of the infrastructure. Then you can detect when that’s done and then trigger reducing the scale of the infrastructure. So the idea is that it creates a much more detailed model of activity that you can then respond to.
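The feedback loop Woods describes here can be sketched as a small decision function. This is a minimal illustration in Python, not how any cloud provider or Grafana implements auto scaling; the `ScalingPolicy` class, the `desired_instances` function, and every threshold value are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values are illustrative assumptions, not figures from the interview.
    scale_up_rps_per_instance: float = 100.0    # add capacity above this load
    scale_down_rps_per_instance: float = 40.0   # remove capacity below this load
    min_instances: int = 2
    max_instances: int = 20


def desired_instances(current: int, requests_per_second: float,
                      policy: ScalingPolicy) -> int:
    """Return the instance count to run, given the latest traffic reading."""
    load = requests_per_second / current  # requests per second per instance
    if load > policy.scale_up_rps_per_instance:
        target = current + 1   # traffic is rising: scale out
    elif load < policy.scale_down_rps_per_instance:
        target = current - 1   # the surge is over: scale back in
    else:
        target = current       # within the comfortable band: do nothing
    # Never leave the configured bounds.
    return max(policy.min_instances, min(policy.max_instances, target))


if __name__ == "__main__":
    policy = ScalingPolicy()
    instances = 4
    # A made-up traffic trace: a surge followed by a quiet period.
    for rps in [300, 500, 650, 700, 200, 100, 50]:
        instances = desired_instances(instances, rps, policy)
        print(f"{rps:>4} req/s -> run {instances} instances")
```

The gap between the two thresholds is deliberate: without it, a reading near a single cutoff would make the system add and remove capacity on alternate checks.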
Exactly. Like what you described in the beginning as far as fixing an outage, that happens a fair bit but what happens more often is it’s less obvious than that. There’s a particular issue that may not affect a bunch of users, like a 99th percentile latency type issue or an issue that’s affecting an important query somewhere in the stack. And being able to tie together metrics, logs, and traces in an experience that’s seamless and allows you to look at an individual request, that’s really important. It’s not just about metrics to see something’s wrong, but actually being able to look down to an individual request so that if you’ve got an important customer who’s having a problem, you can provide that level of customer support. The ability to do that puts tremendous demands on the level of information that you need to collect and the level of correlation that you need to do, and you need to do that in real time.
If it turns out that there’s a problem affecting a customer, you can quickly get an idea of who that customer is and what their value is and whether you should react to it quickly. If you make a change and it has impact, you can determine the impact of that change for good or for ill. So it’s really about creating a very detailed model of your business that goes from a low level of granularity to a much higher level of granularity. But why should a CEO care about it? Because you can be managing something that is far more detailed than just the coarse metrics that people had in the past?
Sure, and at the end of the day so many companies have a web application or a mobile application or an API that is super important to the business. Essentially, if you care about things like user experience, performance, availability, at the end of the day you have to have someone caring about your observability story. Because the reality is the way we build applications today is much more complicated, much more distributed. I think the nervous system analogy that you drew is very appropriate. It’s very hard to understand it without really being able to look at the whole system and correlate data and be able to analyze disparate things. Why should a CEO care? For the same reason they care about a viable software strategy.
In the minds of CTOs that I talk to, observability is associated with DevOps. Why is it so connected with DevOps and is it only usable in a DevOps context?
I wouldn’t say it’s only usable in a DevOps context, but a lot of the agility that we talked about in terms of reacting to change, like the Etsy analogy that you brought up, can only be fully realized if you embrace fundamental organizational changes beyond your observability stack. Observability definitely, as a concept, is applicable to even traditional organizations and traditional software teams, but it really comes into its own in terms of realizing all of its benefits if you are more agile and if you do things like continuous deployment.
The difference between traditional monitoring and observability is that in traditional monitoring you have enough metrics that maybe a human could actually understand them. In observability, you have potentially thousands, hundreds of thousands, or, as you mentioned, even millions of metrics, and the picture that you’re drawing is one that you’re dealing with an aggregate. Then you may dive deeply in to use those metrics to create a very detailed model for various business purposes.
Now let’s talk about how the tactics of observability have changed over time. What does Grafana do, and how do you build up observability?
It’s all about the ecosystem. That’s our view at Grafana Labs and is also the stance of the Grafana open source project. In order for the Grafana open source project to be useful, in order for it to become an observability platform, you need to select your choice of things like metrics back ends, logs back ends, and soon traces, and there are many open source projects in that ecosystem. There are commercial vendors in that ecosystem, there are SaaS vendors in that ecosystem. What we’re all about as a company is providing choice to our users and our customers so that they can compose a platform of whatever back ends and vendors make sense to them. Our view of the world is that we’re not a database vendor and we believe that your data will always exist in different databases and will be disparate, and your consolidation play is probably going to be ongoing for a very long time because there’s never going to be a single database to rule all this data. The open source projects that are in the nest include projects that you would know like Elasticsearch, InfluxDB, Prometheus, and Graphite. We are involved very heavily in some of these communities and projects, such as the Prometheus project or the Graphite project, but it’s really important to us as a company that we provide first class integrations to all these different providers.
What you’re saying essentially is that imagine all of the world that is being monitored as pumping out streams of data. There are a variety of companies like Sumo Logic or Splunk or a variety of others that are good at monitoring that data and then asserting search criteria so you can find out whether a metric derived from that data is within a certain boundary or has exceeded one boundary or dropped below another boundary. Observability is about wherever you get it, and maybe that metric is coming from a real-time piece of hardware or maybe that metric is coming from a cloud-sourced level of good collection, wherever it’s coming from, Grafana’s there to accept it and then start managing and integrating this complex of metrics so that you can start making sense of them at scale. In a world where you only had commercial projects collecting data and defining metrics, Grafana could happily sit on top of that.
Exactly. The key thing here to remember, though, is a lot of other vendors will make similar claims but they’re still database vendors. In order to use their platform, you have to store all your data in their back end. Grafana allows you to use your existing back ends and not move that data, not ETL that data out, not batch load that data into Grafana. Grafana will, in real time using the native Splunk APIs or the native Elasticsearch APIs, pull the relevant data when you need it for the analysis that you need, for the dashboard that you loaded, for the alert that you need run. We don’t have a rip-and-replace mentality.
Let’s go through the nest of open source projects that are related to Grafana and say what each of them does in a complex.
In terms of the inner circle of projects that are the most popular within our ecosystem, the first one would be the Prometheus project, which is part of the Cloud Native Computing Foundation, and that’s a project that we have become heavily involved with over the last few years. Prometheus is a monitoring system for metrics, primarily for cloud native environments. Graphite is a time series database also for metrics. Elasticsearch is a mature project primarily for logs but also moving into metrics and it’s got very powerful search capabilities. InfluxDB is primarily a time series database also used for metrics. Those are probably four of the most popular databases for Grafana. We’ve also added SQL support. Grafana is basically software to allow you to visualize and analyze your data from all of these different databases. The four I mentioned are some of our more popular ones, but there are 42 different databases that Grafana can sit on top of, and that includes SQL also. A lot of people use popular SQL databases like MySQL and Postgres, and we’re seeing a lot of our users connect up those databases also.
Let’s say you’re in an environment and you’ve understood that you want to build more metrics and more monitoring of it. You then start using Grafana to access all of your data to define metrics. The first six months, you get 1,000 metrics defined. And then the second six months, you introduce it to two or three new departments and all of a sudden you have 5,000 metrics defined and then the whole company gets really excited about it and now at the end of the year you have 50,000 metrics defined. Now, we all know that in cybersecurity or in operational monitoring, there’s that problem of, “Oh, my God, all the alarm bells went off at once, what does it mean?” And so it seems like what Grafana is doing is saying, “Look, you’ve got this huge amount of metrics, we are going to do something to organize them so that now the information collectively from all of those is going to come out, and then if you need to dive into detail, you can do it.” What do you guys do to make that happen?
It’s all domain-specific. The raw data is pretty useless, but depending on what you’re doing, the aggregate views can be very useful. For every server you have you would probably collect dozens of metrics. But you would generally never look at the individual CPU usage of a single Docker container unless you were troubleshooting something. So it’s all about starting with top-level stuff and creating both dashboards and exploratory views that show high-level status and allowing people to drill down lower and lower. And there’s a point where metrics only get you part of the way. The whole experience of finding a needle in a haystack generally starts with an alert. You look at broad metrics, generally dashboards that show the high-level status of your systems, then you’ll generally dive down to more detailed metrics, and then you’ll switch to logs. And then you’ll probably switch to traces. That whole experience of contextualized switching within these observability primitives happens within Grafana.
At the top level you have these large summarized metrics and then you can drill down in them and each one of those may have five or six components, and then you can drill down into those components and those components may have components arbitrarily large. The idea is you create this tree and Grafana helps to categorize all these things so that you can create component metrics and build this tree structure out of it.
This whole structure generally takes place across a variety of dashboards. We’re adding more explore functions to allow you to dive off and do ad hoc analysis more easily. We have a methodology within the observability world and we call it RED. RED stands for rate, errors, and duration. Any system you think of, you can boil down to those three things. How fast something’s happening, like how many people are getting served coffee. Errors—how many problems, if any, are happening, how many people are getting wrong orders or incorrect change. And duration—how long are people having to wait. We think that conceptually those are the high-level things you want to look at. As soon as you see one of those three high-level metrics change, that’s your starting point, and where you go from there is generally dependent on your system, what you’re monitoring, how your application works.
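Using the coffee shop analogy above, the three RED numbers can be computed from a window of individual orders. The sketch below is a minimal illustration, not Grafana code; the `Order` record, the `red_summary` function, the field names, and the choice of a nearest-rank percentile for duration are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Order:
    wait_ms: int   # how long the customer waited (duration)
    ok: bool       # False if the order was wrong (error)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(orders, window_seconds):
    """Summarize one time window of orders as rate, errors, and duration."""
    if not orders:
        return {"rate_per_s": 0.0, "error_ratio": 0.0,
                "p50_ms": None, "p99_ms": None}
    waits = sorted(o.wait_ms for o in orders)
    failed = sum(1 for o in orders if not o.ok)
    return {
        "rate_per_s": len(orders) / window_seconds,   # Rate
        "error_ratio": failed / len(orders),          # Errors
        "p50_ms": percentile(waits, 50),              # Duration, typical
        "p99_ms": percentile(waits, 99),              # Duration, worst case
    }


if __name__ == "__main__":
    # Ten made-up orders in a five second window; two of them were wrong.
    orders = [Order(wait_ms=100 * i, ok=(i not in (3, 7))) for i in range(1, 11)]
    print(red_summary(orders, window_seconds=5))
```

Reporting a high percentile next to the median is one way to surface the 99th percentile latency issues mentioned earlier in the conversation, which an average would hide.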
Essentially what you’ve done is you’ve productized what these early adopters, these pioneers did through their own engineering teams. Now, that has become, as many open source programs do, something that resembles a product rather than you having to start over. You mentioned several times that you think observability has become a data and analytics problem, what do you mean by that?
You alluded to it when you said, “We have all these millions of metrics.” The scale is going through the roof. It’s no longer about an individual metric or one thing being off or something like that. It’s both because of the scale that we’re dealing with, but that scale is driven by the complexity of people’s infrastructure. People used to have a few dozen servers sitting in a rack, a co-lo in a data center somewhere. And then it went to a few hundred VMs with VMware, Xen. Then it went to thousands of containers and multiple availability zones and serverless and so it doesn’t matter anymore what a particular metric is. What you care about is some deeper levels of analytics. In order to understand it you need to ask more complicated questions like, “Hey, across the customers I care about, how many of them are having an elevated bad time right now in the last five minutes based on this?” And you may have to touch thousands of metrics to answer that question, but you don’t want to see any of those metrics any more. You just want to be able to ask the question and understand the system in the way that you want to. These systems have become like organisms that you have to look at the health and the state of in aggregate, through data analysis rather than the status of any one metric.
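The kind of question Dutt quotes can be sketched as a single aggregation over many raw samples. This is a minimal illustration, assuming each sample is a `(customer, timestamp_seconds, is_error)` tuple and that "a bad time" means an error ratio above a threshold; the function name, the sample shape, and the 5% threshold are assumptions, not anything stated in the interview.

```python
from collections import defaultdict


def customers_having_bad_time(samples, important, now,
                              window_seconds=300, max_error_ratio=0.05):
    """Return the important customers whose recent error ratio is elevated.

    samples: iterable of (customer, timestamp_seconds, is_error) tuples.
    important: set of customer names worth asking about.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for customer, ts, is_error in samples:
        # Only look at the customers we care about, in the last five minutes.
        if customer in important and now - window_seconds <= ts <= now:
            totals[customer] += 1
            if is_error:
                errors[customer] += 1
    return sorted(c for c in totals if errors[c] / totals[c] > max_error_ratio)


if __name__ == "__main__":
    samples = [
        ("acme", 990, True), ("acme", 995, False),   # half of acme's requests fail
        ("globex", 990, False),                      # globex is healthy
        ("initech", 999, True),                      # failing, but not on the list
        ("acme", 100, True),                         # too old to count
    ]
    print(customers_having_bad_time(samples, {"acme", "globex"}, now=1000))
```

The caller sees only the answer, a short list of customers, even though the function may have touched every sample to produce it.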
You have to create a semantic model over the whole system that you’re talking about so you understand what everything means and what story is being told by the data. But then you also have to create some level of significance so that you have some way of judging the significance of all of this stuff in your semantic model to determine whether anything was really important or not and then you could start paying attention to the details. It sounds like a big part of this is defining the normal state and then creating the boundaries for the deviation from that state. Given what we know about machine learning and AI and those methods being great at finding complicated patterns of normal and expressing deviations from that, it seems like it’s inevitable that that’s going to be playing a role when you’re using these metrics at scale. Have you guys started on that journey already?
We’re experimenting with and playing around with things that do more predictive analytics or pre cog. The larger point, though, is that most organizations today aren’t even using best practices, the kind of stuff that we both read on Etsy’s blog almost a decade ago. So if we want to talk about ML and AI, that’s great, but even with the idea of establishing what’s normal, what should be monitored, and what the high-level metrics are, there’s a lot of room for improvement.
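Establishing what is normal does not have to start with machine learning. A minimal sketch, assuming a metric whose recent history is roughly stable: treat the mean plus or minus a few standard deviations as the normal band and flag anything outside it. The function names and the default of three standard deviations are assumptions for illustration; this is a simple statistical baseline, not a description of any Grafana feature.

```python
from statistics import mean, stdev


def normal_band(history, k=3.0):
    """Return (low, high) bounds: mean +/- k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)  # needs at least two data points
    return center - k * spread, center + k * spread


def is_anomalous(value, history, k=3.0):
    """True if value falls outside the band learned from history."""
    low, high = normal_band(history, k)
    return value < low or value > high


if __name__ == "__main__":
    # Made-up request latencies in milliseconds from a quiet period.
    history = [100, 102, 98, 101, 99]
    for reading in (103, 120, 90):
        print(reading, "anomalous" if is_anomalous(reading, history) else "normal")
```

A band like this breaks down for metrics with daily or weekly cycles, which is one reason the more elaborate pattern-finding methods Woods mentions become attractive at scale.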
Using the metaphor we did earlier, there’s a lot needed to be done to grow the nervous system before you start making the brain better.
Yes.
What does Grafana mean by a platform approach versus consolidated approach and what’s the problem with a consolidated approach?
Consolidation meaning you’re going to move all of your data into a quote-unquote “next gen” database that’s going to handle all of your use cases for observability. Whether that’s a SaaS vendor, whether that’s an on prem vendor, whether that’s a vendor that’s coming from logs or metrics or traces or an open source project.
It means a productized approach from the dashboard all the way down to the metrics collection.
Correct. A monolith would be another way to put it. Everything in one box, the opposite of composable.
What’s a platform approach then?
When we say “platform approach,” what we primarily mean is composable, meaning the interoperability that we provide with your data wherever it lives. So you can connect any and all of the several dozen data sources that work with Grafana, or write your own data source since they’re open source, and basically compose an observability platform that is not a monolith. There’s so much innovation happening within the open source ecosystem with projects like Prometheus and Elasticsearch that we think in the open source world a monolithic, consolidated approach is just a nonstarter. We want to be able to leverage all of the innovation that’s going on in these early days when it comes to metrics back ends, logging back ends, and tracing back ends.
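The composable idea, where each back end keeps its own data and is queried in place when a dashboard loads, can be sketched with a small common interface. Everything here is hypothetical: the `DataSource` protocol, the `InMemorySource` stand-in, and the `load_panel` function are invented for illustration and are not Grafana’s actual data source plugin API.

```python
from typing import Protocol


class DataSource(Protocol):
    """Hypothetical minimal contract every back end adapter would satisfy."""

    def query(self, expression: str, start: float, end: float) -> list:
        ...


class InMemorySource:
    """Stand-in for a real back end; a real adapter would call its native API."""

    def __init__(self, series):
        self.series = series  # maps an expression to a list of (time, value)

    def query(self, expression, start, end):
        points = self.series.get(expression, [])
        return [(t, v) for t, v in points if start <= t <= end]


def load_panel(sources, targets, start, end):
    """Query each target from the back end where its data already lives."""
    return {
        f"{source_name}:{expression}": sources[source_name].query(expression, start, end)
        for source_name, expression in targets
    }


if __name__ == "__main__":
    sources = {
        "metrics-backend": InMemorySource({"cpu": [(1, 0.5), (2, 0.7), (10, 0.9)]}),
        "logs-backend": InMemorySource({"errors": [(2, 3.0)]}),
    }
    targets = [("metrics-backend", "cpu"), ("logs-backend", "errors")]
    print(load_panel(sources, targets, start=0, end=5))
```

Nothing is copied into a central store: adding another back end means writing one more adapter with a `query` method, which is the opposite of the monolith described above.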
What you have is a product that is partially a tool kit for creating different solutions, it’s not necessarily a configurable product itself?
You need to add to it to make it complete. But to complicate matters further, we participate very heavily in the Prometheus project, which is a metrics backend that is a popular database for Grafana. We also launched a logging project, open source, a year ago called Loki that is a logging database in some ways similar to, say, Elasticsearch or Splunk. The point of Grafana is all these data sources play equally, so Grafana’s truly database neutral. We still consider ourselves to not be a database vendor, even though we’re involved with some database-like projects.
Can you construct or tell me about one kind of heterogeneous use case where it really made sense to use different platforms or different components?
The obvious example would be something like Graphite for your metrics and something like Elasticsearch for your logs. So you want to bring metrics and logs into one experience and one view so you’re using two open source projects for their individual strengths and capabilities and the visualization layer for Elasticsearch only works with Elasticsearch. Grafana can help do that. Or you mentioned, say, Graphite and Prometheus, you look at the history of the Graphite and Prometheus projects, that’s actually a common situation where you have a Graphite installation that’s been in production for years, is running fine and then you have a team that’s playing around with Kubernetes, and if they’re playing around with Kubernetes, they’re using Prometheus. So boom, you’ve got Prometheus and Graphite running, what do you do? A lot of times you want to bring those metrics together, and so Grafana is the answer to that. Most Graphite users are already using Grafana and most Prometheus users are already using Grafana too, so it’s kind of an obvious play.
Now let’s tackle more thorny issues that you have to deal with given the business model you’ve chosen. There was an article in the New York Times in the last week or so about how Amazon is being criticized by a variety of different open source vendors for what they say is “strip mining the value of the open source projects without really providing much return for the community.” If you’re going to do a commercial product based on open source, you have to have not only a model for creating the software, you also have to have a model for capturing the value. Now you can have complex relationships depending upon the structure of the company: if there’s an open source community and then a separate commercializing entity, that can create tensions of the sort that we’ve seen with Redis Labs and the underlying open source community that creates Redis. So the question becomes, “How am I going to do that and stay a legit open source company instead of just being open source as a distribution model?” What is your approach to threading all these needles?
That’s a really good question, and one there’s no easy or glib answer to. We differentiate our open source software with commercially licensed software that is not open source; it’s called Grafana Enterprise. “Value capture” is a great term: open source itself has never been about value capture, it’s been about value creation. We are very judicious in what features we hold back for our enterprise version, but there are features that we do hold back and the enterprise version continues to be differentiated. The features that are in Enterprise will appeal to the largest companies in the world, like the top one percent of our users. As far as the whole strip mining argument, I don’t think it’s fair to characterize it as a complete negative, as you have to acknowledge the value creation happening if it’s done in the right way. I would like to think that there’s a way that the relationship between some of the cloud vendors and some of the open source companies can become something that could work in the long term, where the innovation and the community could be realigned. But for us, I would say that we have our commercially differentiated software and we will continue to have it, because we look at our open source projects as primarily about value creation, and when you try to have both in one place, I think it really complicates things.
What do you see as the positive and negative trends going on right now in open source for the enterprise and with respect to the public cloud?
For open source and the enterprise, I think it’s the best time ever, now, to be an open source company in the enterprise software infrastructure space. What’s changed, certainly within the observability tool set or ecosystem, is that the cutting edge stuff now is all happening in open source, whereas 10 years ago open source was your cheap and cheerful alternative to the commercial vendors. And with the larger vendors, it is encouraging to see that, whether it’s data analytics or observability or open source developer tooling, they’re also becoming more involved in the projects because they see that as a way that actually helps them with their use of those projects.