Optimizing the Cloud: A Podcast with Dremio’s Jason Nadeau

In this latest episode of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), EAR’s Dan Woods focused on how to optimize your BI, data warehouse and data lake spend in the cloud. He was joined today by Jason Nadeau, the VP of marketing from Dremio, who discussed Dremio and the general strategy he’s developed for optimizing this type of spending in the cloud.

Their conversation covered:
* 7:00 — How to take advantage of the cloud
* 13:30 — How performance can drive efficiency in the cloud
* 26:15 — Dremio’s optimized features
* 36:00 — Dremio’s financial model

Listen to the full podcast or read an edited version of their conversation below.

Woods: Could you explain briefly what Dremio is and where it comes from and what it does?

Nadeau: Yeah, you bet. The reason we exist is to help people do analytics faster and better on the data they have in their cloud data lakes. So we all know that more and more data is landing there, it’s landing there first, and it’s difficult for people to use that data and query it with performance. We make queries there lightning fast. We have a semantic layer to provide business abstraction and fundamentally allow business intelligence, data science to happen right on the data in your data lake. And one of the things that we do related to that is help people get from their on premise implementations, which many large enterprises have, into the cloud, because that’s absolutely the future direction for many.

That’s why we’re talking about this today, because the idea that I’ve always talked about data warehousing in the cloud is that when you go to the cloud, it’s a different product. When you move them to the cloud, the elasticity of the cloud, the way that you get charged, all of this makes things a different world and you have to behave differently, first to get the most benefit out of that world, but also to get the most bang for the buck. What I want to talk about at first is the ideal product in the cloud for data management. What would you say is the way we could describe the ideal characteristics of a cloud product that would be the most cost-effective? 

You want to avoid making copies of data and shipping copies around. Copies of data, you have to store them and so that takes money. One of the benefits of cloud object storage in particular is that it isn’t that expensive. But a dollar is a dollar. Frankly, the bigger challenge becomes trying to manage all those different copies of data. So if you’re going to build an ideal data management platform, you should, as much as possible, use the data that you have, keep it in place where it lands in the cloud and do operations on it virtually. Then you’re not making copies, you don’t have various different things to keep track of. We also really think that the ideal data platform is open. And what does that mean? It doesn’t mean open source. It means that organizations can use the tools that exist today, the best of breed capabilities for processing that data, for querying that data, and they can use things that don’t exist yet in the future. In other words, they’re future proof. Having the data be in an open format, be fully available to any range of services is really important. A completely open architecture that allows best of breed processing and, for today’s use cases, best of breed ways of accessing that data becomes important. It gives people flexibility, helps them to constrain their cost as well. It eliminates things like lock in.

The openness in the cloud data warehouse/data lake concept, what that comes down to meaning as a tactical matter is that the S3 objects are the canonical store and that you don’t have to load them into something else, so that any kind of service can operate on those S3 objects or whatever cloud object storage you’re using. And that seems like to me the most important principle of openness, because then you can have multiple workloads banging against the same set of data.

That’s 100% correct, Dan. Keep that data in one spot, in an open location, in S3. Don’t copy it and move it and adjust it into something where it is locked away. And when you really get into the way that the processing happens, or the queries happen in Dremio’s case, this comes down to compute and the way the applications that do the work are built. So much of the software that runs in the cloud today actually doesn’t take advantage of the inherent elasticity that the cloud makes available. If you think about what the cloud really does, it’s like a collection of different primitives that, if you really want to take advantage of them, you’ve got to rewrite your application, and re-architect it to use those. And that involves a lot more scale out and a lot more massive parallelism. The cloud is built differently than your traditional on prem scale up mostly infrastructure. One of the things that happens that software doesn’t do is it doesn’t let resources go when they’re done being used. And that’s because so much of the heritage of application design comes from the on prem world, where your infrastructure is fixed. People build their applications expecting that infrastructure is going to be there and they’re not going to shut down their processes because there’s no need. So even if things aren’t being used, if the software is not actually being used, it’s still running and that consumes cost in the cloud. So the ideal solutions in the cloud are fully aware of the fact that they should only be consuming compute cycles when they’re actively being required to run. 

But there are two aspects to this. One aspect is that you should be able to go dark, so if you’re not doing anything, you should be able to just pay for the cost of the storage and not have your computing workload processing systems going. The second thing though is that you should be able to bring up different shaped units of processing, different shaped configurations of your application to do different workloads. The ideal products in the cloud actually have that characteristic.

You’re absolutely right. I would take what you said about providing that visibility and granularity as a double click into are you actually being elastic? Assuming you have some elasticity, that’s great. But you also won’t be able to fully take advantage of it if you still have software that is architected to be still fairly monolithic. What happens (and this is very common in the world of queries and the space that Dremio plays in) is you’ve got query engines that have a single execution environment. When they’re running, they’re supporting many different query workloads all at once. And when that happens, you don’t have the granularity to ever stop anything because something’s always hitting the engine. So it’s sized to be too big right out of the gate, because people are having to size for peak demand across a diverse range of query workloads. But it can’t really ever shut down. Another way of thinking about this notion, which I alluded to a little bit earlier, is being very scale out or massively parallel. You can’t just be monolithic. You’ve got to have lots of granularity across many dimensions, and then you can take advantage of the underlying capabilities that the cloud provides.

So we’ve got no copies, open use and expose elasticity to benefit the user. If you have cloud products that did that, that goes a long way, but there are more. What are the next ones?

Another thing that becomes relevant in the cloud in particular, where resources are ephemeral, is performance and speed of operation. This also goes back to on prem infrastructure, where you’ve got your equipment, it is what it is, and you’re trying to get the most performance out of it that you can. Because that infrastructure’s not going away. You’ve already paid for it. In the cloud, you get to do things differently. And because that equipment is virtual, it’s ephemeral, then performance all of a sudden becomes a way of being more efficient, in the following sense. If you can complete a task faster, then you can stop the execution of that task and shut down the resources associated with it. You can also size the compute that you need to do a job at any given performance level to be smaller. You might be fine with the performance that you have, but the more performance you have, you can then essentially convert into lower infrastructure costs. Because you can size your engines down smaller and you can use them for less time because they’ll complete faster and you can shut them off. The ideal platform is really fast because that then also drives down infrastructure costs.

The idea here is that not every cloud product that you use like the data warehouse, data lake, data management has these characteristics. They’re not open. The elasticity is opaque or doesn’t benefit the user. The speed and efficiency falls short. So the goal is, of course, to get an A-plus in each one of these things. Now, I want to take a step back and talk a little bit about Dremio. You guys have constructed some interesting properties that I think would be useful to people who are building their own data supply chains or data lakes. I’ve been writing about Dremio and talking to people at Dremio for many years. I’ve always been interested in it because the whole idea was that you would have a landscape of data that you could then pull in. Dremio has built a bunch of constructs. I think the most important one is called Reflections, in which you can then take a bunch of data from an arbitrary number of sources, pull it in, and then, using SQL, you can do joins against it that are not dependent on the underlying source. And this is where you get virtual transformation of architecture, because all of the data that you land in Dremio, you can then start defining virtual transformations so that then you can access these purpose-built output datasets that don’t really exist until they’re materialized on demand. When I talked to Dremio, it was all about being able to do this across a landscape of datasets. Now it’s really mostly about supporting that data lake case, which is the data is in your cloud S3 object storage or the other object storage on the other clouds. Could you talk a little bit about how the evolution of the product came to realize that the cloud data lake was where it’s at and then the way you changed the product to actually support that use case better?

You’re right. If you go back in time, there was a more heterogeneous set of sources for data. Back then, there was a greater demand from our early customers to provide access to that data where it sat. There was still quite a bit of data that was sitting in operational sources outside of the lake and they wanted to pull that in. And we helped them to do that. That said, that’s still not the best experience for customers, right? You know, anytime you’re having to go and touch an operational source, you know, you’re at risk of things being slower. So fast forward to where we are today and data lake storage—in particular in the cloud, S3 for AWS and ADLS for Azure, for example—that is really now the primary place that data is landing. And so it’s either getting there first from the applications that are generating data or organizations are already building or have built data pipelines from operational sources into the data lake. So the data lake is now the canonical source. But that’s where the data is. That’s good news for customers because it’s starting to provide, and actually is providing for many today, this single place where they can go and start to use their data. They no longer need to also or separately ingest it into some other source or other store like a traditional data warehouse.

One of Dremio’s core dogmas is that you want to separate query performance from the underlying source of the data. You’ve realized that the most important source of the data is now in object storage. And you were already well positioned to deal with that. And so now that’s one of the things that, you know, if people are planning on moving toward a data lake where instead of storing it in expensive infrastructure, you store it in a much cheaper form and then create whatever transaction engine and compute engine you need, that’s the kind of product that Dremio wants to be. It’s not like all of that S3 storage is only used by Dremio, right?

That’s absolutely correct. And that goes to the whole point of an open architecture, where you can bring best of breed processing engines to bear. The data is now in these cloud object stores. That’s what we were always built for from an internal architecture point of view. Our life has gotten a lot easier and our customers are having a much better experience now because their data is in a more centralized location. We’ve optimized and added other optimizations along the way, like acceleration technologies. For example, we have something called a columnar cloud cache. We’re end to end columnar. There are a bunch of things that we’ve done specifically for the cloud, like predictive pipelining, that we have that allows us to be really fast on reads for these queries. We’ve really optimized for performance on cloud data lake storage. Now, that said, there’s still going to be some situations where people have data in additional sources outside of the data lake. So we do still maintain connectors and provide support for people to use those where they live. And so in the cloud world, maybe they’ve got some data in RDS or in Redshift and so we can still connect to those things. But more and more, the whole architecture is centered around a big center of gravity on the data lake storage environment.

In your latest release, you’ve added some stuff that enhances your support for the perfect product we talked about at the beginning of the podcast. What are they and what was the motivation for them?

So the motivation is, first of all, performance. And then efficiency directly. We were looking for ways to drive out the cloud infrastructure cost and make that compute as efficient as possible, taking advantage of the elasticity. The third factor was fundamentally around consume-ability and ease of use and really creating a fast onboarding experience for people that are new to Dremio, giving them an experience that is very much like SaaS, but without actually being SaaS. And this is intentional because the other thing that we’ve found a lot of enterprises want is full control of their data platform and they don’t want to cede control to a SaaS vendor and send their data through into a SaaS vendor. Many want to have their data live in their own cloud account. 

Especially as the data gets more valuable and security gets more important and that a data breach would cost more, you don’t want to necessarily trust that the operational staff of the vendor is going to really get the job done. And so you can keep control of that, but then what you’re saying is that the best way to do this would be to offer an instance where what gets installed is managed by the vendor, but then it can be installed automatically at any time. Do you do this through machine images or have you achieved this level of support yet?

That is fundamentally the new capability that we recently brought out. We created a specific version of Dremio heavily optimized for AWS and we call it the Dremio AWS edition, not surprisingly. There are a couple of big capabilities inside that edition that make this all possible. The first is something called elastic engines, which is about providing that granular compute performance and ability to elastically spin up and down for the individual query workloads that you have. And then a second is something that we call parallel projects, which wrap a whole bunch of automation around the software experience of Dremio and all of the resources, anything created and persisted by Dremio in S3 or EBS, for example. It’s all in one contained zone and that way business units can have their own compliance and keep full control of all their data.

You can say here’s my Dremio zone and any working S3 objects or whatever you need to create Parquet files or whatever you do with your reflections, you can do there. But then you can have that be happening there and then another S3 zone where all the crown jewel data is.

Yeah, that’s correct. Dremio doesn’t ingest data. We’re not making copies and managing copies and doing things the way that other people do. So the data is going to stay where it is. But we have to take care of our own stuff. Again, this is part of being end to end columnar. This capability called parallel projects provides that for us. We have automation around the deployment initially through the AWS marketplace. We have automation around that configuration of Dremio to do things like use and configure the columnar cache that I talked about, as well as data reflections, and get those set up to work on top of S3. Even backup, like we configure backup for our projects. And also upgrades themselves. We make it very, very simple. We think our elastic engines are going to be very exciting for people because any time you can save a dollar, that’s great. But boy, if you can save 90% on your cloud infrastructure cost compared to traditional query engine approaches, that’s pretty darn interesting. And so that’s what we’re doing with elastic engines. They’re multiple engines. We move from a world where you have one to a world where you can have as many as you want. There’s not a limit. And then those individual engines, therefore, are sized appropriately. And this goes back to one of the other things we do. We’re super fast. But the fact that we can be this extremely granular with the engines, and that they now get sized just to fit any individual workload, so there’s no more overprovisioning (and no more under-provisioning, for that matter), means you get to size just the right way. And then we have automation on top that does elastic scale up whenever we see queries coming in. That engine runs for as long as it needs. When query activity stops, we automatically elastically scale it back down to zero.
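The contrast described here, one shared engine sized for peak demand versus separate engines that scale to zero when idle, can be illustrated with a small cost model. This is only a sketch under stated assumptions: the workload names, node counts, active hours and the flat per-node-hour rate are all invented for illustration, and the always-on engine is assumed to be sized for every workload peaking at once. None of it comes from Dremio.

```python
from dataclasses import dataclass

# Hypothetical flat price per node-hour; real cloud pricing varies.
NODE_HOUR_RATE = 1.0


@dataclass
class Workload:
    name: str
    nodes_needed: int            # nodes required while queries are running
    active_hours_per_day: float  # hours per day with query activity


def always_on_cost(workloads, rate=NODE_HOUR_RATE):
    """One shared engine, sized for all workloads peaking together, running 24h."""
    peak_nodes = sum(w.nodes_needed for w in workloads)
    return peak_nodes * 24 * rate


def elastic_cost(workloads, rate=NODE_HOUR_RATE):
    """One engine per workload, billed only for the hours it is active."""
    return sum(w.nodes_needed * w.active_hours_per_day * rate for w in workloads)


# Invented example workloads, not measurements.
workloads = [
    Workload("dashboards", nodes_needed=8, active_hours_per_day=10),
    Workload("ad_hoc", nodes_needed=4, active_hours_per_day=3),
    Workload("nightly_reports", nodes_needed=16, active_hours_per_day=2),
]

shared = always_on_cost(workloads)
elastic = elastic_cost(workloads)
print(f"always-on: {shared:.0f} node-hours/day, elastic: {elastic:.0f} node-hours/day")
```

The gap between the two numbers depends entirely on how idle the workloads are. With workloads that run around the clock, the two models converge.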

This turned out to be an advantage because a variety of people are using software that was built and adapted over time that isn’t necessarily architected in a way that is intended to optimize the S3 storage as the canonical storage model. And so you said that you have a cost advantage over other engines that can use S3 as a canonical source but don’t have the other properties.

That’s absolutely true. We’re extremely easy to get up and running and keep running. We’re providing a service-like experience. But then just from a pure efficiency perspective and performance perspective, we’re on average 4x faster, thus, a quarter of the cloud infrastructure costs in order to get an equivalent job done. Then you layer on the new elastic engines and the fact that we eliminate all of the overprovisioning and we only run when there’s actually query activity happening, we cut out another 60%. You remove 60% of the remaining 25% and you’re back down to 10%, which is what results in the 90% savings. It’s a huge reduction in cloud infrastructure costs that frees up money for people to put into other things. And that’s typically what happens.
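The arithmetic behind the 90% figure can be checked in a few lines. The 4x speedup and the 60% reduction are the numbers quoted above. The function name and the assumption that cost scales inversely with speed are illustrative, not something stated as a formula in the interview.

```python
def remaining_cost_fraction(speedup: float, elastic_reduction: float) -> float:
    """Fraction of the baseline cost left after both effects are applied.

    Assumes cost is inversely proportional to speed, so a 4x speedup
    leaves 1/4 of the cost, and that the elastic reduction applies to
    whatever cost remains after the speedup.
    """
    after_speedup = 1.0 / speedup
    return after_speedup * (1.0 - elastic_reduction)


after_speedup = 1.0 / 4.0
remaining = remaining_cost_fraction(speedup=4.0, elastic_reduction=0.60)
savings = 1.0 - remaining

print(f"after speedup: {after_speedup:.0%}")
print(f"after elastic engines: {remaining:.0%}")
print(f"total savings: {savings:.0%}")
```

The two reductions multiply rather than add: 25% of the baseline times the 40% that survives the elastic reduction leaves 10%.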

When you do start using the full Dremio product, how do you capture value? What do people pay for? 

We get to do things differently and we get to be aggressive too. Because that shakes up markets. And so with this new AWS edition, we’re really excited because the product is free for any scale. There’s no limitation on the number of users, there’s no limitation on the number of queries. All of our acceleration technologies are there. This isn’t a promotional offer, that’s just free, period. And many users are absolutely full on in production and they don’t pay Dremio anything. And all these benefits in terms of cloud infrastructure savings we talked about, they get all that too. The way we make our money is when people expand their usage, typically across multiple departments. And then they say, okay, we’ve got lots of different teams and different departments, we want to create some standardization, we want to make sure that we’ve got role-based access control and other enterprise security wrapped around the overall usage so that they have more governance on who has access to what data right across the organization. That’s when they need to actually sign up for a subscription. So it’s not because of the query work. It’s not because of the analytics work. It’s really security.

And when they do subscription and support, do they sign up by seats or do they sign up by data volume, or how does that work? 

They’re signing up essentially based on nodes. And in the future, this will become compute hours. But essentially what they’re paying for is the amount of compute that they’re using for Dremio. For the free version of Dremio, the markup is zero. But even our paid customers, we’re driving down the amount that they pay to us with these new elastic engine features. And that’s okay. We’re fine with that because we know that that’s the right thing to do. Like the architecture of the cloud should be, and the right way to use the cloud is you pay for what you actually use. 

Who are the people that come in, see this, try it, love it and then want to use it as either a free or commercial customer? What are the sweet spot use cases?

There are three main use cases. First, anybody who wants to do business intelligence and they want to do it directly on their data lake storage, that’s the number one sweet spot for Dremio. They’ve got a bunch of data, it’s sitting in the data lake. They’ve had a very difficult time trying to query it directly and so they’ve been going through all sorts of hoops, moving data into data warehouses or just suffering with slow performance with other query engines and whatnot. Second is data science. We make it very easy for data engineers to collaborate fast with their data scientist counterparts, to get access to wide ranges of new datasets. And then third, migration from on prem data lakes to the cloud. Whereas the first two use cases really could be any size, large enterprises for sure have built on prem data lakes on technologies like Hadoop and HDFS, and the vast majority have a cloud strategy to move to S3, to ADLS, and do things in the cloud for all the reasons we’ve talked about. You can be just so much more effective and efficient once you’re there and when you do it right.

What you’re saying is when somebody’s done the work to actually declare and define and get everybody to understand a data product, you might access that data product through a bunch of reports that are created by a BI system in a data warehouse. But you can also move that whole data product into Dremio and Dremio can replicate the delivery and you can deliver the same data product to the same people, just with a cheaper infrastructure and a better infrastructure.

Yeah, that’s right. And one of the fundamental benefits of Dremio is we make this very self-service too. We make it very easy to expose that semantic layer for lots of BI users and data scientists to come in and just access the data themselves. And the way they do it is via virtual datasets, because, again, we’re not making copies. Everything is a virtual transformation. The semantics that get built up by data engineers, the virtual datasets on top of virtual datasets, all of that is the semantic layer. We help people build that out on prem. We help them to take that to the cloud.
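The idea of virtual datasets layered on top of virtual datasets is close in spirit to stacked SQL views, which define a transformation without copying the underlying rows. The sketch below uses Python’s built-in sqlite3 module purely as a stand-in. It is an analogy, not Dremio’s implementation or API, and the table, view and column names are made up for illustration.

```python
import sqlite3

# In-memory database standing in for data that already sits in one place.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (
        order_id INTEGER, region TEXT, amount_cents INTEGER, status TEXT
    );
    INSERT INTO raw_orders VALUES
        (1, 'east', 1200, 'complete'),
        (2, 'west',  800, 'cancelled'),
        (3, 'east',  500, 'complete'),
        (4, 'west',  700, 'complete');

    -- First virtual layer: a cleaned-up view, no data copied.
    CREATE VIEW completed_orders AS
        SELECT order_id, region, amount_cents
        FROM raw_orders
        WHERE status = 'complete';

    -- Second virtual layer, defined on top of the first view.
    CREATE VIEW revenue_by_region AS
        SELECT region, SUM(amount_cents) AS revenue_cents
        FROM completed_orders
        GROUP BY region;
    """
)

# Consumers query the top-level view; results are computed on demand.
rows = conn.execute(
    "SELECT region, revenue_cents FROM revenue_by_region ORDER BY region"
).fetchall()

# Only the original table is physically stored; the views are definitions.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(rows)
print(tables)
```

In this analogy the business-facing names live in the views, so a definition can change without rewriting or duplicating the stored data.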

And in terms of your financial history, you’re on your series C round. You had a $70 million round led by Insight Partners. And when was that?

We had a big round. It was exactly the amount of money we wanted to raise. Insight Partners, based out of New York, was the lead investor here. That was just in March. It was a real vote of confidence in our financial performance, the success of our business and our prospects as well, given all the trends of data moving to and landing in the cloud and the open architecture approach, which is really the way that people are building their modern data platforms these days. We’re financially on a very sound footing. We’re using that money to accelerate a number of different things, in particular the R&D side.