Dremio’s Core Dogmas: A Q&A with CEO Tomer Shiran
In a recent edition of the Designing Enterprise Platforms Podcast at Early Adopter Research (EAR), EAR’s Dan Woods spoke with Tomer Shiran, the CEO and founder of Dremio. Their conversation focused on how to create a data lake for an end user, a topic that anybody who’s worried about BI and analytics should be interested in. This podcast series looks at various ways of understanding technology and how to combine technologies to create platforms that solve vital problems for the enterprise. There is probably no more vital problem than getting your data infrastructure right. This is an edited version of their discussion, which can be heard in full on the podcast.
Woods: I find Dremio fascinating because it’s got a unique perspective on how to make a data lake work. I’d really like to cover some basics about why data lakes are here to stay and why the data lake has become the new paradigm for organizing your data at scale to solve problems. Why do you think data lakes are here to stay?
Shiran: There’s been a lot of history around this, but if you think back to where we were maybe 15 years ago, many companies had one database and one BI tool and it was as simple as that, maybe you bought that from Oracle and that was it. The world has really evolved a lot since those days and now you have many different databases and data storage technologies where data resides; it’s very heterogeneous in most companies. You want to be able to analyze all that data, bring it together and be able to ask questions of the data regardless of where it sits. After that really monolithic era, we went to an era of data warehouses, where we had ETL tools and companies would develop all these ETL scripts and pipelines to move data into a data warehouse. That worked really well for a while, but then we got to a point where we value data agility and self-service a lot more than in the past, and BI tools such as Tableau and now Power BI and Looker came in.
One of the differences is that the number of types of data and the number of sources of data that we had in the data warehouse era have just exploded. Companies routinely want to get to data that has a variety of different structures. The ETL to make that happen in a data warehouse would be even more nightmarish than what is currently in place. I think that that was also another reason that data lakes are here to stay, is that they handle that heterogeneity and scale better.
Yes, absolutely. It’s the type of data as well, so things like web clicks and semi-structured data which didn’t necessarily fit well into traditional data warehouses, and then you had NoSQL databases and SaaS applications.
Many of the lake implementations are overcoming the deficiencies in the access methods for each source so that you can have a pleasing experience. And the idea is that if your data’s in a flat file or if your data’s in a repository that might be slow, some of the data lake technology is actually overcoming that, caching it, doing whatever is needed so that then you can have a good join experience and a good query experience even though the underlying source doesn’t really offer that.
Yes, I think that’s 100% true, many of these database technologies or these data storage technologies, they were designed with a specific purpose in mind, whether it’s to be able to store semi-structured data or JSON documents efficiently or whether it was to be able to do search efficiently or just to store files which maybe you could easily obtain from your partners or other vendors. But those data storage technologies and formats weren’t necessarily the most efficient for analytics. And on top of that, those systems might be serving some OLTP or operational use case where you really don’t want to be performing any analytics on them. And so I’d say the data lake and kind of related technologies allow you to be able to explore and analyze that data at an efficient speed without the limitations of those various source systems.
Let’s move on to why end users should be interacting directly with the data lake. Because in many implementations of data lakes, data lakes are sort of like a warehouse that’s not really intended for end users but it seems to me in most of the implementations I’ve seen that have been successful, it’s been clear the biggest way that this can create value is when everybody starts using the data lake. Why do you think that direct use is important?
If you think back to why we created data lakes, the vision has always been: what if we could get all the data in an enterprise or in an organization into one place and allow users to then be able to access and analyze that data? That would be the holy grail of analytics. It would break down the silos, and the business would get a tremendous amount of value from all that data being accessible. I think it’s both scalability and the ease of being able to have access to that data.
The way I see this working out is that if you look at most companies, whether it’s explicit or implicit, there is a de facto set of data that I call the data product. The data product is the data that everybody understands, that is delivered in regular updates, that comes to you in reports or dashboards that become part of the culture. The data lake is going to play an important role in delivering and creating the data product—but in addition, you have that ad hoc analysis, you have the desire to use raw data in various forms, whether it’s the landed data that just came from the sources, the modeled data that was put together to collect things, or the purpose-built data that is built for a specific type of analysis. What is going to be needed inside that data lake to allow that work of solving problems to happen?
We’re living in a world now where more and more people want to be able to do that ad hoc analysis. A lot of people now, they leave college with the ability to write SQL and write Python. And so in order to really enable the data lake to serve all these users, we have to make it easier for them. One is make the access intuitive, provide kind of a logical layer that they can interact with very easily, collaborate with their colleagues, build these new virtual data assets. And then also make sure that things are fast enough because we often talk about performance as a separate thing but if you’ve used a BI tool such as Tableau and you drag the mouse and it takes 10 or 15 minutes to come back every time you drag the mouse, that’s really impossible to use.
Now let’s get to Dremio. Basically what Dremio does is support this ability of creating a data product and delivering that as a regular production pipeline. And at the same time, with the same tools, it also supports the ability of doing ad hoc use without creating a mess. And when I say mess, what I mean is a set of disconnected extracts of data like you have when people use lots of spreadsheets, where we don’t know the provenance of the data, we don’t know the freshness of the data, we don’t really know much about it at all except that somebody grabbed it and put it together. And so if we can avoid those extracts, if we can do ad hoc work and then have that ad hoc work be moved into the data product, we’re really ahead of the game. Can you talk about the vision behind the actual technology and implementation of Dremio to bring this vision to life?
The value proposition of Dremio is to make that data lake accessible, intuitive and fast and to be able to bring together the data without having all these different copies of data and having to create data marts and build cubes and build extracts. When you have lots of copies, you lose data governance, and those types of things also make it impossible to achieve self-service because when the stack is complex and you have all these derivations of the data and create copies and create extracts, there’s no way that a BI user could actually do anything on their own.
Let’s go through the assumptions you made and the beliefs you have about how to build the right solution, which we covered in an article we did in November 2017 called “Can Dremio finally deliver on the promise of self-service data?” In that article, we explained what Dremio believes are its core assumptions about how to build that right service. The first is that you believe that the self-service data access problem cannot be solved by making external, post-query copies of the data, which creates this mess that we just talked about. And so far, has avoiding extracts been proven to be an important part of the success that your customers have had?
A hundred percent. In fact, we have many companies that have tens of thousands or hundreds of thousands of views, where the users are interacting in a logical way and that whole model, that ability to be so data driven, just wouldn’t be possible if they were creating copies of data. You can’t imagine that many copies of data, which is kind of what you get if you’re using things like data prep tools and ETL tools.
What Dremio delivers instead of an extract is a query that allows you to access that data whenever you want. That query is built from lots of subqueries that can be rolled up: you have landed data, then you have queries on top of that which create views, and then views on top of views. Eventually you have a view that you query directly. You’re not arguing that you don’t need purpose-built data sets, or that you don’t need a lot of them; what you’re arguing is that those should be accessible on demand and shouldn’t be extracted.
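The layering described here—landed data, then views, then views on views, with nothing ever extracted—can be sketched with plain SQL. The following uses Python’s built-in sqlite3 as a stand-in engine; the table and view names are hypothetical illustrations, not Dremio’s actual API.

```python
import sqlite3

# An in-memory database standing in for landed data in a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EMEA", 100.0), (2, "EMEA", 50.0), (3, "APAC", 75.0)])

# Layer 1: a "modeled" view over the landed data.
conn.execute("""CREATE VIEW clean_orders AS
                SELECT region, amount FROM orders WHERE amount > 0""")

# Layer 2: a "purpose-built" view stacked on the modeled view. No copy of
# the data is ever materialized; every query runs against the source on demand.
conn.execute("""CREATE VIEW revenue_by_region AS
                SELECT region, SUM(amount) AS revenue
                FROM clean_orders GROUP BY region""")

print(conn.execute("SELECT * FROM revenue_by_region ORDER BY region").fetchall())
# -> [('APAC', 75.0), ('EMEA', 150.0)]
```

Because each layer is a view rather than an extract, updating a row in `orders` immediately changes what `revenue_by_region` returns—there is no stale copy to refresh.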
That’s right. The need for different views into the data and different combinations of the companies’ data, that’s driven by the business requirements. So the need of the business to understand the customer, to understand the risk that a financial services firm might have and the exposure that they have and all these different use cases that exist within an enterprise. That’s what drives the need for different views of the data. And then just the fact that different people have different preferences. I might want to look at the data in one way, you might want to look at that same data set in a slightly different way, maybe we’re each responsible for a subset of the accounts that the company has. If you start creating hundreds of copies and for each of those you then have to do all of the additional work on that, that’s obviously not realistic.
The second thing is that if you’re going to rely on queries to be your unit of delivery of the purpose-built data, you better have them be delivered fast. Your next dogma is that you have to have an in-memory columnar engine. Essentially what you mean is you have to have a fast query but you guys believe that the best way to deliver that is through an in-memory columnar engine and you base yours on the Apache Arrow project, which you founded, and that’s the way to deliver maximum efficiency in both query speed and efficient use of memory and computing resources.
That’s evolved a lot since you wrote that article back in 2017. In fact, Arrow is now downloaded 4 million times a month. Arrow really started as a project that we open-sourced a few years ago when we were still in stealth. The idea was: let’s take Dremio’s memory format and open-source it with the goal of that becoming an industry standard. And there were a lot of organizations and projects out there that had the same kind of interest in having this standard in-memory columnar representation for analytics. When we started Apache Arrow, we did that in partnership with the Pandas community and with Wes McKinney, who is kind of the primary author of Pandas, and really worked together on building that first implementation and support for various languages. Initially it was just a memory format and a number of libraries in different languages to access that memory. As we had anticipated, there was a significant need in the market among many different organizations for something like this. Fast forward to today, more than 300 developers have contributed to the project, many of them engineers at companies like Nvidia and Intel, working on this and making it really fast on various CPU and GPU architectures.
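The core idea behind an in-memory columnar representation like Arrow’s can be shown with a toy sketch in plain Python. This is only a conceptual illustration—Arrow’s real layout also includes validity bitmaps, offset buffers for variable-length types, and alignment guarantees.

```python
from array import array

# Row-oriented: each record is a separate object, so scanning one field
# means touching every record and chasing pointers.
rows = [{"user": 1, "clicks": 10}, {"user": 2, "clicks": 3}, {"user": 3, "clicks": 7}]
total_row = sum(r["clicks"] for r in rows)

# Column-oriented (the idea behind Arrow): each field is a contiguous typed
# buffer, so an aggregation reads sequential memory and can be vectorized
# with SIMD instructions by a real engine.
users = array("q", [1, 2, 3])    # 64-bit integer buffer
clicks = array("q", [10, 3, 7])  # 64-bit integer buffer

total_col = sum(clicks)
print(total_row, total_col)  # -> 20 20
```

Both layouts answer the same question, but the columnar buffers are what let engines share data across process boundaries with zero copies, which is the point of standardizing the format.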
What that does in terms of the Dremio architecture is deliver that fast query so that you can handle huge numbers of queries. The next dogma is that you believe you have to have a zero-copy, zero-serialization and deserialization structure that allows the same data to be used by multiple consumers, so you can achieve efficient use of memory and compute.
Yes, and we’ve really enhanced how the system works so that now you don’t even have to think about that; the system automatically brings things in and out of memory as they’re being accessed. So there’s really no decision for IT or for the users as to what is being cached and what’s not being cached—that’s entirely transparent to the user. From their standpoint, they’re just getting really high performance by querying the data and by our ability to leverage the latest Intel SIMD instructions and GPU technology. On top of that, one of the new initiatives we’re doing with Apache Arrow is something called Arrow Flight, which is a parallel RPC layer. It’s actually something we’ve partnered with the Pandas community and with Two Sigma on, to create an alternative, a more modern approach than ODBC and JDBC. It’s really designed for the data scientist, and we’re seeing about a 50x performance improvement over ODBC.
Getting back into the realm of the users, the next dogma is that using SQL to define the queries is an important way to make sure that you can build on the knowledge that a lot of people already have and that any trickiness about handling JSON should be underneath the hood. You shouldn’t ask somebody to learn a new SQL to deal with Dremio.
Yes, a new SQL or a new DSL, a new domain-specific language. I think the world has kind of settled on a standard way of asking questions on structured data and that’s SQL. And it’s settled on that in the same way that English has maybe become a common language. Is it structurally the ideal language? Maybe not, but everybody already knows it and you’re not going to change that and there are hundreds of tools that use it and millions of people that have learned it and universities that teach it and high schools that teach it. There will be NLP front ends that translate natural language questions to SQL and there will be all sorts of interfaces. But fundamentally, the kind of core language to ask a question on structured data, that’s SQL so that has to be the language.
And then the other thing you’ve asserted as a dogma is that you want to separate the definition of the virtual data sets that represent the data and all the special purpose data sets from the optimization of the query. The idea is that you can use anything under the hood you want to optimize that query, based on the usage data and the resource constraints.
The core of making the data lake appealing to an end user, and of making Dremio work, is this notion of separating the logical layer from the physical layer. Our belief is that users should only have to interact with a logical layer; you can’t expect a consumer of data to go start creating new data assets and optimizing data structures on disk and things like that. But a system is well equipped to do that. So our approach is: let the users interact in the logical layer, let Dremio and its optimizer technologies figure out how to maintain various aggregations of this data, and then have an optimizer that can automatically leverage those things when the user’s interacting with the logical layer. And that’s a very different approach from the traditional way in which data infrastructure was built, where there was tight coupling between how data is physically represented and what users need to access.
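The separation described here—users write logical queries while the system transparently answers them from maintained physical aggregations (Dremio’s term is “reflections”)—can be sketched as follows. This is a conceptual illustration using sqlite3, not Dremio’s actual optimizer; the table names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("us", 4), ("us", 6), ("eu", 10)])

# Physical layer: a precomputed aggregation the system maintains behind
# the scenes (hypothetical stand-in for a reflection).
conn.execute("""CREATE TABLE agg_events AS
                SELECT region, SUM(value) AS total, COUNT(*) AS n
                FROM events GROUP BY region""")

# Logical layer: the user only ever writes this query against the raw data...
logical = conn.execute(
    "SELECT region, SUM(value) FROM events GROUP BY region ORDER BY region"
).fetchall()

# ...while an optimizer could transparently answer it from the aggregate,
# scanning far fewer rows. The results are identical either way.
physical = conn.execute(
    "SELECT region, total FROM agg_events ORDER BY region").fetchall()

print(logical == physical)  # -> True
```

The design point: because the user’s query is expressed logically, the system is free to swap physical representations (rebuild, drop, or add aggregations) without any query ever changing.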
You guys have always both believed that your data platform should be based on open source but you’ve also constructed it in a way where you have separated the open source innovation from the productization that you’re offering. Why do you think that you’re going to avoid the same problems that we’re seeing with the other companies like MongoDB or Redis that have had to make some significant changes in the definition of themselves as open-source companies in order to preserve their commercial interests?
I believe that companies want to adopt open technologies. And it’s more than open source: how is the data stored, for example? Is it stored in open formats like Parquet and text and ORC and JSON? Can the technology be used in different clouds, and both in the cloud and on-premise? And is the source code accessible—can people extend it? All these different characteristics I think are really important, because there are a lot of new technologies, but people were really frustrated after they had gotten locked into some of these on-premise technologies, and they don’t want to make that same mistake again. And so I think the future is very much going to be one where companies embrace this kind of open source and open standards. And then in terms of how we do it specifically, I do think it’s important for any startup or any company to think about the overall climate and what the risks to their business are. In our world, we want to be able to provide a cloud service and we want to be able to monetize that, and we don’t want other companies to ride on our innovation by offering Dremio as a service, for example. So we’ve structured our community edition and our open source in a way that makes it very difficult for that to happen, while at the same time enabling our customers to enjoy all the benefits of open source—and we have many thousands that adopt it every month and use our community edition.
It sounds like what you’ve done is thought it through and realized that you would like to participate in an open innovation process, which is why Apache Arrow is an open-source project, and you also want to give people access to that open innovation process. But you also realize that an open-source project is not a product. And your product is going to be something that is carefully designed so you can both allow it to be used in a freemium way and also capture value as people use it more and more intensely.
Yes, that’s correct, and I think anybody who’s using Dremio and getting value out of it realizes that they want the company to be successful, because the more revenue Dremio has as a company, the more we invest back into engineering and developing new capabilities and new features. So it’s really a win-win, when you think about it. But it does require thinking about these things from the beginning.
The next one is one of the most interesting dogmas that you have, and that is the idea that the performance of the source repository needs to be overcome by your querying mechanism. Now, this is really interesting. Because you offer an ability to do optimized queries over lots and lots of different source data sets, you actually make data joinable that would not be joinable in any easy way outside of Dremio. In addition, you make data accessible that otherwise could not be accessed quickly, accessed in a random manner, or accessed with a query. Have you found that your customers have been able to break new ground because they’ve all of a sudden been able to combine data sets they’d never been able to combine before?
Absolutely and let me give you two examples of that. One of them is UBS, a large multinational bank, and UBS has been able to for the first time combine trade and customer data. And that’s something that they were able to do because of Dremio for the first time and that’s just been something that was traditionally really challenging because the trade data was stored in one set of systems and the customer data was stored in a different set of systems and in the data lake. And then another example is Royal Caribbean Cruise Lines. They have created a very modern data architecture in the cloud, on Azure, and they have data in 26 different systems. So those systems range from property management to the casino to the reservation systems.
The next-to-last one is that you believe that the self-service data platform must integrate many distinct functional capabilities. Have you found that your customers have migrated away from setups where they had multiple solutions to deliver this functionality, to Dremio providing it all as a unified system?
Yes, I think it’s not so much that we’re advocating, oh, you should not have an enterprise data catalog. Because, you know, companies have a lot of reasons for having these types of technologies. But our ability to combine a number of different things into a unified platform makes it possible to do things that otherwise wouldn’t be possible.
The last dogma was that you’re not really trying to be a visualization and data science and data discovery tool like R or Python or Tableau or Qlik. What you’re trying to do is deliver data to those tools and create data sets that can be used by them in an optimal manner.
That’s correct, for two reasons. One is that many of these tools are tools that the users love. When you talk to business analysts, they enjoy using tools like Tableau and Power BI, and the data scientists enjoy using things like Python and R. And the second reason is that in many companies, there are a variety of different tools that people want to use. So one person is an Excel user, another is a Tableau user, and a third is a data scientist building models in TensorFlow. If you want to build this data abstraction layer and a data lake engine like we’ve built, you have to be able to serve all these different classes of users, and they each have their own preferences. So for us it didn’t make sense, and doesn’t make sense, to be that end tool; it’s much better both for the experience of the users and for us as a business to partner with the various BI vendors and data science applications.
There’s a certain type of transformation you want the user to be doing once the data has landed in Dremio, and there’s another type you probably don’t want to be doing in Dremio. The way that I’ve always imagined this would work optimally is that if you had heavy lifting to do on the data, or really complicated ETL, you would use something like Spark or one of the ETL platforms like Talend or, if you needed to, one of the data prep or data quality platforms, depending on the problems you have in your data, to create data sets that are well behaved and of high quality. And then you would put them inside the Dremio repository, because you don’t want to be doing heavy-duty data cleanup inside Dremio. You told me last time we talked that there actually is a pattern emerging among your clients where people are using, you know, Spark, perhaps from Databricks, to do this sort of heavy ETL and then delivering that into Dremio. Is that pattern gaining momentum?
We’re seeing that across the majority of our customers, actually. This is no different from other industries even, where you have in the telecommunications industry or logistics and kind of supply chains where you have the long haul and then you have the last mile problem. And solving those two things are always very, very different in approach. So if it’s transportation it’s what is the long haul, how do I get it from one side of the country to the other side? But then the last mile is how do I deliver it to the houses? And it’s the same thing with data. So you have a need to do this heavy duty, maybe run these jobs every night to do some preprocessing of the data and that’s really done by data engineers and by IT, often using technologies like Spark and various ETL products. But then you have the last mile, which is no matter how organized and clean you’ve made the data, the various users in the company that are consuming the data, they have different needs, just for business reasons. They each need different perspectives of that data and there’s really no such thing as the gold copy. That’s a big deal because traditionally the last mile problem, the way it happened is that either the users were not able to get what they wanted or they were working around IT and downloading data into spreadsheets or into their own kind of databases or BI tools and IT was losing control and so you were kind of having this data governance and security issue.
I’ve noticed in lots and lots of implementations I’ve seen that the role of object storage is really increasing in terms of its importance. And a lot of people are moving data from source systems into object storage as the canonical format and then doing transformations to create multiple versions of that and object storage. I believe the rise of object storage is one of the reasons that the Hadoop ecosystem is having such trouble. Because the best thing about Hadoop was the scalable, low-cost HDFS file system and now that you’ve got that in object storage, you’ve really made that a lot less unique and a lot less differentiating. Do you find that your clients are increasingly using object storage as the repository they’ve used to load up Dremio?
That’s the majority of the opportunities where we’re engaged—opportunities where the data is sitting in an object store, or at least some of the data is sitting in object storage. Obviously, Dremio connects to many other sources as well and allows you to combine them, but the data gravity is in something like S3 or ADLS. That’s a big shift in the industry, and it’s impacted the Hadoop vendors of course as well.
The last thing I have is a challenge to you. If you imagine how Dremio works, you can land data in it and then create views that organize that landed data into modeled data for a particular purpose, data that might be reusable. Then, on top of that, using that data combined with whatever else you want, you can have purpose-built data. In Dremio you can capture this data and you can create these layers, but these layers aren’t really supported explicitly. You have, obviously, the ability to segment access to views, and you obviously have the ability to add notes and things like that. But the semantics of the data are separated; they only exist in the minds of the people using the data, or in some notes and metadata about the data. Have you thought about how you can go one step further and support the definition of that data product, along with some explicit semantics, so you can start doing a little more advanced reasoning on top of some of the data in the catalog, rather than having the semantics be completely divorced from it?
We see our customers and the users of Dremio creating these semantic zones in the data lake using Dremio. You’re right that companies will typically load data from various places into object storage or into HDFS, have the data there, and then use Dremio to expose that data for ad hoc analysis. But in order to make the data lake consumable, or consumer-friendly, and to make it intuitive, you have to have that semantic layer, where you can have things that are structured in a more business-friendly way and that have been validated by IT. And so what we see people doing is the IT team or the data team will use Dremio to create this semantic zone where they provide a set of official, governed, IT-approved data sets in the form of virtual data sets. And then they expose only those approved, modeled data sets, at least to some of the users. Now, those users can then go and build additional virtual data sets and an additional layer on top, which is their self-service collaborative layer. But IT’s ability to use this technology to create the semantic zone, where they can filter things, mask PII and create these very clean and elegant structures—those two things coming together in the semantic zone of these data lakes is really a killer capability for many of the enterprises that we’re working with.
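The PII masking described here is typically done with a view: the raw table stays locked down, and analysts only ever see an approved virtual data set. A minimal sketch using Python’s sqlite3 (the table, view, and column names are hypothetical illustrations, not Dremio syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, ssn TEXT, balance REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("Ada", "123-45-6789", 500.0), ("Bob", "987-65-4321", 250.0)])

# The "semantic zone": an IT-approved virtual data set that masks PII.
# Analysts would be granted access only to this view, never the raw table.
conn.execute("""CREATE VIEW customers_approved AS
                SELECT name,
                       'XXX-XX-' || substr(ssn, -4) AS ssn_masked,
                       balance
                FROM customers""")

print(conn.execute("SELECT * FROM customers_approved").fetchall())
# -> [('Ada', 'XXX-XX-6789', 500.0), ('Bob', 'XXX-XX-4321', 250.0)]
```

Because the mask lives in the view definition rather than in a copied data set, governance is centralized: changing the masking rule in one place changes what every downstream consumer sees.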