Open Source Data Management: A Podcast With Ahana

This edition of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR) focuses on data management. EAR’s Dan Woods leads a discussion with Ahana’s Steven Mih and Dipti Borkar about using SQL as an integration point, along with how to use object storage.

Their conversation covers:

  • 4:30 — Presto’s history and relevance to today’s data management landscape
  • 18:00 — How Presto can deliver a single repository to query against
  • 25:45 — How Ahana plans to be profitable on the support model
  • 35:00 — Why query engines are key to getting value from your data

Listen to the podcast here or read an edited version of the conversation below.


Woods: Will you introduce yourselves?

Mih: My name’s Steven, and great to be back on the Early Adopter show. I’m the cofounder and CEO of Ahana. We used to be at Alluxio and we love open source and distributed systems and the data space. 

Borkar: Dan, always a pleasure talking to you. I am Dipti Borkar. I’m the cofounder and chief product officer at Ahana. I come from a database background. 

Let’s get right to it. We’re going to talk about the challenge many people have in making sense of what’s become a very vast and wide data landscape. There are two problems I think we need to talk about that will help us understand some of the things you’re working on at Ahana and help people understand how to better organize their data supply chains. The first problem is that companies have a sprawl of data sources in many different places, and then a variety of tools being used to get at those. The question is how to make sense of this landscape so that all tools integrate with one system, and that system integrates with the data sources, so companies don’t have a cross product of dozens of integrations to support all the time. The second problem is how to adapt to a world in which object storage, the kind we have on the public clouds, has become a much more important way to store data, because it can be used by any type of engine once the data is stored there. How does that object storage become more useful and a more important part of the data management landscape? But first, would you talk about where you got the idea for Ahana and why you think Presto is relevant to both of these problems?

Mih: I can give some color from a market perspective. Early adopters listen to your show. There is a proliferation of data sources. Some people call these data silos, some people call them data lakes and databases. And there are more and more bespoke types of data management systems out there. We used to be at Couchbase 10 years ago, and that was the best-tool-for-the-job kind of NoSQL database. So there are many different places where data is sitting, and there are many more tools being associated with that. We saw that Presto is a tool that can be that single place, with SQL as the lingua franca. That problem was so apparent, and Presto was solving it and growing so much because of it, that we felt we wanted to be part of that space.

Just for people who don’t know, Presto is a distributed SQL query engine that was invented at Facebook. It was created to handle very large-scale SQL access to massive amounts of data. One of the things it did very well was allowing the repository to be Hadoop file storage, and eventually object storage as well. It was an open source tool that started there. The project spent a lot of time making sure it could do federated queries across many different repositories; that was a really important mission of Presto, along with being able to do super scalable queries in terms of volumes of data and being able to handle different types of repositories. Presto was essentially built to be this integration point and to handle these different kinds of object storage and data sources. You had a mission when you started Ahana of serving, as your first customers, people who had that problem of lots of repositories and who also had an appetite for using object storage. So describe what happened at some of your larger clients, the way they use Presto to solve these problems.

Borkar: Let me jump into that, Dan. In terms of Presto, it can be used in many different ways. Like you said, it can serve on top of HDFS, which is the Hadoop file system in the Hadoop ecosystem. It can be a layer, a query engine, on top of data lakes or object stores, as well as many other data sources: NoSQL, columnar data sources, streaming data sources, et cetera. In terms of usage, the usage varies depending on where the data is and where the deployment is, whether in the cloud or on prem. And that’s the power and the flexibility of Presto. Given that it connects with many different data sources, it is a federated query engine that speaks SQL, understands SQL, and translates that down to any data source underneath it. It can connect to data wherever it lives and give you the ability to query it in place, without moving that data to another location or creating a copy of it. Those are some of the most important things about Presto and what we’ve seen in these large-scale deployments. They could be on prem or in the cloud, but they query large quantities of data in real time and interactively, in an ad hoc way, which was not possible before. When Facebook built Presto, it was a replacement for Hive. And Hive is another query engine that sat on top of HDFS in the Hadoop ecosystem, which was meant to be a bit more of a batch interface and not really real time. You’re not talking about seconds, right? You’re talking minutes or hours. And Presto changed that by being in memory, by being fast, by being ad hoc, and all of these advantages, if you will, come together and make it one tool that can solve all these problems. Whether it’s BI tools, notebooks, or dashboards, all those tools speak SQL. They can talk to Presto and get access to that data across all their sources.
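
As a rough, hypothetical sketch of the federated querying Dipti describes, the query below joins a table in a “hive” catalog (Parquet files on object storage) with a table in a “mysql” catalog (an operational database). All catalog, schema, and table names are invented for illustration; the point is that Presto addresses each source as catalog.schema.table and queries the data in place.

    -- Hypothetical names throughout; each catalog maps to a configured connector.
    SELECT c.customer_id,
           c.region,
           count(*) AS page_views
    FROM hive.web.clickstream AS v      -- Parquet files queried in place on the object store
    JOIN mysql.crm.customers AS c       -- rows queried in place in MySQL
      ON v.customer_id = c.customer_id
    WHERE v.event_date >= DATE '2020-01-01'
    GROUP BY c.customer_id, c.region
    ORDER BY page_views DESC
    LIMIT 20;

The same shape of query works whether the underlying sources sit on prem or in the cloud, since only the connector configuration changes.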

And that’s the first use case that’s driving adoption of Ahana: I’m at Uber or some other large company, I’ve got dozens of groups using data, and the IT department is getting sick of having to connect those dozens of groups to hundreds of data sources. If you have a new way of using data, whether it’s Tableau or Qlik or Jupyter, you connect it to Presto, and if you have lots of datasets, you connect those datasets to Presto. All of a sudden, that makes IT’s life much easier. That integration point is what’s driving the first wave of adoption of Ahana.

Borkar: Exactly.

There are a bunch of problems inside this that I wanted to talk about, which are related to object storage as well. Most SQL repositories have some sort of metadata repository, and Presto can generate the SQL to grab the data out of those repositories. But if things are in an object store, Presto has to have somewhere it can learn the structure of that object storage, and that’s where the Hive catalog comes in. Could you explain how the Hive catalog allows Presto to do its job with object storage?

Borkar: I believe that your data is only as good as your metadata. Because if you don’t have access to the metadata, you don’t really know what your data is and you don’t know what to query, right? In the object store world, given that it’s based on files and not records and tables, you essentially have different file formats like Parquet or ORC, and these are the files that represent the structured data sitting in the object store. The Hive metastore, which is one of the components of the Hadoop ecosystem, is kind of an uber catalog. It can hold the metadata for the tables that are represented in the object store underneath. There are other catalogs, like the AWS Glue catalog in the cloud, that are also getting popular, so there are a couple of different approaches to managing metadata for the object store itself. Now Presto, like any query engine or database, needs access to the catalog so that it knows what tables and what columns to pull out of the data sitting in the object store. So you have to actually connect Presto to an external catalog, a Hive metastore that’s being managed somewhere that has this information, or AWS Glue sitting somewhere, and once you connect it, you can then create tables or read and query data through the Hive catalog, because you understand the representation of the data underneath it. And the way this works with a typical tool like Tableau, with Ahana as an integration point on top of Presto, is that Tableau asks Presto for the metadata for these tables. Presto then goes to the underlying store, gets the metadata, and delivers it back up to Tableau. So Tableau only has to know how to deal with Presto’s metadata interface.
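
As a minimal, hypothetical sketch of that catalog flow, the statements below assume a Presto “hive” catalog already configured against a Hive metastore or AWS Glue, plus an invented S3 location. The CREATE TABLE records the table’s metadata in the catalog; after that, any SQL tool can query the Parquet files through Presto.

    -- Hypothetical names; the catalog must permit creating tables over external locations.
    CREATE TABLE hive.analytics.page_events (
        event_time   timestamp,
        customer_id  bigint,
        url          varchar
    )
    WITH (
        external_location = 's3://example-bucket/page_events/',
        format = 'PARQUET'
    );

    -- A BI tool such as Tableau would then issue ordinary SQL through Presto:
    SELECT url, count(*) AS hits
    FROM hive.analytics.page_events
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;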

Underneath this Presto abstraction layer you can do different things, like move repositories from, say, a Teradata database into object storage, or from object storage into a high performance SQL database like Teradata. You can go both ways depending on your use case. What kinds of things have people been doing once they have this integration point in place?

Mih: There are people who are absolutely looking to offload workloads that are on legacy systems, and instead of necessarily copying the data all into a data lake and running an engine like Presto on top of that, they can query it in place and then move it over time. This starts to change the dynamics of these vertically oriented legacy systems in terms of what they’re used for, and we think that’s a pretty exciting use case. But there’s no better architecture than a federated architecture, because it doesn’t require you to copy the data from one place to another just so you can do hypothesis testing on it and run it forwards and backwards. So we think that’s the first use case from which any company can derive a lot of value. Instead of queries taking hours in batch-mode approaches, Presto gives you seconds of interaction, seconds of response time, and that is really some of the best value.

Ahana is a commercial distribution of Presto. You sell support for the Presto distribution. At this point, you don’t have premium features or anything like that; you’re basically selling a supported version of Presto. And then there’s a new foundation that has been created by Facebook that is trying to bring governance to the Presto project.

Mih: Number one, we’re an early member of the Presto Foundation. Both Dipti and I are very closely involved with the Presto Foundation, which is hosted under the auspices of the Linux Foundation, kind of like the CNCF is hosted under the Linux Foundation. The whole purpose of the Presto Foundation is to build an ecosystem of companies that are involved in supporting the open source project. Just as Kubernetes became very widely adopted and supported by many, many companies, that is the goal of the Presto Foundation. We are absolutely focused on open source. And yes, we provide support for companies that may not have the operational experience, or the goal, of maintaining a cluster themselves. So we offer very easy-to-use versions on the AWS cloud and provide commercial support for those customers, or they can just use those versions for free and run them themselves. That’s what we offer today.

When you use the Hive metastore, you don’t have to have an HDFS file system anywhere, correct?

Borkar: That’s right.

So now, if you look at the problem, there are a couple of tricky parts to this that are being dealt with differently by different technologies. One is the idea of federating queries across repositories. The point is just to deliver up, through Presto, one repository to a consumption point. It’s not to say I have five SQL repositories and I’m going to do a federated query over those repositories. Do you think that’s because the way people have stored data up till now has been all about putting everything that’s related into one SQL repository, and do you think eventually there’ll be an appetite for a real federated query?

Borkar: Great question. We’ve worked with users over the past few years, and what we’ve seen with customers is that every organization has a method of organizing their data. Each business unit might have a different repository or different data source. So depending on the organization, you may have different data sources, and that’s kind of the starting point, where you basically have different teams accessing different data sources through Presto, and maybe not as many cross-data-source joins. But increasingly, to be data-driven, you have data science teams and data analysts, and what I’ll call the data hacker analyst, who want access to all of this information across the enterprise. It could be enterprise data, clickstream data, telemetry data, behavioral data. And that’s where some of these cross-data-source, cross-database joins come in, to truly be data-driven and get more insights into the data across the entire enterprise. So it might start off simple, because the platform team wants to enable access to the data in a flexible way with one system to serve it all up, but over time we see that changing: as organizations get more data-driven, they want to access and correlate various data sources and get more insights from them.

This is where object storage comes in as a really interesting solution, because let’s say you have data in 20 different repositories. You can do the data engineering to bring it out of those repositories and create a set of well-behaved objects in object storage that integrate all of those 20 different repositories into a coherent form, and then you can query against those from object storage. So with a two-step process, you can pretty much integrate anything you want and then serve it up from object storage. Have you found that anybody is doing that sort of pattern?

Mih: I’ll just comment that we’re seeing object storage become the lowest-cost place to put your data. You could say AWS S3 is the biggest storage system in the whole world now, and everyone has picked up the S3 interface as the way to do it. What we see happening is many data lakes: although people call it a data lake, there are usually multiple ones, with different departments each having their own. We see customers using that as the place for most of their data, and they run Presto against that. But then when they need to do some ad hoc analytics on another piece of data and they don’t necessarily want to put it in their data pipelines and ETL it, they can use Presto to connect to those other data sources too.

Borkar: This connects to the concept of data marts. They used to be called data marts because you had these big data warehouses and then you wanted a clean, small version of the most useful data, and you kind of put it in OLAP cubes. You can use many different systems now to store those data marts, those cubes, that clean data, and the data lake becomes a good option. It’s a cheap option, but it’s not necessarily the highest-speed option.
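
As a hypothetical sketch of the two-step pattern Dan describes, a single CREATE TABLE AS statement in Presto can pull data out of an operational source and materialize it as Parquet on the object store, where it then serves as the curated data mart. The “postgresql” catalog and all table names are invented for illustration.

    -- Step 1: engineer data out of an operational source into Parquet on the data lake.
    CREATE TABLE hive.curated.orders
    WITH (format = 'PARQUET')
    AS
    SELECT o.order_id,
           o.customer_id,
           o.order_total,
           o.order_date
    FROM postgresql.sales.orders AS o
    WHERE o.order_date >= DATE '2020-01-01';

    -- Step 2: downstream tools query the curated copy from object storage.
    SELECT order_date, sum(order_total) AS revenue
    FROM hive.curated.orders
    GROUP BY order_date;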

One of the reasons it was so difficult for people to create a thriving business in the Hadoop universe is that there was an open source project that was supported by three different companies and was also usable as plain open source. So a pattern was created where people were using Hadoop, and if they were using it in very large, sophisticated companies, they would just support it themselves, because there was a thriving community and they could do that, and they wouldn’t buy support from any of the three distributions. And when it came to the supported distributions, the distributions argued with each other and created three different products. Now already we’ve got Ahana as a Presto distribution, and we’ve got the other Presto distribution that’s been around for a little while that came out of Teradata. How are you guys going to make money?

Mih: At Ahana, we’re packaging up the open source PrestoDB. We don’t really call it our own distribution, because while we put it together, it is the exact PrestoDB open source project. It has been open source since 2013 and has been pretty widely adopted by data-driven large internet companies like Twitter, Pinterest, LinkedIn, and others that have pretty big data platform teams that analysts and scientists use. We’ve found that there are a number of people who would like to get the same benefits but need commercial support for it. If you’d like that help, the software stays under its open source license, and you can just purchase support from us through a flexible support model. I’d say that is something the market needs and what customers are asking for. Now, there is an alternative, which is that you can go get the proprietary version of Presto and pay for an enterprise subscription to that. And we know how that model goes too: it has more of the people trying to get you to use the proprietary stuff, and once you use the proprietary stuff, they say, we’ve got you, and you want to continue using it. I know how that goes. We don’t have that today. We’re seeing that there is a group of people who are comfortable with that approach. We want to be the folks who support PrestoDB, and we welcome people to use it for free or work with us as people who contribute to the PrestoDB code base and are involved with the Presto Foundation.

In order for Ahana to give more value to its customers, at some point you’ll probably want to solve problems that are related to the core problem of using Presto. One problem is that the Hive metastore was never intended to be an object storage catalog; it was intended to be a metadata store for HDFS. You’re also talking about being an integration point, and if you’re an integration point, then you have to be able to let people find the data they’re interested in. There have been a bunch of different startups in the business of being that catalog, the discovery catalog, not just the metadata catalog. Finally, the last area is that it would be great to have a federated query that wasn’t just bound to object storage, to make it a one-step process. Right now, having to do the data engineering, put all the engineered data into object storage, and then run queries off of that is a perfectly fine way to do it. But a real federated query would allow you to actually choose data from any of the sources and then create a query against that. And that requires a different kind of optimizer than really exists in most systems today. The question is whether there’s an appetite for it. What do you think about the urgency of solving all of those problems? Which do you think are the most urgent? Which do you think are going to be the most popular to solve?

Borkar: Great question. With the separation of storage and compute, where the compute and query engine sit on top and the storage layer, the object storage, is completely separated out, we’ve recreated the same problems the database world has already solved, all over again. The database stack was a fully integrated stack: it had the catalog, it had a buffer pool, it had a transaction manager, all of these things you needed to be able to understand the storage layer underneath, query it, optimize it, and do it in a performant way. We now have a catalog that is a separate system. We are starting to see table format frameworks like Hudi and Delta Lake, which are more like transaction managers as a separate piece. All of these are just small pieces that the query engine typically works with. So if Presto is the core, the heart of the system, it needs access to the catalog, it needs access to the transactions, it needs access to other data. It could be a good way to connect all of these together to make it easier for the users, because once you’ve solved that, understanding and using data lakes and object stores becomes significantly easier. I’m particularly interested to see how the database stack works in a disaggregated environment. Because you still need all these components, and how do you get them to work together so that the users don’t have to figure it out? Otherwise, instead of one database, you now have to deal with five different systems and integrate them together and figure it out yourself. It’s a continuum, where you start with some major optimizations that give you a big bang for the buck, and over time you optimize a little at a time.

Do you think that we’re going to just see a lot of competition on the metadata catalog for object storage? Is the catalog going to be the next point of competition for ownership of the customer?

Borkar: You have the whole data discovery, metadata catalog, MDM market. I see that as a different market, because that’s about figuring out the master data across the entire organization, not just object storage. And just for object storage, the two I mentioned, the Hive metastore and AWS Glue, seem to be the most popular today; there aren’t really any others coming out. And it’s just a means to an end at the end of the day. The value customers want to get out of their data is what the query engine gives them. The query engine gives you the answers you need, and the catalog, the transaction manager, et cetera, are pieces of the puzzle you need to get to those answers. I tend to say the query engine and the optimizer are the heart of the database, because they are the brain in some ways; they know exactly where to go and what data to pull. And so it’ll be difficult for individual companies or projects to be successful by themselves. They’ll probably be consumed as a part of the query engine.

If you look at all the different ways you could use object storage and an integration point, you have Dremio, where you can put Dremio in front of object storage. You’ve got Snowflake or Actian, which both have databases that you could put in front of object storage. You’ve got environments in which you can use object storage as a file system, like Okera, where you can make object storage look like tabular data so you can access it that way. You’ve got things like Qubole, which are all about integrating a bunch of open source projects together and making that more useful. What do you see as the strengths and weaknesses of that landscape, and why is Ahana in a sweet spot?

Mih: One of the things we saw is that Presto, as an open source project, is the fastest-growing project in the data analytics industry right now. On DB-Engines, which we’ve been following for over a decade to see how these different data management systems are rising, Presto is the fastest growing, at number 25 among relational database management systems, and it’s jumping by leaps and bounds. If you compare that to any proprietary system, open source is the way to go here. And if you compare Presto to other open source projects, some of the companies or projects you mentioned have open source projects underneath, but Presto has by far already been picked as the most modern and most performant. And it fits the federated architecture that’s needed in the heterogeneous world we’re moving into, with on-prem, cloud, and multi-cloud.

Presto is Kubernetes, not Mesosphere, in terms of an analogy to the container management space. 

Mih: Yes, you could say that, and it’s certainly not VMware proprietary stuff either.

Borkar: We truly believe in some fundamental assumptions. Number one, we believe in a community approach, which is the Linux Foundation approach: as a community, we can build a better engine, because there are more hands on it and there’s more innovation. Number two, we believe in the disaggregated stack, which means you query in place and you don’t ingest the data into the system, like a Snowflake, which is a more traditional data warehouse approach. With Presto, we believe in in-place query, as opposed to ingesting into the system. The third thing is that data will fundamentally be distributed. It will be nearly impossible to have all your data in one place, whether it’s a data lake or S3. We tried that; it was called Hadoop. Fundamentally, because of polyglot persistence, because you need specific types of databases for specific problems, data will be distributed across different flavors. And that’s why the abstraction that is Presto, with SQL as the lingua franca, exists, and it will only grow from this point on.

Excellent. Well, this has been a good discussion. We’ve covered a lot of interesting ground, as usual. Thank you so much for joining me, Dipti and Steven from Ahana.