Helping Out the Data Engineers: A Podcast with Alluxio’s Steven Mih

On the latest episode of the Designing Enterprise Platform Podcast from Early Adopter Research (EAR), EAR’s Dan Woods spoke with Steven Mih, the CEO of Alluxio, a data orchestration platform that came out of the AMPLab at Berkeley. It is Mih’s second appearance on the podcast; the first episode covered the core value proposition of Alluxio and related issues around the evolution of the data platform and open source. In this episode, Woods and Mih discussed what it’s like for an early adopter trying to create a data platform in the modern environment. Their conversation focused on creating a data layer that provides an abstraction in front of many different data sources so that companies can, in an orderly fashion, refactor what’s underneath and move it to where it’s optimally stored and optimally delivered. This is what Alluxio was designed to do.

Their conversation covered:

  • 2:00 — Creating a data layer that utilizes object storage
  • 7:00 — How Alluxio approaches platforms from a compute-optimized perspective
  • 18:00 — How Alluxio helps data engineers
  • 21:00 — Alluxio use cases


This is an edited version of the discussion that can be heard in full on the podcast.

Woods: It seems to me that the next level of data platform is going to be one where all of the data stores you have, from data lakes and data warehouses to special-purpose data stores, APIs, and feeds, deliver data, and the consumers currently have to deal with the actual mechanics of grabbing that data. It would be much better if you had a standard data access layer that allowed you to do whatever you needed to do. In addition, it would be great if you could hide some of that data engineering work underneath that layer as well, so you could produce an orderly set of landed data. To me, there are three layers in most data platforms. There’s the landed data, where the data arrives; it’s often transformed in some way, but then it’s put in some repository in a way that is now ready for further use. The next layer up from that is where the landed sources are turned into standard data. That’s often modeled data, where you’re modeling, say, all the customer information that you have. Then on top of the standard data, you have the purpose-built data that is used for anything from a Tableau or QlikView dashboard to an AI workload to an application. What people are now facing is how they can create some layer over the landed data so that you can have movement underneath. Now, Alluxio has a perspective on that, and it’s one that allows what I think is important, which is shifting more of your data into object storage as its canonical form. How would you react to that summary?

Mih: When we look at these different types of data, the landed data is usually your source of truth. That’s because you’ve got lots of producers of data and you’re landing data, and a lot of that landed data is about storing the data efficiently. There are lots of technologies that have come through the evolution of IT, and object stores are becoming the lowest common denominator because it’s all about durability and cost per byte. Object storage, thanks in part to AWS S3 and all the S3-compatible stores that came after it, has really proven to be the lowest cost per byte and the most flexible, easiest place to put your data. When you say that there’s standard data and then there’s purpose-built data, these are more silos of the same data, but they’re transformed slightly to be more consumable. Instead of the producers, it’s the consuming engines. And there are many engines that an advanced analytics team or AI team will want to use: Spark, Spark SQL, Presto, TensorFlow, PyTorch, as examples.

Increasingly, all of those sources are using native object storage as the place where that data is stored.

Yes. I think there are two ways to have the engine work on the data: either it’s working directly on those object stores, or it’s using the object store as a backing store and you ingest the data into something like Snowflake. These consumers of data are usually open-source or as-a-service systems, and they have different formats that they expect. Some are standardizing on the Parquet format, but the objects may not be in Parquet, and that’s where there’s some impedance mismatch, whereby you’ll have to transform and make standard data or purpose-built data. Alluxio is approaching the world not from a storage-optimized way, but in a way that is compute-optimized: we look at delivering data after it’s landed and delivering it in the best, most efficient way for that particular engine. So in the example of Presto, we’ve added a catalog service that lets you transform data, such as taking a CSV file and making it into a Parquet file, and coalesce files together or sort tables so it’s better for that specific engine.

And then can you have other stores underneath it, such as SQL stores and other NoSQL database stores and things like that?

For Alluxio, we’re focused on analytics and AI, and that’s typically files and objects. Oftentimes those NoSQL stores will still get ingested into your data lake, or data lakes. And that’s becoming much more spread out: people have data on prem, and people have it in the cloud. So Alluxio is a way to modernize this system by mounting those into an Alluxio layer, and it’s all about delivering the data to the consumers versus landing it.

Right. The most flexible, cheapest foundation for a data platform is to say you may need the NoSQL databases, you may need Snowflake and data warehouses, but if you want to create one data lake equivalent without locking yourself into one cloud or giving up the ability to store things on premises, what Alluxio allows you to do is declare an abstraction layer underneath which you could have data in Azure, AWS, or Google Cloud, all in their object storage layers. You can also have data on prem, in either files or object storage, and then you can deliver data from your data warehouse into object storage.

You got it, yes.
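To make that abstraction concrete, here is a minimal sketch of what a consumer sees, assuming a Spark installation with the Alluxio client on its classpath; the master hostname, mount points, and column names below are hypothetical. The engine addresses one logical namespace through the alluxio:// URI scheme, while each path underneath may be backed by a different cloud or an on-prem store.

```python
# Minimal PySpark sketch (hypothetical hostnames, paths, and columns): the engine
# reads one Alluxio namespace; which store actually backs each path is handled
# by the mounts underneath, not by this code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-namespace-sketch").getOrCreate()

# Assume /data/ad-events was mounted to a cloud object store and
# /data/warehouse-extracts to on-prem storage when the namespace was set up.
events = spark.read.parquet("alluxio://alluxio-master:19998/data/ad-events/")
customers = spark.read.parquet("alluxio://alluxio-master:19998/data/warehouse-extracts/customers/")

# The logic is written once against logical paths; if the data later moves from
# on-prem files to object storage, only the mount changes, not this job.
events.join(customers, "customer_id").groupBy("region").count().show()
```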

You said there are two things you’re doing that are advancing Alluxio. One is a transformation service. The other is a catalogue that unifies all your data in one place so you can see everything you have. When you’re dealing with the data that’s coming from the canonical sources into the landed form, that’s not a self-service world. That’s a world in which you have data engineers who are programmers dealing with it. The tools that you’re talking about for the data transformation layer, are these data engineering tools?

It’s for the data engineers, yes. There are not enough data engineers in the world. There are more and more data analysts and data scientists, and they are running more workloads that these data engineers need to manage. We work with a pretty large group, a martech company down in San Jose, that has 30 data platform engineers. But they’re swamped: they have 1,800 analysts to support, and that’s only growing. Meanwhile the data engineering team budgets are pretty flat, and they’re not going to be able to grow their Hadoop cluster anymore. Their number of petabytes is increasing by 70% a year, and their budgets are maybe increasing by 5% or 6%. They’re under siege and need better tools to provide data sets from the landed data. Data then becomes a product in the form of a dashboard that everyone keeps using, so it becomes a business process. It becomes ingrained in the business, and that data is always pulled. But there’s a lot of discovery that needs to occur first, so the data engineers have to deal with people doing ad hoc discovery so that they can then solidify what the right dashboard is, or what the right model is that they’re working on.

Do you offer something that’s similar to the Paxata or Trifacta tools for data engineers, or is it more at the Talend level, with ETL programming?

It’s more the latter case. It’s working with mounting and integrating these different data sources, and then the metadata associated with them. So to answer your question, one of the new things we’ve added, the fourth big component of data orchestration, is the Alluxio data catalogue. It takes the Hive Metastore and abstracts it, as well as your AWS Glue Catalog and the other catalogues from whichever cloud provider you’d like to use. In the same way that we have the unified namespace, now we have the unified metadata catalogue. And that gives us the ability to do much more optimization for compute engines like Presto. Beyond mounting those different data sources in different environments, we have the policy-based data management to move the data around, but the great thing the data engineer gets that’s different from ETL is that we’re all about making sure the hot, active data is brought to those engines. Doing that uncovers what the important stuff is, and, in a way, that becomes the purpose-built data.

So you have the caching of this data so you can deliver a performance SLA, not just an availability SLA?

That’s right. And what we’re finding is that people are excited about the solution of zero-copy bursting of analytics into the cloud. There are many Hadoop teams on prem that are faced with what I mentioned: they have more workloads, but they can’t just continue to buy more Hadoop servers. So they would instead like to start using the cloud, but to just run the same workloads in the cloud, you still have to figure out what data goes into the cloud. That’s a copy situation. With Alluxio, it’s zero copy because of our intelligent caching layer: it’ll pull the data, offloading that on-prem HDFS, allowing you to run a thousand Spark jobs ephemerally with Alluxio there, or a thousand Presto workers with Alluxio there, without even needing to know what data to copy. And we have quite a few great enterprises that are doing this today.
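As a rough sketch of the zero-copy idea, under the same assumptions as the earlier example (hypothetical hostnames and paths, Alluxio client available to Spark), the contrast is between copying data into cloud storage before bursting and simply pointing the ephemeral cloud cluster at the same logical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zero-copy-burst-sketch").getOrCreate()

# Copy-based bursting (for contrast): decide what to copy, bulk-copy it to a cloud
# bucket, then point the job at the copy.
# sales = spark.read.parquet("s3a://burst-bucket/warehouse/sales/")

# Zero-copy bursting: the cloud cluster reads the same logical path that fronts the
# on-prem HDFS mount; Alluxio caches hot blocks in the cloud on first access, so
# there is no up-front copy step and no need to guess which data the jobs will touch.
sales = spark.read.parquet("alluxio://alluxio-master:19998/onprem/warehouse/sales/")
sales.filter(sales.quarter == "2019Q3").groupBy("store_id").sum("revenue").show()
```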

Right. And the idea is it’s the object store read-write paradigm, where you read a bunch of objects and then you write back new objects. It’s not like you’re updating the object storage in place.

Yes, and we can do that too, either synchronously or asynchronously. But now say you’re migrating to the cloud: that compute is in the cloud, and if you mount the cloud object store, you can write to it there. Then you have a seamless way to migrate.

So what you offer somebody is this: if they’re building a data platform, and they want to be able to use object storage more broadly and abstract away some of the choices they’ve made about using data at scale, such as using Hadoop servers, you are able to put in a layer that understands object storage in any of the clouds and create one single layer, with a catalogue, that they can now rely on?

That’s right.

Underneath that, if you decide you want to get rid of your Hadoop cluster, and put stuff in native object storage, you can do that, but your consumers never really need to know, and never really care.

That’s right. Alluxio abstracts that so the consumers don’t have to know, but Alluxio knows, now that we have that catalogue, so we can optimize, for example by coalescing the data if there are many small files. There are three major transformation services that we’re announcing. One is coalescing, so small files can be coalesced into a small number of big files. The second one for those data engineers is being able to transform the format. A lot of data is in CSV format, and CSV is not very easy to consume, so Alluxio will transform it into Parquet. The third big item is around sorting. If a table has a date column, you can sort it by the date. Then if a query says, “I want Q3 of last year,” it will only pull Q3 of last year, because the data has been sorted for that. By doing all these things together, we’ve found that it doesn’t matter what format the data was in when it landed, which storage system it’s in, or where that location is. We can now provide local, compute-optimized performance, up to 5x the performance.
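To show what those three transformations amount to for the data itself, here is an illustrative PySpark stand-in. This is not Alluxio’s transformation API; it is only a sketch of the equivalent work on a hypothetical table of landed CSV events, with made-up paths and column names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformation-sketch").getOrCreate()

# Landed data: many small CSV files (hypothetical path and columns).
raw = spark.read.option("header", True).csv("alluxio://alluxio-master:19998/landed/events_csv/")

(raw
    .sort("event_date")   # sorting by the date column lets a "Q3 of last year" query prune to just those rows
    .coalesce(8)          # many small files become a small number of big files
    .write
    .mode("overwrite")
    .parquet("alluxio://alluxio-master:19998/curated/events_parquet/"))  # CSV becomes Parquet
```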

The idea of your data transformation service is about having standardized large-scale transformations that are popular with data engineers, that they can just do. And then gradually, over time, that library will grow?

Exactly. We’re trying to have the data engineers do less wrangling and deal with fewer silos. It’s at that scale, versus, I would say, higher-order ETL.

Where has this made a difference in terms of people doing things that they couldn’t do before that really provided business benefits?

When you start to talk about up to 5x efficiency gains, those hundreds of data analysts can now run that many times more jobs. Or the data scientists that are creating models can create them that much faster for the business. This is what we call TTI, time to insight. You now have a zero-copy way of doing these things that improves the time to insight, so the business benefit is that the business has better insights and can run more jobs at lower cost, and your data engineers are also more efficient and happy about the tools that they have.

Alluxio also makes a standardized layer under which you can make moves. Who are you finding that’s interested in this platform? What are the use cases?

The types of companies that use us are enterprises that have on-prem infrastructure, maybe multiple data centers, and are adopting the cloud. And that’s pretty much every single company. They have to deal with data engineering problems around, “Okay, how do I give my internal customers, my end users, a good experience and self-service data?” And that is the concept. What we find is that people want to make sure they can expand their data-driven initiatives and do that in a way that isn’t the Hadoop paradigm, or the HDFS paradigm, but a modern architecture that takes into account multi-cloud, multiple consumers, and, of course, a growing data set.

Your metadata catalogue asserts a tabular format over this object storage. Now, is this something that extends what’s going on in the cloud catalogues like AWS Glue?

We’re combining them. We’re not extending them today, but by abstracting both of them, we now have the ability to see the metadata wherever it is. You can run federated queries against different data sources and bring them all together. Before, you had to copy all the data, ingest it into one big lake, and then run everything there. But now the view is to keep it more disaggregated, across multiple silos. This whole idea of being able to run your query in place, wherever the storage is, is a very powerful concept, and Alluxio is a big factor in delivering that data from remote places to the engines.
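For a rough picture of what “query in place” looks like from the consumer’s seat, here is a sketch using the community presto-python-client. The coordinator host, catalog names, schemas, and tables are hypothetical, and it assumes Presto has been configured with a connector for each source, with Alluxio caching the remote data underneath.

```python
import prestodb  # from the presto-python-client package

conn = prestodb.dbapi.connect(
    host="presto-coordinator",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# One federated statement joins a table backed by cloud object storage with a table
# from a second, hypothetical catalog pointing at an on-prem source, with no step
# that first copies either data set into a single lake.
cur.execute("""
    SELECT o.region, count(*) AS orders
    FROM hive.sales.orders o
    JOIN onprem.crm.customers c
      ON o.customer_id = c.customer_id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```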

It seems like you should be able to find some good use cases in the IoT and manufacturing space, where they have a lot of distributed data sources. Have you found that you’re helping to unify edge data or manufacturing data?

We’re still more in the core right now, and I think we’ll get to that place. We have a few use cases like that, of course, because our software is open source and we have hundreds of deployments all around the world. There was an IoT use case recently where a self-driving-car startup was using Alluxio in that way.

We haven’t seen a mass retirement of a lot of the Hadoop infrastructure yet. But given the pressures that Cloudera’s putting on people with the higher maintenance costs, and the continued adoption of S3 as a standardized API, it seems like we’re moving toward an era in which there’s going to be a replacement of a lot of the Hadoop infrastructure with an infrastructure that’s based on object storage. Have you seen that? 

Yes.

What are the other ROI wins that you’ve seen with Alluxio?

There’s a lot of talk where people say, “Hadoop’s dead,” or, “Hadoop’s not dead,” and what’s going away is the idea that HDFS is the best, lowest-cost-per-bit storage system. It’s not, because object storage has proven to be lower and lower maintenance, and it’s effectively infinite if you use the cloud. So that war has wrapped up. But Hadoop still lives widely in the different types of engines it includes. What I think is changing is that the architecture is going from a co-located architecture, where you’re running the engines co-located with HDFS, to breaking that up. We see plenty of customers that were very big Hadoop customers, and they still maintain that, but they’re not going to grow it anymore. They’re looking to offload it, in the same way that we see with Teradata. There will always be vertically oriented data appliance systems that are really awesome at what they do; they’re just really expensive and maybe somewhat difficult to operate. I think HDFS is going to go the way of the dinosaurs in that same way: the stack will evolve into a disaggregated one, with much more data in different places. As for ROI, when you have a modernized data platform that uses an orchestration or abstraction layer, it needs to create a localized access layer wherever your compute is. That gets you ROI from a cost perspective. But more importantly, it’s about that time to insight. It’s a top-line benefit that’s hard to see, but most of the people we deal with say that as long as the data scientists and analysts are happy, then we know we’re doing our jobs right. As long as the company is being data-driven and getting the best data it has, that’s where the ROI comes from.

But you’re not yet to the point where the CIO at the end of the year says, “You know, we were able to retire x systems of this sort because we have now adopted object storage, and we’re able to do it in an orderly manner”?

What we’ve mainly seen is offloading when there’s a certain cost structure with the legacy system. Taking workloads off of that into a much lower cost structure, that’s where the ROI appears. 

Let’s go back into your database history, and let’s address that question that you mentioned about the role of the data warehouse. It seems that there’s an interesting problem that’s going to arise. Some companies have seen it and have a solution for it, and some don’t. If you put everything in object storage, and you have essentially a bunch of flat files that are out there for use, you then can do simple queries out of Presto. But if you are talking about having thousands of users and thousands of tables, now you’re in a different world, and Presto would suck at that, because its query optimizer is really about big, large batch queries. Maybe they would disagree, but if you look at something like Teradata, it’s got 30 years of work on that problem of thousands of users and thousands of tables, without having to have a query cop stopping people from doing stupid stuff. Now, what replaces that? Because as your data platform grows, you’re going to have more and more people using data, and it seems like it’s going to be really hard to get away from SQL.

First off, I think Hive was more for batch, and one of Presto’s four main use cases is ad hoc discovery of insights. That was proven by Facebook, and they run huge, huge clusters of this stuff. When I look at a legacy system like Teradata, it was really a scale-up system. Now the world is all distributed systems, and many are open-source projects. You can have a thousand-node or multi-thousand-node cluster that scales horizontally, which takes that bottleneck of having a hundred thousand queries going on in a day and scales it all out. So I believe that’s been addressed. Now, at the object-storage layer, we do see problems, because these object stores weren’t necessarily built to deliver data to that many concurrent queries. And that’s where having the data orchestration, where we locally cache the information that’s needed within each worker node, has even more value. Some of the biggest companies, like Walmart, use Presto and Alluxio, and they talk about it at their meet-ups, because the Alluxio layer lets them handle the level of concurrent queries that their organization needs. So I’d argue that if Walmart can do it, then many other companies probably can do it.

You’ve got a lot of these systems that are being used by thousands of people, where the SQL is being written by a machine, and it’s really sophisticated SQL. Teradata does a great job of optimizing it and delivering a snappy experience. 

I think it’s a matter of time, because this goes back to database architecture fundamentals, and one of the components of a big data warehouse was the buffer pool. Alluxio, in a sense, is a scale-out version of the buffer pool. And the cost-based optimizers that sit inside these vertical data warehouses are being created again; they’re just being redone 10 or 20 years later.

So to the extent that you can get away with a worse cost-based optimizer if you do a better job with your buffer pool, Alluxio can allow you to have crappier SQL optimization because you have better data delivery.

Yes. And you’re going to need all of it, you’re going to need everything optimized. 

Thanks. This has been another episode of the “Designing Enterprise Platforms” podcast from Early Adopter Research.