The Power of Storage Orchestration: A Podcast with Alluxio’s Steven Mih

On this edition of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), Dan Woods, the founder and principal analyst at Early Adopter Research, speaks with Steven Mih, the CEO of a new company called Alluxio. Alluxio comes out of the AMPLab at UC Berkeley, which brought the world Spark and many other products for processing data at scale. The AMPLab has created a stack with a variety of components; the project was originally called Tachyon and has become, in its commercial form, Alluxio. They spoke about the issue of separating storage from the compute engines themselves. In the cloud, companies have an increasing number of options for keeping data in object storage and applying a variety of compute engines to it. But that leads to having storage in many places. Some of that storage may be on something like S3, some of it may be on premise, some of it may be in another cloud. The solution that Alluxio proposes to this challenge is what it calls storage orchestration.

Their conversation covers:

* 4:30 – How to manage many different data storage platforms
* 16:30 – Powerful use cases for Alluxio’s technology
* 22:00 – How Alluxio fits into the open source world

You can listen to the podcast or read an edited version of the conversation below.


Woods: In my view, what Alluxio’s doing is part of a larger challenge facing most companies who are processing data at scale, and that is how to understand the role of object storage in this new data supply chain. Steven, you were present and observing all through the evolution of Hadoop and Spark, so what I want to do right now is just go over that kind of archaeology of the technology to explain this. Now, I’m assuming you have well-developed opinions about how the data infrastructure and the problems have proceeded since Hadoop really inaugurated the big data era. 

Mih: Yes, glad to share some perspective on how it’s evolved very quickly in the past few years. Hadoop originally was about MapReduce. The original concept was to collocate the compute and storage in one worker node that could run on commodity hardware and scale out infinitely. Then new frameworks came along: it moved from MapReduce to Hive, then Spark, and new data-driven application frameworks are coming out every three to five years. Now it’s easy to get compute in the cloud; however, the data storage is an area that has to either be moved or accessed in a different way. We’re helping with hybrid cloud analytics, and we call that data orchestration.

But to get back to the separation: at first, compute and storage were collocated, and what that means is that, yes, you can farm out all the work and then bring it all back together. However, to scale that cluster, you have to scale it across two parameters: the amount of compute you need and the amount of storage you have. And that starts to become unwieldy, as organizations typically found that one of those parameters was much more predominant than the other. The new architectures that are evolving have a separate cluster or tier for compute, which could be Spark or Presto or the Starburst version of Presto, and a separate storage tier, which could still be HDFS or could evolve into an object store. The cloud now lets you separate out just the compute, or run both compute and storage there as separate tiers.
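The two-parameter scaling problem described above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the conversation: the workload size and per-node capacities are hypothetical numbers chosen only to show how a storage-heavy workload forces a collocated cluster to buy compute it does not need.

```python
import math


def colocated_nodes(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """Nodes required when every node supplies both compute and storage."""
    by_compute = math.ceil(cores_needed / cores_per_node)
    by_storage = math.ceil(tb_needed / tb_per_node)
    # The cluster must satisfy whichever requirement is larger.
    return max(by_compute, by_storage)


# Hypothetical workload: modest compute, a lot of data.
cores_needed, tb_needed = 200, 2000
cores_per_node, tb_per_node = 16, 20

nodes = colocated_nodes(cores_needed, tb_needed, cores_per_node, tb_per_node)
idle_cores = nodes * cores_per_node - cores_needed

# With a separate compute tier, compute is sized on its own.
compute_only_nodes = math.ceil(cores_needed / cores_per_node)

print(nodes, idle_cores, compute_only_nodes)
```

Under these assumed numbers, storage dictates the size of the collocated cluster and most of its cores sit idle, while a separate compute tier needs only a fraction of the nodes.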

Woods: The problem that we have now is the mess that’s been created. I think that Hadoop’s biggest victory was perhaps HDFS. HDFS provided a way to store huge amounts of data in a scalable file system based on commodity hardware. Hadoop’s ways of processing that data, MapReduce and later YARN, were not great gifts to the ease of programming at all. What Spark achieved was creating a platform that could sit on top of a large amount of data and was designed as an application development platform.

And so Spark abstracted the storage using the RDD concept, so you could pull data into these RDDs inside Spark. Spark had three or four different programming environments. You could program against the same data in the RDDs and you could have one workload do something, leave it in RDD, and have another workload pick it up. So the ability to pipeline through all these different compute frameworks was excellent. Now, we’ve mapped that world into the world of the cloud. Now we have the need for an RDD layer that spans all the storage we could possibly get to. And then we have the compute engines you mentioned already, whether it’s Redshift or Snowflake or Presto or any of those in the cloud.

What we need is the equivalent of that RDD layer so that you can process a workload, let it go, have its own intermediate form then be picked up by another compute engine, etcetera. The best way I’ve come to understand Alluxio is that it recreates that RDD layer in an abstract way, with some limitations, in the cloud. Is that accurate?

Mih: I think that’s a good ballpark way to talk about it. It really is not about a single stack in the data center; it’s an evolution that I think all data-driven organizations are having to deal with. There are the multiple frameworks (not just Spark and MapReduce, but ones being created every few years), and there is also the organization’s desire to have multiple clouds, perhaps to have less lock-in to one cloud. So the compute happens in different places and the storage may be in different places, but the thing that brings it all together is what we call the data orchestration layer. And that’s what was originally created at the Berkeley AMPLab.

The way I understand what data orchestration does is that it brings together, from an arbitrary set of sources, a bunch of data into a file system sort of construct that’s in memory.

That’s correct. Alluxio is a distributed system that forms a tier where you mount the different data sources you have. These can be in different clouds or on-prem. It has a unified namespace, so it’s a virtual file system for them. The data can be put into the memory of those instances, or it has intelligent tiering down to SSDs or disk, and that allows the hot or active data to be brought closer to the compute frameworks. And that is a unique differentiator for what data orchestration does.
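To illustrate the idea of mounting several sources under one unified namespace, here is a minimal sketch of a mount table that resolves a virtual path to a backing-store location by longest matching prefix. This is a toy model written for this article, not Alluxio's API: the `MountTable` class, the bucket name, and the HDFS address are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Toy unified namespace: maps virtual path prefixes to backing-store URIs."""
    mounts: dict = field(default_factory=dict)

    def mount(self, virtual_prefix: str, source_uri: str) -> None:
        # Normalize so "/sales/" and "/sales" are treated as the same mount point.
        self.mounts[virtual_prefix.rstrip("/")] = source_uri.rstrip("/")

    def resolve(self, virtual_path: str) -> str:
        # Pick the longest mount prefix that contains the requested path.
        candidates = [
            p for p in self.mounts
            if virtual_path == p or virtual_path.startswith(p + "/")
        ]
        if not candidates:
            raise FileNotFoundError(f"no mount covers {virtual_path}")
        best = max(candidates, key=len)
        return self.mounts[best] + virtual_path[len(best):]


table = MountTable()
table.mount("/sales", "s3://example-bucket/sales")        # hypothetical cloud source
table.mount("/logs", "hdfs://onprem-namenode:8020/logs")  # hypothetical on-prem source

print(table.resolve("/sales/2019/q1.parquet"))
print(table.resolve("/logs/app/today.log"))
```

A compute framework in this model only ever sees paths such as `/sales/...` and `/logs/...`; where the bytes actually live is a detail of the mount table.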

Is the point of this to be able to transform the data on the way into Alluxio, or are you just bringing data from multiple sources? 

Yes, today the mounting of those different data sources and bringing the active datasets into Alluxio is what the system does. We are adding more features and capability to do the transform along the way. And that’s something we’re excited to be talking about in the future.

The person that wants this is somebody that has a compute engine or a variety of compute engines sitting in one place, a bunch of data sitting in many different places, and they want to bring what’s relevant together into essentially a file system so that they can deal with it as if it were local?

That’s correct. It doesn’t have to be many places; it could be two places. It could be one framework or it could be a couple of frameworks. We’re seeing folks that are just starting to solve their locality and access challenges and improve performance along the way, and then we see them continuing to add more as the organization grows from there. The Alluxio system is good for read-heavy workloads and it has write capability as well. The sources are mounted in, and it then creates a catalog of all the metadata for those systems. Depending on the type of workload, it’ll pull the data the first time, bring it into the Alluxio tier, and keep that data and any intermediate results there as needed. That is how the data gets localized. The result is much better performance, and you no longer need to copy data to different locations and maintain those separate copies.
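The "pull it the first time, then serve it locally" behavior, combined with tiering from memory down to SSD, can be sketched as a small read-through cache. This is an illustrative model only: the `TieredCache` class, the least-recently-used eviction rule, and the slot counts are assumptions made for the example, not a description of how Alluxio is implemented.

```python
from collections import OrderedDict


class TieredCache:
    """Toy read-through cache with a small memory tier and a larger SSD tier."""

    def __init__(self, fetch_from_source, mem_slots=2, ssd_slots=4):
        self.fetch_from_source = fetch_from_source  # callable: path -> bytes
        self.mem = OrderedDict()  # most recently used entries sit at the end
        self.ssd = OrderedDict()
        self.mem_slots = mem_slots
        self.ssd_slots = ssd_slots
        self.source_reads = 0     # how many times we had to go to the remote store

    def read(self, path):
        if path in self.mem:                 # hot: served from memory
            self.mem.move_to_end(path)
            return self.mem[path]
        if path in self.ssd:                 # warm: promote back to memory
            data = self.ssd.pop(path)
        else:                                # cold: fetch from the remote source
            data = self.fetch_from_source(path)
            self.source_reads += 1
        self._put_mem(path, data)
        return data

    def _put_mem(self, path, data):
        self.mem[path] = data
        if len(self.mem) > self.mem_slots:
            # Demote the least recently used entry from memory to SSD.
            old_path, old_data = self.mem.popitem(last=False)
            self.ssd[old_path] = old_data
            if len(self.ssd) > self.ssd_slots:
                # Drop from the cache entirely; the source still holds the data.
                self.ssd.popitem(last=False)


remote = {"/a": b"A", "/b": b"B", "/c": b"C"}   # stand-in for a remote object store
cache = TieredCache(lambda path: remote[path])

cache.read("/a")
cache.read("/a")   # second read is served from memory, not the source
cache.read("/b")
cache.read("/c")   # memory is full, so "/a" is demoted to the SSD tier
cache.read("/a")   # served from SSD and promoted; still no extra source read
print(cache.source_reads)
```

Each path is fetched from the remote store once; every later read is served from one of the local tiers.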

But you’re not intending to write back to the original sources?

It will write back if that’s what’s required. Typically it’ll do an asynchronous write back, which we’ve found is a beneficial feature, because you may want to write back to the same location or to a different location.
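An asynchronous write back can be sketched as follows: a write lands in the fast tier immediately, and a background worker persists it to the underlying store later. This is a minimal illustration under stated assumptions; the `AsyncWriteBack` class and the in-memory `under_store` dictionary are hypothetical stand-ins, and a real system would also need error handling and durability guarantees that are omitted here.

```python
import queue
import threading


class AsyncWriteBack:
    """Toy write path: acknowledge writes from the cache, persist in the background."""

    def __init__(self, persist):
        self.persist = persist            # callable: (path, data) -> None
        self.cache = {}
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def write(self, path, data):
        self.cache[path] = data           # visible to readers immediately
        self.pending.put((path, data))    # persisted later by the worker thread

    def _drain(self):
        while True:
            path, data = self.pending.get()
            try:
                self.persist(path, data)
            finally:
                self.pending.task_done()

    def flush(self):
        self.pending.join()               # block until every queued write is persisted


under_store = {}                          # stand-in for S3, HDFS, or another target


def persist_to_under_store(path, data):
    under_store[path] = data


tier = AsyncWriteBack(persist_to_under_store)
tier.write("/results/model.bin", b"weights")
tier.flush()
print(sorted(under_store))
```

Because the persistence target is just a callable, the same sketch covers writing back to the original location or to a different one.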

You’re writing new datasets that are the results of assembling all this data. That’s what’s coming out.


The other analogy I think is reasonable to make is between the Python Pandas project and what Alluxio’s doing. What Pandas does inside a set of Python programs, creating a shareable representation of a dataset that can be accessed by multiple processes, is similar to what Alluxio’s doing on a larger scale, for larger amounts of data and a more abstract and heterogeneous set of compute engines.

That’s right. We really believe that the world will be moving toward hybrid and then multicloud, and that data will be heterogeneous as a result. We’re building an abstraction over all the different types of data so that you can run different workloads on top of it, and we handle the translation of the APIs. We think it’s a pretty exciting open source project and we encourage people to join and try it out.

What do you think the sweet spot is right now for commercial adoption of Alluxio?

As with most commercial companies that back open source projects, the sweet spot for Alluxio is when companies run into more mission-critical use cases for which they require guaranteed SLAs from a support perspective. Another is when they need some of the differentiated enterprise-only features, which are features that we create for enterprise customers. But that said...

What kind of processing use cases are people coming to Alluxio with? 

Once you have an abstraction, then you can achieve the real data-driven goals of the organization. What I mean by that is that the data is now available, wherever it is, to your compute frameworks. So it’s not just the one job on the on-prem Hadoop that you want to run in the cloud; it’s the many different jobs on multiple Hadoop clusters, and those can all now be enabled. The end goal of all this is to have more insights or to train models better. We believe that having more data to train models, and giving analysts and BI use cases access to more data, is the value that we provide in that environment.

The idea is that the more distributed data repositories that have valuable data in them and the more those repositories are distributed across multiple clouds, and the more users you have with the more different compute engines that want to get at collections of those data that are also distributed across all of those repositories, the more you would need Alluxio? 

That’s correct. I understand that many of your listeners have external-facing products, and those products are going to be much more data-driven too. They need a consistent level of user experience and a consistent level of performance, and a layer like Alluxio provides that, abstracting away S3 performance as well as hybrid data that sits across a WAN link.

The other aspect of it is that, in addition to collecting and orchestrating data from many sources, you’re also providing a much higher level of performance that can be tuned to the needs of the SLA, meaning data can be tiered, with most of it in memory or most of it in warm storage or whatever. So you’re able to actually deliver consistent performance as well.

That’s exactly right. People are sometimes finding that performance may change in the cloud. The object store wasn’t necessarily designed for these types of workloads; metadata operations especially tend to be quite expensive. Alluxio addresses that and makes it all consistent, which is really required when you want to have a great external-facing product experience.

How did this company come about? 

First of all, I’m not the founder. This was the research project of our founder and CTO, H.Y. Li, at the Berkeley AMPLab at a time when a large amount of innovation was happening in distributed systems. Three big projects came out of there: Spark was one, another was Mesos for containers, and the third was Tachyon for the data. His research looked at the future of heterogeneous frameworks and heterogeneous data, and at how a data-driven world would need a new layer. That’s what he ended up creating, and he made it an open source project that’s all Apache 2.0 based. He then created the company Alluxio to support the continued development of the open source project as well as its commercialization. I think it’s got a great pedigree coming from the AMPLab and it’s a pretty exciting project. And we’re very thankful for the number of committers we have and the community members that are out there.

I’m sure you’ve been watching with great interest all of the different things that have been going on in the open source world related to the use of open source projects to create commercial services on cloud platforms. And so some of those battles have provoked bitter resentment where the open source has just been essentially appropriated by a cloud vendor to create a service and then the open source community has felt ill-used and that their efforts were just appropriated. In other ways, there’s companies that have an open source version and then the cloud vendor—like Amazon—creates something that’s a commercial version of Elasticsearch. How are you going to navigate this world and protect yourself from the worst case scenarios?

I think that’s a fair characterization. I’ve found it very interesting how public perception has shifted over time, how Amazon is viewed as the predator and Google is viewed as the good guy. This is the third open source company I’ve worked for, and my take is that open source is good. It’s meant for the community and it’s meant for people to use in a free and permissive way. That’s why we have the Apache 2.0 license. We believe in that; it’s really critical to provide that to the community. How you monetize it is a separate question. If people create an external-facing service for Alluxio from the open source that the community built (and we were a big part of that community; we’re a different type of open source project than, say, a community-backed one), we’re really excited to see that. We think that’s great for adoption. It’s positive, it grows the amount of usage, and it helps the project grow as well. On the monetization side, we look at what the target markets need: what types of companies need additional enterprise features or additional support that is above and beyond what a community can offer. That’s our model. I think it’s been made out to be the bad guy versus the little good guys, and I think it’s a little more nuanced than that.

In terms of the commercialization and monetization model, this is essentially going to be an open core sort of model where you have an edition that is fully functional and performant, but then you surround that with enterprise features that are available for a commercial license.

That’s correct. So yes, roughly, it’s called open core. We make sure that our open source project is useful and can be deployed in production. We have many, many deployments; some of the top ten internet companies in the world use Alluxio in production for free, and we think that’s great for the community. But not every company has the technical expertise to run and maintain it, commit back, and be involved with the community the way those types of companies do. For those companies, we offer guaranteed-SLA support as well as a hardened enterprise version with enterprise features.

What are some of the enterprise features that are in that upgrade pricing?

Yes. Those are typically in the security area: different types of authentication, as well as different approaches for partitioning the data. Enterprises have asked us for these, and they’re comfortable getting both the support and those types of security features from us.

Give me an example of what you mean by partitioning the data.

There’s a feature in our 2.0 release we’re really excited about: cluster partitioning. That feature allows you to have multiple frameworks work in a separated way, because some companies have asked that the Alluxio data tier not commingle different data sources.

When you see the future of the cloud moving forward, you know, it’s really interesting to see how Google is aligning itself as a data platform cloud. Amazon seems to be this generic utility computing sort of cloud with lots and lots of different services. And Microsoft seems to be creating a very enterprise focused cloud. How do you see the way forward with all of these different clouds? And how will Alluxio fit in and play with all of them?

We think that cloud is a key part of why data orchestration has seen a big resurgence. A lot of the data is in silos or data lakes in an on prem data center or multiple data centers. And what’s happening is that more and more data is being generated that may be in the cloud and might land in S3 buckets or other object stores. It might land in a data center as well. And so depending on the organization’s overall cloud strategy, they’re going to need to find a way to have the data available wherever it may be located. Now the alternative is trying to copy all the data to a single place. And we find that that approach of the consolidated, centralized data lake works for some, but not for all. And most organizations will be siloed across their BU boundaries also and so that’s why we’re seeing this growth of hybrid cloud or indeed multiple clouds. 

As you said, there is somewhat of a specialized cloud approach that each of the major players is using to get to the market, and we want to side with the organizations that are dealing with that. I had mentioned lock-in earlier on. We think it’s great that there’s competition between the cloud vendors. To have a credible threat, you want to be able to have not only applications running in different locations, but also your data available regardless of where it’s tied up today. That’s where the Alluxio layer provides that capability. It provides not only all the benefits we talked about for being data-driven, allowing applications to have the data from wherever they are, but it also gives an organization less lock-in. We think that’s pretty exciting for the multi-cloud world that will end up happening over the next three to five, ten years. We think multi-cloud is a reality.

What is the coolest use case that you guys have seen of Alluxio?

We had a large financial company that had on-prem data in a big Hadoop lake, and they wanted to run a large number of Spark workloads against it to train some of their models. They couldn’t necessarily just buy more compute nodes on-prem, and they also didn’t necessarily need to have it all running all the time, so a cloud computing approach made a lot of sense to them. That said, they weren’t necessarily ready to copy all the data into the cloud and then run their compute there, for a few reasons, among them data sovereignty and data governance. They were able to get comfortable with running their compute with Alluxio in the compute nodes, running ephemerally, with thousands of jobs in the cloud accessing the on-prem data. They happened to use a non-persistent mode of Alluxio, and that got them very comfortable. The end result was that they were able to create their models four times faster, and that was very meaningful for their business. So we think that hybrid approach to analytics is a common way people are going to be using Alluxio: starting to adopt the cloud by using the compute there, using Alluxio to access the data and make it perform and make it local, and then mounting an S3 or GCS object store and starting to migrate.

The interesting thing about that last point you made is that it allows you to do a migration strategy as gradually as you want. 

That’s exactly right. With our 2.0 release, which happened just earlier this month, we announced the ability to do policy-based movement of the data. Now you can move it from on-prem to cloud, or vice versa, and do that gradually, so you’re online while you’re doing all this. What’s really interesting is that there are different tiers of data and, based on policy, you can now move that around. In addition, you may have data that’s very hot that you want pinned to all of your worker nodes. In that case, you can run a policy: for example, take the objects from your last 14 days of data and keep them on every worker node. That’s something that is very difficult to do on a manual basis and keep up to date, stitching it all together with a lot of scripts. Data architects, data platform teams, and data engineers get really excited about this kind of tooling. As you mentioned, the Hadoop ecosystem was a very complicated system to run. Now we’re making that much easier by having this layer in between for the data platforms.
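The "keep the last 14 days on every worker" example amounts to a selection rule over object metadata. A minimal sketch of such a rule follows; it is not Alluxio's policy syntax, and the `select_pinned` function, the object paths, and the dates are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def select_pinned(objects, now, max_age_days=14):
    """Return paths whose modification time falls within the last max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, mtime in objects.items() if mtime >= cutoff)


# Hypothetical catalog of objects and their last-modified times.
now = datetime(2019, 7, 15)
objects = {
    "/events/2019-07-10.parquet": datetime(2019, 7, 10),
    "/events/2019-07-01.parquet": datetime(2019, 7, 1),
    "/events/2019-06-01.parquet": datetime(2019, 6, 1),
}

pinned = select_pinned(objects, now)
print(pinned)
```

Re-evaluating a rule like this on a schedule is what replaces the hand-maintained scripts mentioned above: objects enter the pinned set when they are new and leave it as they age past the cutoff.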

This was the Early Adopter Research podcast, Designing Enterprise Platforms, where we attempt to look at technology that would be interesting when you are creating a platform to create a large amount of value in your enterprise.