A Perfect Data Fabric: A Q&A with MapR’s Ted Dunning
On a recent episode of the Early Adopter Research (EAR) Podcast, “Designing Enterprise Platforms,” EAR’s Dan Woods spoke with MapR’s CTO Ted Dunning. Designing Enterprise Platforms is a podcast about the deeper issues involved in creating big victories in enterprise technology. MapR is a company that has been around for most of the Hadoop era and has always taken the view that Hadoop was a great infrastructure that needed to be improved. MapR makes those improvements by reworking the file system and then layering a variety of other components on top of it. This is an edited version of their conversation, which can be heard in full on the podcast.
Woods: MapR has departed from the Hadoop-only ecosystem and is now creating what I think of as a data fabric, one that is there to serve the most scalable and complex geo-distributed applications in the modern world. Do you agree?
Dunning: That’s absolutely so. That was our vision from the beginning. Hadoop was the way that people talked about scaling for a long time, and so we definitely sold into that market. But from the beginning, our view was always: where would compute be in 10 or 20 years, and what would we need to do to meet those needs? It’s now 10 years after we started, and the needs that we’re meeting are very much the ones that we envisioned.
The way I think of MapR is as a data fabric. And by that, I mean that it presents a surface area so that an arbitrary set of consumers can get access to data in a variety of different ways, through APIs that are specially created for them, through streaming, through more traditional access methods such as SQL. Underneath that is a layer that consolidates and integrates data from a variety of different sources and also makes that data scalable. And at the bottom layer is MapR’s ability to cache, to search, to store all sorts of different types of data in a unified format so that you can make the APIs and the other data access methods more powerful. Would that be a fair statement about where you’re at right now?
Yes, I think so. There was one distinction that was glossed over a little bit in what you said, and that is that we view a data fabric as a thing our customers build out of the tools and components that we provide. We add a new word, dataware, to denominate our stuff and distinguish it from the data fabric the customers build. The key capabilities there are that data has to be made secure and has to be in the right places, sometimes multiple places. You may need a stream that has many ingress or egress points around the world but is some sort of global data structure. And you have to maintain plausible, usable levels of consistency across that entire regime.
When I got my latest tour through what you’re doing at MapR, it was interesting because the emphasis was very much on how MapR, with its data fabric infrastructure, could accelerate the adoption of microservices, and specifically the Kubernetes framework for managing microservices. I had always thought about microservices in terms of scalability, being able to arbitrarily create a network of services and then have a scaling point anywhere in that network. But at MapR, the product marketing and the product managers talked about how the benefits of microservices were not just about scaling but were also about improving the applications, making them more flexible, higher value and higher performance, and also making AI and ML workloads work a lot better because of the isolation and the other properties of microservices. Why have you emphasized that MapR can help you adopt Kubernetes and microservices in an orderly way?
If you think about what containers do, they let you run compute in a nice environment that can be different from container to container so that you don’t have to have an exact matching of environments and so on. That’s what containers do alone and that alone is not very useful. It’s nice, but it’s not where you need to go. When you add orchestration to what a container does, you are able to replicate containers as necessary to whatever scale you’d like, and typically it’s more than one because you typically want to not be dependent on the correct functioning of any small piece of hardware. You also want it to be positioned within the cluster correctly for multi-tenancy reasons, for access to special capabilities and such.
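Dunning’s point about orchestration replicating containers to whatever scale you need can be sketched with a minimal Kubernetes Deployment. The names and image here are purely illustrative, not anything MapR-specific:

```yaml
# Hypothetical Deployment: declare three interchangeable replicas and let
# the orchestrator decide where each one runs, and re-run them on failure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service        # illustrative name
spec:
  replicas: 3               # "replicate containers ... to whatever scale you'd like"
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
      - name: demo-service
        image: nginx:1.25   # any stateless container image works here
```

The operator never says which machines run these three copies; the scheduler handles placement, which is exactly the positioning function discussed next.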
What do you mean by positioned?
Well, Ellen Friedman gave a talk a little while ago about five things that executives should not know about their data systems. Not the five things they should know, but five things they should not know. And one of the key things is that you shouldn’t be able to point at a particular machine that’s doing a particular function in your system. You should have computing resources as a fungible thing; you should not necessarily know at any given point which machine is the database. You should have a system that positions these things for you and repositions them as necessary, either isolating them or co-positioning them. Either way, there’s a positioning function there. That positioning is part of scaling; that positioning is part of multi-tenancy. Those are all key constructs that come out of good positioning and replication. But the third thing that Kubernetes must do is name your systems and your services. It should provide access. And names are magical things. Names let you have power in a very, very concrete way. It all sounds like wicca, sort of magical reasoning, but names are important because they let you abstract away details that you should not know, like position, number of resources and such. Computing alone is only half the story. The other half is data.
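The naming function Dunning describes is what a Kubernetes Service provides: consumers connect to a stable name and never learn which machines are behind it. A minimal sketch, with illustrative names:

```yaml
# Hypothetical Service: clients resolve the stable DNS name "demo-service"
# and traffic is routed to whichever pods currently carry the matching
# label, wherever the scheduler has positioned them.
apiVersion: v1
kind: Service
metadata:
  name: demo-service
spec:
  selector:
    app: demo-service   # pods are selected by label, not by machine
  ports:
  - port: 80
    targetPort: 80
```

If the pods are rescheduled, scaled, or replaced, the name stays valid; this is the abstraction that lets executives not know where the database is.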
Exactly. What you’re saying is that if I create a system and I am able to implement it in a way that I have control and abstraction on the replication of the services, the positioning and the naming of them, then all of a sudden, I’m in a better place, even if I’m not going to be scaling to a large degree?
You’re in a vastly better place because you’re able to have machines which are now relatively vanilla and can be replaced, you can push stuff around so you can take stuff out for maintenance, you can do operating system upgrades. You have now decoupled this lower level of managing your resources from the higher level of managing your services.
Before we started recording, you mentioned that this was all a matter of isolation and that the isolation benefits that you just mentioned actually have to precede the ability to scale.
Absolutely. The isolation must precede scaling. Or another way to say it is, isolation is a necessary precondition for scaling a system, although not entirely a sufficient one. If you think about data, the things we have to do with data include replicating it, for safety or for bandwidth. We have to position it correctly. We don’t want two copies on a single rack if we can help it. We want to position data near its consumption so that we minimize network bandwidth, although that’s much less critical than it used to be, by two orders of magnitude. We want to make sure that we position data onto technologies and hardware that meet the current operational requirements, prioritizing either cost or performance. We might also want to make sure that certain classes of data are not collocated, that the data paths to these objects never intersect. This is sometimes a security constraint, because certain attacks which are not well-known can occur if adversarial processes run on the same computer. Colocation is a key thing, and lack of colocation is also a key goal. The last thing that we need to do with data, and this is absolutely critical, this is the revolution of what file systems are and what they mean, is to name it. Now, this is exactly what we had to do with computation: position it, replicate it and name it. That is what Kubernetes does for computation. There’s a coequal partner to Kubernetes, the data layer, which needs to provide those same capabilities for data. It’s like Kubernetes for data. That is the thing that we’ve been building for 10 years.
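The rack-isolation and naming constraints above can be sketched in a few lines of Python. This is a toy illustration of the placement rule ("no two copies on a single rack") and of naming data by a path rather than a machine; the node and volume names are hypothetical, and a real data layer does far more:

```python
from collections import defaultdict

def place_replicas(node_racks, num_replicas):
    """Pick one node per rack so no two replicas share a rack.

    node_racks maps node name -> rack name.
    """
    by_rack = defaultdict(list)
    for node, rack in node_racks.items():
        by_rack[rack].append(node)
    if len(by_rack) < num_replicas:
        raise ValueError("not enough distinct racks for rack-isolated replicas")
    # Take one node from each of the first num_replicas racks
    # (sorted only to make the sketch deterministic).
    return [nodes[0] for _, nodes in sorted(by_rack.items())[:num_replicas]]

# A tiny "name service": clients look up a volume by name and never
# need to know which machines hold the replicas.
cluster = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack3"}
placement = {"/volumes/clickstream": place_replicas(cluster, 3)}
```

Here a lookup of `/volumes/clickstream` returns three nodes, each on a different rack, and the caller deals only in the name.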
The idea here is that you’re in a complex environment and you’re trying to make sure that you’ve spread your computing out, scaled it and abstracted it properly. If you don’t pay attention to the things that you just mentioned about replicating, positioning and naming data, then for all of those services you’re going to have to create another service that feeds the data to them, and underneath that service you’re going to have to create something that follows it around and provisions the right data to it. You’ve created a microservices approach to provisioning that data, so that you can have data access APIs that keep up with your Kubernetes services architecture?
Yes. And that can scale coequally, and can position across larger, possibly geo-distributed systems. Everything you want for computation, you absolutely want for your data as well.
Let’s go through why this is an advantage for AI and ML workloads. In the briefing that I had, they mentioned that even people who aren’t using AI and ML at scale, who are just getting started with it, are finding that they’re accelerating their progress, not only in training models but also in going into production and managing the model lifecycle once in production. The way that you have delivered this supporting data infrastructure for microservices is helping them with those kinds of workloads. Can you talk me through why this is a help?
One reason is that machine learning is a new way to code. Instead of knowing our domain and what we want to do, we’re going to write some code but then use data to learn either some parameters to our code or even to actually learn code. We’re going to combine those together to build our services. That’s a new way to do this, and it has new risks, but it’s just another way to build a computational artifact. Machine learning in some senses is just another way to code. It needs to cause changes and so on. Now, a second point is that most of the process of machine learning consists of the bullshit, the logistics: getting the data ready, managing the learning, realizing you screwed up, realizing you started with the wrong question and so got the wrong answer. Once you finally have something you think works, then deploying the next version of it, deploying ten versions of it, monitoring it. All of this is far more work, and a very different skillset. It’s not necessarily all data science work, but it’s far more work than the actual machine learning part itself. This is particularly true for the vast majority of machine learning applications, which are not the remarkable, amazing image processing and voice recognition things that get so much press. They are the simple things, like: here’s a charge code on these transactions, that one’s probably wrong, let’s put it up for human review; or simple recommendation engines. The machine learning task sometimes shrinks down to almost nothing once you get the right question and the right data. Now, if 90% of the work is the logistics and bullshit, as I put it, then if you do that even ten times less efficiently, suddenly instead of 10% of your day devoted to machine learning, you have 1% of your day.
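Dunning’s arithmetic can be checked with a back-of-the-envelope sketch. The unit counts are illustrative: the day is fixed in length, so if the logistics portion bloats, the fraction left for actual model work shrinks:

```python
def ml_share_of_day(logistics_units, ml_units):
    """Fraction of a fixed amount of effort left for actual model work."""
    return ml_units / (logistics_units + ml_units)

# Baseline: 90% of the work is logistics, 10% is machine learning.
baseline = ml_share_of_day(90, 10)    # 0.10

# Logistics done ten times less efficiently: 900 units for the same output.
degraded = ml_share_of_day(900, 10)   # roughly 0.011, i.e. about 1%
```

So inefficient logistics does not just cost some overhead; it collapses the share of time spent on machine learning from about 10% to about 1%.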
You’re saying that the data logistics are so much easier when you have a data fabric: you can bring the data where you need it, make it accessible, and if you’re scaling, have it be accessed by scalable services, so you’re not struggling with lots of scripts and lots of ad hoc data pipelines and all of this stuff?
If everybody worries about everything all the time, nothing gets done. But if I can focus on one task and not pay attention to some parts of the problem, if I can have separation of concerns, then suddenly a huge amount of progress is possible.
I’d like to talk about the alternative approaches to microservices that sometimes work better when a top-to-bottom rewrite of your software isn’t possible. In application migration workloads, things like NGINX work really well as a very direct and simple microservices implementation. If you have a big monolithic application, you abstract away the UI and create a new one. Underneath that UI, you access the monolith one way or another as one huge service. Then maybe you have new services that power different parts of the expanding UI. As time goes on, maybe you’re able to break that monolith into two or three services, and then five or ten, and all of a sudden the monolith is no longer a monolith. Just as you’re talking about the data logistics being helpful in an ML application, it seems like if you’re going through a microservices-based application migration scenario using something like NGINX in the way I described, this data fabric could really serve that use case as well.
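The NGINX-based migration Woods describes is often called the strangler pattern, and the routing side of it can be sketched as a small reverse-proxy config. The upstream hostnames and paths here are hypothetical:

```nginx
# Hypothetical route split: the extracted service gets its own route,
# while everything else still goes to the legacy monolith.
server {
    listen 80;

    # Traffic for the newly extracted orders service.
    location /api/orders/ {
        proxy_pass http://orders-service:8080/;
    }

    # Everything else still hits the monolith, treated as one huge service,
    # until more pieces are carved out.
    location / {
        proxy_pass http://legacy-monolith:8080;
    }
}
```

As more services are extracted, you add `location` blocks one at a time; the clients only ever see the proxy, so the monolith shrinks behind a stable front.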
Yes, absolutely. If you think about it, when somebody starts describing a system, even a monolith, they almost always start drawing squares on a whiteboard and say this is some function, and then they draw an arrow and stuff goes to this other function, and then they draw another arrow and stuff goes over here. So they have in their mind something that’s a services orientation. But remember also that the definition of microservice has nothing to do with micro. It has everything to do, and only to do, with independent deployment. And so your monolith could turn into a microservices architecture without ever being split at all if you just add another monolith next to it that is independently deployed. They are now microservices. Having data in the form we like, and accessible, with primitives for it, is just a way of codifying a communication path. In a data layer, you should have comparable primitives that allow you to do alternative forms of communication. A good data layer takes the effort and tedium out of building services or breaking up monoliths.
What you’re saying is that if you want to accelerate your adoption of Kubernetes, of microservices, you had better have a plan for how you’re going to deliver data and handle all of the data-related issues, whether it’s the consistency you mentioned, the replication, the positioning, the scalability, and also building robust, powerful APIs that can consolidate whatever data you need, so that applications can access the data in whatever method is most natural to them. The argument you’re making is that MapR will allow you to do as much as you want at the microservices layer with Kubernetes and then have the data layer keep up, without becoming a source of delay or a lot of heavy lifting.
Yes, and without being a distraction. It lets you do what you want to be doing and not have to do a bunch of stuff you didn’t want to be doing that isn’t getting you toward your goals.
Are there any other related issues to this that we should be talking about?
Just to put an underline under that: isolation sounds like one thing, but that catchphrase you wrote down, that we shouldn’t have everybody worry about everything all the time, is an isolation motto, so that we can put some isolation around what we’re worried about. That’s the isolation ultimately that we’re talking about. You have to build it in at a technical level: you have to isolate your systems so that changing one doesn’t require you to change all the others, and redeploying one doesn’t mean you have to redeploy all the others. Essentially, this isolation is not just a human thing, it’s a technical thing as well, and an architectural thing. It helps us build systems that are enormously more efficient and enormously more reliable and robust than they would be otherwise. And having been on 24-hour call for new systems, I don’t want to do that unless I know that I’m never going to get the call, or nearly never. But I’m happy to do it if I get a call like once every four years. I’m happy to have everybody on the team on call if it’s that rare. That’s the sort of thing that we’re trying to build.