The Future of Hadoop: A Podcast with Splice Machine’s Monte Zweben

In this episode of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), EAR’s Dan Woods speaks with Monte Zweben, the CEO and cofounder of Splice Machine. Like Woods, Zweben has been an observer of the Hadoop ecosystem for quite some time. Splice Machine is a distributed SQL database that can handle transactional and analytical workloads at scale at the same time. Given that many of those analytical workloads were handled on Hadoop, their conversation centers on the recent happenings with Hadoop.

Their conversation covered:

* 5:25 — The history of Hadoop and the problems it faces
* 21:00 — Why schema on read was a major problem for Hadoop
* 31:00 — What is needed for a true big data architecture
* 42:15 — Why scale-out architectures are cheaper than scale-up

Listen to the podcast or read an edited Q&A of the conversation below:

Woods: We’re going to talk about what has gone on in the Hadoop ecosystem and why Hadoop seems to be fading so fast. We may not be able to declare Hadoop’s demise right now, but we certainly can talk about the negative impact that a lot of recent news has had on the Hadoop ecosystem. But first, you had a recent funding announcement for Splice Machine, so why don’t you tell me what happened with that and then we’ll start our discussion.

Zweben: Yes, we raised another round of capital to grow Splice Machine from its early stages of building this data platform to growing our go-to-market and bringing this to the marketplace globally. The leader of this particular round is a pretty well-known figure in the data space, Ray Lane from GreatPoint Ventures. Ray used to be the president and COO of Oracle, so it’s great to have Ray on our board, counseling us on our strategy, and being part of the Splice Machine journey. We were also very excited to have a strategic partner participate in this round, a strategic partner who’s critical for our go-to-market, and that was Accenture Ventures. Accenture, through their data and software engineering practices, is taking Splice Machine to market to their clients as a data platform for companies to modernize their applications with both enriched data and artificial intelligence.

Now let’s get to the Hadoop ecosystem. Here’s my summary of what I think happened in the Hadoop ecosystem and then I’d like you to talk about three points that you wanted to make: Hadoop arose out of the need to process huge amounts of data. Google had been doing it and wrote a paper about an architecture called MapReduce that could be very successful for processing large amounts of data. And the people at Yahoo, Doug Cutting and others, created this infrastructure called Hadoop to enable large amounts of data to be processed. Now, initially the use case was the indexing of the web, but quickly, as more and more data sources became available, it became clear that lots of other people had this big data problem, and Hadoop was an open source platform that allowed you to handle it. What it did most valuably was create a file system that was much cheaper than any other method and allowed you to store massive amounts of data on commodity hardware. Then on top of it, they created various generations of programming environments that allowed you to write programs that would sift through all of the data in parallel using the MapReduce paradigm, where you would sift through and process a bunch of stuff and get an intermediate result. Now, what was the problem? For engineering-rich organizations it wasn’t really a big deal. Netflix was able to use it and created all sorts of plumbing around it to make it easier to use, such as their Genie system. The problem, from a mass market enterprise uptake point of view, was that MapReduce as a programming paradigm wasn’t really that easy to use. Then they put YARN on top of it, which made it a little bit better, a little bit more abstract, but it was still a real programmers’ tool and it really wasn’t built as an application development environment. It was built as a data storage and data processing environment.
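The MapReduce paradigm Woods summarizes (map over the data in parallel, shuffle the intermediate results by key, then reduce each group) can be sketched in miniature as a single-process word count. This is an illustrative toy, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each group of intermediate values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big problems", "big data platforms"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

In a real cluster, each phase runs on many machines at once over partitions of the data; the appeal Woods describes is that the programmer only writes the map and reduce functions, and the framework handles distribution and failures.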
And then when Spark came along, which was built as an application environment, a lot of the focus on big data for applications went to Spark. At the same time there was another problem, which was that the Hadoop ecosystem, unlike almost any other open source ecosystem, had three major players surrounding the open source project. It’s a tremendous tribute to the Apache ecosystem that they were able to work together and have Cloudera, which was the earliest and best funded, and Hortonworks and MapR, all working on the open source project to some extent and cooperating but also competing to create a productized Hadoop distribution. But Hortonworks went public and their business performance wasn’t clothed in glory. Cloudera got the huge investment from Intel and then eventually went public. MapR never went public. Eventually Cloudera and Hortonworks merged and MapR just recently wasn’t able to get funding. This all slowed Hadoop’s progression, along with the cloud, which meant the need for the Hadoop file system to store big data became much less urgent because you could store all that data at scale in the cloud. Then other programming engines could process the data, whether it was in HDFS or in the cloud. The idea of using MapReduce or YARN as an application development environment became less and less attractive, especially due to Spark, and now there is a lot of bad financial news coming out around the Hadoop ecosystem. That’s my summary—is there anything you’d like to add?

I think you made an excellent summary and I’d like to emphasize two things. First off, the beauty of MapReduce in the beginning, and later the more mature evolution of those compute models with Spark, is the fact that until then, computer scientists were really managing very few processors in their applications. Some developers were good at thread programming and getting a bunch of CPUs to work together. But massively parallel processing was reserved for PhDs and masters in distributed systems. There were companies like Thinking Machines and MasPar that were trying to crack the nut on massively parallel processing. The beauty of MapReduce was that it allowed you to abstract away from getting multiple computers to work together, dealing with the age-old issues of distributed systems, including deadlock and livelock, and being able to deal with failures in a resilient way, especially when dealing with commodity hardware. But as you said, MapReduce and even the later evolutions like Hive and Spark still required you to think about parallelism in ways that were overly complex. Engineers and developers whose job is to write code every day were able to use this. But it wasn’t really architected to be usable in an easy way by IT people whose job is essentially to operate these kinds of engines.

What you’re pointing out is that the productization of Hadoop never really was completed. It always was a programmers’ toolkit. 

This isn’t about productization in my view. I think Hadoop was extremely well productized in many ways. It is really about understanding who the constituency or audience for that product is.

It was a product for programmers; it wasn’t a product for enterprise programmers.

Exactly. Facebook and Netflix and my old company Rocket Fuel, we used Hadoop intimately and used it at grand scale to produce machine learning. But we had engineers who were extremely experienced developers using it. And so what we did at Splice Machine, when we started the company years ago, is realize this fact. This whole point is the reason why I started our company. It was to democratize this technology and get it in the hands of IT and application developers that may not be experienced with distributed systems. 

You had three points that you think explain the challenge and the reason that Hadoop didn’t have a strong uptake. The first one was that schema on read was a mistake. Schema on read is the idea that you have a bunch of data and that you form a schema closer to the time of use and you don’t have to have the schema defined when you store the data. Why do you see schema on read as being a mistake and how would you explain what it is?

Schema on read was a reaction to the lack of agility that relational databases had at the time when Hadoop and scale-out architectures were emerging. The reason it was perceived that way is that the applications powered by relational databases were difficult to scale on the new, modern data sources that were coming into the marketplace, and people reacted and threw the baby out with the bathwater. Because relational databases required us to think hard about our data ahead of time and organize the data into tables whose columns were typed, it was very difficult for us to change our applications, and because these new data sources were coming into the marketplace quickly, we needed to change a lot. Moreover, these new data sources were quite voluminous. So the market decided it was going to move off of relational databases and onto these new architectures. And what the marketplace said was yes, on these new architectures, we have these really cool things that are key-value stores or object stores and you could put anything in there. And these are scale-out architectures, meaning the data is distributed across many different servers so you’re not constrained by centralized databases. What followed was this notion that these were very flexible architectures, where you could store anything in key-value stores, anything in object stores. This notion of schema on read came into play to provide the flexibility of getting data digitized. Schema on read means that you take data and you store it as quickly as you possibly can from its source, and you don’t spend a lot of time trying to curate it or organize it or structure it. You let the consumers of the data—the application developers who are going to use the data—worry about that; they’ll be the ones reading the data, and they’re the ones that’ll put that data into a structured schema. Well, this was a huge mistake.
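The contrast Zweben draws can be made concrete with a toy sketch. The record fields and helper names here are hypothetical, invented for illustration rather than drawn from any particular system:

```python
import json

# Schema on write: structure is declared and enforced before storage,
# so every stored record is already typed and curated on the supply side.
def store_with_schema(raw, table):
    record = json.loads(raw)
    table.append({
        "customer_id": int(record["customer_id"]),  # typed columns,
        "amount": float(record["amount"]),          # enforced at load time
    })

# Schema on read: store the raw bytes as fast as possible and leave
# structure to whoever eventually reads the data.
def store_raw(raw, bucket):
    bucket.append(raw)

def read_with_schema(bucket):
    # Every consumer must rediscover and impose structure itself --
    # the burden schema on read pushes onto the demand side.
    return [
        {"customer_id": int(r["customer_id"]), "amount": float(r["amount"])}
        for r in map(json.loads, bucket)
    ]

table, bucket = [], []
event = '{"customer_id": "42", "amount": "19.90"}'
store_with_schema(event, table)  # curated once, up front
store_raw(event, bucket)         # curated repeatedly, by every consumer
```

The cost Zweben points to shows up in `read_with_schema`: with schema on read, that parsing and typing work is repeated, inconsistently, by every application that touches the data.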

It was a mistake because of the idea that you could somehow do without this work up front?

Yes. And that you would be able to ubiquitously use this data for multiple purposes if you just stored it in raw form on one of these scale-out architectures. And I think it was a specious argument. And the reason is that, especially at volume, you have to think hard about how you represent your data for any application. That’s true for a standard application that might be a customer interaction type of application or just some business application. It’s especially true for machine learning applications. Representation of data is critical. And when you did schema on read, the result was a large data swamp versus a data lake.

The idea is that the illusion of schema on read allowed you to create a data lake of some sort. At first it probably worked, but when you came to scale, it did not scale very elegantly, and this is where you had the lack of productization that’s been addressed by companies like Dremio or Podium Data, which came in and said, look, we’ll provide that structure that’s missing from that large big data store.

They let you get at data from multiple sources and federate a query across many sources. But they don’t solve the problem of structure. Somebody has to structure that data in some way. I like to view this problem of data centralization, of enabling the enterprise to leverage data, as having two sides of a coin. It has a supply side, meaning you have to get the data and supply it to some location. And then there’s the demand side: somebody has to use the data. The data lake projects that relied on schema on read were half successful because they succeeded in supplying the data. But schema on read didn’t make it very usable.

The second problem you said that helped stop Hadoop was complexity and the duct tape compute engines. What was the problem with the Hadoop complexity of the ecosystem?

Right. Now many pundits on Wall Street and in the press, and even leaders of the Hadoop ecosystem, claim that this symptom of missing follow-on orders and expansion in accounts happened because of the cloud, that the Hadoop companies missed the cloud and everyone’s going to the cloud. But the point I’m going to make now is, I think, also a very real symptom, and one that will affect the cloud adopters of Hadoop-based systems. All of the people who are rushing from a complex on-prem implementation of Hadoop to a managed implementation of Hadoop in the cloud will remove the operational complexity of using Hadoop per se. But what they won’t be able to remove is the complexity of constructing applications. And the duct tape that you alluded to is the duct tape necessary to build these so-called lambda architectures. A lambda architecture is a term used by people in the community for an architecture that glues together an operational data platform for running an application in real time. That tends to be a key-value store. Examples are Cassandra and HBase, DynamoDB on Amazon, Cosmos DB on Microsoft—there’s a whole host of them. They duct-tape that to large batch processing engines that do analytical processing. Those are systems like Hive and Spark, and of course back in the day MapReduce, and they connect that up to streaming systems like Kafka. And then somewhere in this architecture there is usually a machine learning library or data science workbench. That wasn’t officially part of the old lambda architecture per se, but in the more modern systems a data science workbench is becoming more available. The problem is you need developers to literally connect these different data platforms together and keep the data consistent across all these different moving pieces. This is true on premise, this is true in the cloud, and thus this is essentially a big problem.
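The developer-written glue Zweben describes can be caricatured with in-memory stand-ins for the operational store (the HBase/Cassandra role) and the batch log (the HDFS/Hive role). The names and stores are invented for illustration; the point is that the application itself must perform both writes and keep them consistent:

```python
operational_store = {}  # serves real-time reads, like a key-value store
batch_log = []          # feeds analytical batch jobs, like files on HDFS

def record_event(key, value):
    # Developer-written duct tape: two writes to two systems with no
    # shared transaction, so a crash between them leaves the layers
    # disagreeing, and the developer owns that reconciliation.
    operational_store[key] = value
    batch_log.append((key, value))

def batch_view():
    # A batch job periodically recomputes an analytical view from the log.
    view = {}
    for key, value in batch_log:
        view[key] = view.get(key, 0) + value
    return view

record_event("clicks:user1", 1)
record_event("clicks:user1", 1)
```

Even in this toy, the real-time store and the batch view answer different questions (latest value versus running total), and nothing in the architecture itself keeps them in agreement; that is the consistency burden being described.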

So even though Amazon has a cloud instance of Hadoop in Elastic MapReduce, it still suffers from that complexity, and you’re suggesting that it will lose out to computing engines that are easier to program in and easier to manage?

There are a number of software companies that are pre-integrating these computational engines in various ways. We obviously think we do it in a very special way, but there are a number of companies that are forming architectures for the development of very advanced applications that make it easier than having to do this plumbing yourself. I like to use the metaphor of a car. When we need to go somewhere, we need to either buy or rent a vehicle. We don’t go out and find a fuel injector, a suspension system, a bunch of tires, and an engine and assemble them ourselves. We’re going to experience the same thing happening in the distributed computing space over the next few years. We’ll see this complexity go away, and there will be providers of software engines that put the right pieces together to accomplish particular tasks.

You’re saying that the right direction is pre-integrated platforms for advanced ML applications that handle all the plumbing to deliver high performance?

I definitely think that’s the right direction. There’s even a more tightly coupled environment that will provide that capability. Our strategy was to use the lingua franca of IT, which is SQL. There are probably trillions of lines of code out there written in SQL, if not more. If you can capture that environment that everyone knows, then I think you’re on the road to actually delivering something that’s usable.
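As a minimal illustration of one SQL environment serving both sides of the workload Zweben describes, here is a sketch using SQLite purely as a stand-in for a distributed SQL platform; the schema and data are invented for the example:

```python
import sqlite3

# One engine, one language: transactional writes and analytical reads
# against the same tables, with no ETL into a separate warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# Transactional side: insert individual orders as they arrive,
# inside a transaction (the context manager commits or rolls back).
with conn:
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("west", 120.0), ("east", 80.0), ("west", 40.0)],
    )

# Analytical side: aggregate over the same data in the same dialect.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
```

SQLite obviously isn't a scale-out system; the sketch only shows why SQL's familiarity matters: the same skill set covers both the operational insert and the analytical aggregate.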

Now the third problem, which was the Hadoop ecosystem’s wrong focus: focusing on the data lake versus the application.

Yes, in retrospect I see what’s happened over the last five years as interesting groupthink in our marketplace. With this new distributed file system that allowed you to have cheap storage of grand volumes of data, there emerged this massive thrust to get all the data in one place and form the quote-unquote data lake. Schema on read exacerbated this particular thrust. But the supply of data to a central location versus the demand for and use of that data was very bifurcated, and everyone jumped on the bandwagon of creating that central repository of data. What unfortunately happened is that there were data experts at the table, but the people who run the business from an IT perspective and the sponsors on the line-of-business side were at arm’s length from this particular data lake initiative. The data wasn’t curated because of schema on read, and the data was complex to get at for the different workloads because of the different computational engines. My perspective is: don’t think about the supply of data, think about the demand for data. Start with a line of business, start with the IT group that’s in charge of a legacy application that might serve that line of business, and ask how that legacy application could be modernized with new data sources. How can it be enriched? And even better, how can you incorporate machine learning models into that existing application to make it more intelligent, so that it’s using that data in an experiential way to learn from the past, make predictions about the future, and actually be more informed, more dynamic, and more adaptive to the changing conditions of that particular application? If you start with the app and ask what data it needs to take it to the next level and what data it needs to incorporate AI, now you’re focused.

As for Splice Machine, if you look at your strategy from the beginning, it’s been very similar in one respect to another large company that has had the same vision for combining analytics and transactional processing, and that’s SAP with their SAP HANA infrastructure. The whole idea of HANA was to create an in-memory database so powerful that it can be a transactional database using SQL while also being accessed as an analytical database at the same time. And if you are able to do that, then you don’t have to do a bunch of the transformations that you would have to do when moving data into a data warehouse. How would you describe what you’re doing as the same as and different from what SAP HANA is doing?

It’s a great question and I’ll even throw Oracle Exadata into the mix. These are two very impressive engineering efforts from groups that we love to be associated with, because we have similar ambitions. Our ambitions are to provide grand scale to the application developers of the universe, to provide a SQL environment that they’re used to, to provide interconnectivity to all of the BI tools in the world and all of the ETL tools in the world and of course to all of the developer environments that wrap SQL, like object-relational mapping systems, and to provide this new capability to essentially integrate or unify analytical processing and transactional processing. But where we fundamentally differ is on approach. Both Oracle and SAP looked at this as an engineering problem to be solved as an integration of both software and hardware. And if you look at the infrastructure required for HANA and for Exadata, these are what are called scale-up systems. They are not systems in which you take multiple inexpensive servers, which have a proclivity to fail precisely because of their inexpensiveness, and glue them together to create grand scale that’s very resilient. That scale-out architecture, the modern architecture that began with MapReduce that we talked about, is an extremely cost-effective solution, whereas the scale-up solutions probably have one or two more zeros on their total cost of ownership. They’re very performant. They’re extremely impressive engineered pieces of work. But from a total cost of ownership perspective, they are a completely different equation. Moreover, neither of these scale-up architectures is built on the open source components that the data science community would like to use in their everyday data science, whereas the Splice Machine scale-out architecture has all of the componentry we’ve been talking about. So summarizing: scale-up is way more expensive than scale-out.
We both try to integrate analytical and transactional processing. And moreover, those proprietary scale-up architectures don’t provide the open-source standards for data science.

So you would take an existing transactional application and use Splice Machine as the database for it, so that then you could get all the analytical processing? You’re probably not able to do this, but the most extreme example would be to take an SAP ERP system and use Splice Machine as the database, so that all the data in that SAP ERP system would be available for the analytic processing you’re talking about. It sounds like the strategy is that net new or custom transactional applications would be built on Splice Machine and then you would have that analytical processing come along with it.

It’s a great question and it is our punchline, and it is the fundamental difference of where we are in the marketplace and how we’re going to market with Accenture and others. We do indeed focus on legacy and custom applications that exist today and may be overlooked in the digital transformation because they’re locked in old technologies, and we enable those applications to become data-rich and intelligent. As far as I know, we are the only distributed SQL platform that allows an existing application to run on it transactionally with an in-database machine learning component. We modernize existing applications and we do it better than anyone else. There are just so many custom SQL applications out there that would benefit greatly. And we have a few great examples of this. One credit card company had a customer service application for dispute resolution, where you create a case when you don’t recognize a particular transaction, asking the credit card company and the merchant bank that issued your card to investigate the issue. The credit card company undertook a digital transformation to reduce the number of days necessary to resolve these cases, and in order to do this, they didn’t want to have to change their application. Their application was already on an existing, very well-known relational database management system, and they wanted to keep using that application. But to reduce the dispute resolution cycle times to a very small number of days, they needed orders of magnitude more data in their database, and the relational database couldn’t handle it. So Splice Machine today has essentially replaced that relational database management system.
They’ve migrated the app, and every time you swipe a credit card today, it streams onto Splice Machine seconds later and is available for any merchant bank, consumer, or the credit card company to access in milliseconds, across years of transactions rather than just weeks. That kind of new approach on old software is really the epitome of what we’re in the market of doing. The power of migrating legacy applications to a Splice Machine framework is that it gives you scale, it gives you the agility of being in any cloud or on-premise environment, and the in-memory and in-database machine learning capabilities are right in the fabric of the operational applications.

My last question is about larger corporate strategy for Splice Machine. If you’re going to focus on the application and if you’re going to try to do a platform sale, you’re basically selling a piece of infrastructure that somebody then has to use to create the value that you’re claiming. Why wouldn’t you try to actually speed the sales cycle and have an application be the arrow point of your go-to-market?

There is no doubt as we evolve there will be a set of reference applications that we and our customers and our partners like Accenture will develop. We’ll package these as reference applications that are good starting points for application development and over time they will become full-fledged applications. But as a small company, we crawl, walk, run. 

As an example, one of the reference applications we offer today is one that was built between us and Accenture. It is an order-promising application that sits on top of SAP and gives a sales organization a real-time selling tool to assemble complex orders on the fly, side by side with the customer, with real-time order-promising dates, sometimes called available to promise, on a line-item by line-item basis. It answers the question the customer always has, which is, “This is a great order, can I get it when I need it?” That order-promising capability requires a voluminous amount of inventory availability information and parallel processing for availability computations, and integrated with SAP it is a very powerful application. It also has a machine learning component, which learns over time the lead times for orders in ways that may be very contextual to that order, whether that context is weather data, bill of materials data, or supplier data. Your expected lead time for shipping from a distribution center or manufacturing plant to the customer may go from two days to 10 days because of some condition that exists, and the system is able to learn those conditions and make those promises on the fly.
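A stripped-down sketch of the available-to-promise idea described above: for each line item, find the earliest date on which projected inventory covers the requested quantity, then add a lead time that a learned model could adjust for current conditions. All names, data, and the lead-time adjustment hook are illustrative assumptions, not the actual Splice Machine/Accenture implementation:

```python
from datetime import date, timedelta

def promise_date(line_qty, daily_availability, base_lead_days, lead_adjustment=0):
    """Earliest date projected inventory covers the line, plus lead time.

    daily_availability maps a date to the quantity becoming available
    that day; lead_adjustment stands in for a learned, contextual
    correction to the base lead time.
    """
    on_hand = 0
    for day, qty in sorted(daily_availability.items()):
        on_hand += qty
        if on_hand >= line_qty:
            return day + timedelta(days=base_lead_days + lead_adjustment)
    return None  # cannot promise from projected availability

availability = {date(2019, 7, 1): 50, date(2019, 7, 3): 100}

# Normal conditions: a two-day lead time from the distribution center.
normal = promise_date(120, availability, base_lead_days=2)

# A learned model might stretch the lead time to ten days under
# disrupted conditions, as in the two-days-becomes-ten example above.
disrupted = promise_date(120, availability, base_lead_days=2, lead_adjustment=8)
```

In the real application this computation would run in parallel across many line items and a much larger inventory picture, which is where the scale-out processing discussed earlier comes in.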

So the answer is yes, you do believe that applications, especially reference applications in your situation, are a very good way of going to market because they accelerate understanding and uptake. But no, you’re not going to create your own productized versions; you’re just going to create reference applications. And I assume that your partners, like Accenture, will probably create their own reference applications that they bring as assets?

Yes, I think we’ll do it together quite a bit and I’m sure they’ll build their own as well.