Creating the Right Data Pipelines: A Podcast Conversation with Datacoral’s Raghu Murthy

On this edition of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), Dan Woods, founder and principal analyst at EAR, speaks with Raghu Murthy, founder and CEO of Datacoral. Datacoral is a product focused on automating the creation and maintenance of data pipelines. Now that every business has a wealth of data about its operations and its customers, there is a huge number of products focused on making that data valuable, and Woods and Murthy discussed how technology can create value from it.

Their conversation covered:

  • 2:30 — Datacoral’s origin story
  • 10:00 — Datacoral’s product dogmas
  • 35:00 — How Datacoral constrains the problem of creating data pipelines

Listen to the full interview or read an edited Q&A of the discussion below:

Woods: Let’s get started with the origin story. What’s your background and when did you realize that you had to create this new company?

Murthy: My background has been in distributed systems and databases. I've been an engineer most of my career and worked at companies like Yahoo back in the day, and I've been working on data systems for a while now. At Yahoo, as well as at Facebook, I ended up working on the overall end-to-end data infrastructure stack: collecting data from different places, organizing the data within databases, and even building the databases or the query engines themselves. After five years of working on data infrastructure and having dealt with that kind of scale, I decided not to work on data infrastructure anymore; I went to work on the mobile team and on an enterprise application platform. And about four years ago, I was at a VC firm called Social Capital. One of the key things that pushed me away from data infrastructure was that everything around it seemed to revolve around cluster management. You'd build these distributed systems, and they were deployed as clusters of multiple machines. Providing data infrastructure as a service for the rest of the organization meant that we were just managing these clusters: they'd get into bad states, we'd nurse them back to health, and once they were healthy again there was a whole backlog of work that had to be caught up on.

And these clusters could be either a bunch of data in a big file system, like an HDFS file system organized into a data lake, or a massive Teradata infrastructure used for analytics. The idea is that the clusters were all purpose-built for a certain function.

Yes, exactly. At Facebook specifically, we did have a fairly large HDFS cluster and a MapReduce cluster that went along with it. But there are also layers beyond the data warehouse or the query engine, such as collecting data into the data warehouse in the first place. If you're collecting petabytes of data, you need clusters of machines that are just receiving the data and writing it out into the distributed file system. And once the data is in the file system or in the data lake, you want to actually transform that data. Just orchestrating those jobs required a distributed system, so that was its own cluster.

You were essentially at the coal face of the hardest, widest, most scalable data supply chain in the world. You could compare it to only a couple of other companies, like Google perhaps, or Amazon, that had a data supply chain this wide, this deep, and this scalable.

Yes, absolutely. It's interesting that when you think about data infrastructure, it's actually growing geometrically. For example, Facebook had fewer than 100 million users back in 2008. In five years, they went up to about a billion users, so a 10x or 12x growth, which is pretty amazing. But when you think about the data volume growth, it is more like 4,000x or 5,000x. There are more users using the product more, which means they're generating a lot more data. Facebook also got better and better at using data to build better products and better ads, which meant that the data literacy of the company was growing. And that means more people doing more things with larger volumes of data. So if you're providing central data infrastructure as a service, that infrastructure needs to grow as a multiple of the data volume, the data diversity, and the data usage.

What comes out of this are some principles of how to do this right.

Yes, absolutely. And there were many things that we had to keep reinventing. One of the things about data infrastructure is that whatever system works at some scale, if you take it a couple of orders of magnitude higher, that system is brought to its knees.

Let’s discuss the product dogmas of Datacoral. By the product dogmas, I mean the core assumptions you’re making about the problem that exists in the market: why that problem exists, why that problem is growing, and how to address that and solve that. How did you choose the right scope?

I don't think I answered your previous question, which was how does that web-scale data infrastructure translate into something that enterprises need? Although the problems are the same, the scales are fairly different. One of the main things I've learned about how data infrastructure scales is that you want to look ahead at what kind of scale you might get to and choose the right technologies up front, so that you don't have to re-architect before you even start scaling. When you look at a small or growing company, which has maybe a few hundred gigabytes of data, you should already be thinking in terms of tens of terabytes if you're planning for the company to grow. A growing company typically means they're leveraging a lot of their data, and the data volumes also start growing fairly quickly. With the companies I've worked with, once the data is centralized, they needed a whole bunch of automation just to transform the data and get it to a queryable state, as well as to do the analysis. And finally, doing the analysis is one thing, but to leverage the analysis beyond just showing dashboards to people, you want to make that data actionable. I talked to company after company that kept telling me about these same kinds of problems. Clearly, I'd worked on these problems at much, much larger scales, so the question was: why isn't there something available for all of them? These are all cloud-native companies. They use the cloud quite a lot, so they know how to use other services, and they are okay using technologies managed by others. But what they realized is that even with that, they need a significant amount of engineering to put those pieces together.

What you're saying is that your core dogmas are, first, that data will grow faster than you expect and it's wise to prepare for massive data problems; that data infrastructure falls behind much faster than you expect and you have to make sure you can keep up; that work on data infrastructure crowds out the work on the last mile of making data actionable; and that while the cloud vendors provide raw materials, and other vendors also use those raw materials, the key value creation is figuring out the right architecture so that you can meet your own business needs. And neither the cloud vendors nor the other vendors are really selling you that architecture.

Right. These reference architectures are great for learning what might work in an idealized environment. But from company to company, the requirements are fairly different, both in terms of data volumes and data diversity. They're not recipes where you can press a button and get the entire thing set up for you.

The idea now is that you realized that you had to choose a certain problem to solve that you felt was important and that would address this. You didn’t choose to create a new database. What was it you decided you were going to create and why?

When you think about data in general, it's sitting in a bunch of systems that you use to run your business, or in the databases behind your applications. You might use tools like Salesforce or Zendesk or other SaaS services to build out your business, and you might have applications generating data based on your user activity. So your data generators, your data sources, are already there. The clouds and other providers give you data warehouses, and if you put data into them in the right kind of layout, queries on that data will be pretty fast. Data visualization has a lot more to do with human-computer interaction, how you visualize data and so on, and there are actually pretty good tools for that, like Tableau or Looker. The last mile, though, is getting the data to be actionable. It's widely acknowledged that data goes into dashboards to die. What that means is that people build dashboards, they start looking at the data for a little bit, and then they realize that in one of those little widgets the data is not getting refreshed. If that dashboard is actually super important, then there's a huge fire drill to go fix that particular widget.

Ultimately, you're saying that what you don't have, and what you need, is the ability to easily create data pipelines that extract data, reshape it into whatever form is needed, and then deliver it wherever it's needed. Not necessarily by building a dashboard, but by delivering it to the environment you will actually use, whether that's a spreadsheet, a data discovery system, an application, or whatever you want.

Right. This whole white space I talked about is how you make sure the data is flowing consistently from your sources into a database where it can be queried. Then there are still more systems needed to operationalize the transformations within the data warehouses themselves. And once the data is transformed and ready to be consumed, you need to be able to push it into the systems where the users live.

Your assumption is that there is a need to automate the creation of data pipelines and to automate the management and evolution of those pipelines. If you could do that, then all of a sudden the dashboards don't die anymore; they stay fresh and alive because they have updated data. And the number of people who can participate in creating data pipelines goes up.

Yes. Making it low code, or more declarative rather than based on scripting languages, means that analysts, people who just know SQL, should be able to piece together end-to-end data flows. They're not actually doing the plumbing; they're just thinking about the data itself, the shape and semantics of the data. And that results in the pipeline getting automatically generated, run, and kept consistent and up to date, and all that good stuff.

In terms of a product vision, I'm assuming that you expect well-behaved data as an entry point. If you're doing heavy data engineering or heavy data prep or data quality work, you want to do that before it arrives in Datacoral, and inside Datacoral you don't necessarily want to be doing heavy-duty data transformation.

It depends. There are data prep tools that give you a visual way of specifying how to clean the data, and those tools in some sense come up with the rules needed to clean and enrich the data. But then you want that cleaning to be applied in sync with the net new data coming in from different sources.

You’re saying that you aren’t going to be doing data engineering inside Datacoral but you are going to be doing more than just trivial joins? 

Yes, Datacoral provides that kind of glue that can tie all of these systems together. Think of somebody building an ingest pipeline. If, instead of building that pipeline by hand, somebody told you that by just writing a SQL statement, something like select from a Salesforce connection with these parameters into my table, an ingest pipeline gets generated automatically and data starts flowing, that's a much different experience. It empowers a data scientist or a data analyst far more than saying, okay, I now need to ingest data from Salesforce, so let me write a whole bunch of scripts and build out the pipelines needed to fetch data from Salesforce.
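
As a sketch of what that experience might look like, the pseudo-SQL below follows the shape Murthy describes. The interview does not show Datacoral's actual syntax, so the clause names, connection parameters, and table names here are hypothetical.

```sql
-- Pseudo-SQL sketch of a declarative ingest, not actual Datacoral syntax.
-- Declaring the source, its connection parameters, and a destination table
-- is the entire specification; the pipeline itself is generated.
SELECT *
  FROM salesforce.opportunity             -- assumed source object
  WITH (credentials = 'salesforce_prod',  -- hypothetical connection parameters
        schedule    = 'hourly')
  INTO landed.salesforce_opportunity;
```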

This is where the low-code aspect comes in. First of all, what you're always building in Datacoral is streams that keep data up to date. Second, you're building it using a low-code, declarative method. And you can do some of the messier data transformations if you need to. Now, do you actually drop people into Python if they need to write code? Or do you have your own transformation language? How does that work?

Yes. Actually, I'll go back to SQL, not just as a language but as an abstraction philosophy. SQL is pretty opinionated about the abstractions it provides. One is that it says to the user writing SQL, don't tell me how to do something, just tell me what you need. But there is another very key aspect, which is that it forces you to be very clear about what the data dependencies are. Think of the FROM clause of a SQL query: I know exactly what data I need to answer the question that I have. SQL made these two choices, a declarative language and explicit data dependencies, and they are what allow an underlying system or database to run very efficient data processing to give you the answers. When you think about building a data pipeline, it's the opposite. You're saying run this job, and then this job, and then this job. And because all of this is written in Python or some scripting language, you don't really know what the data dependencies are; they're all hidden inside those scripts. What that means is that debugging these pipelines is pretty hard. So what we said is, okay, let's apply the same abstraction model that SQL has, declare what you want and make the data dependencies explicit, but do it for end-to-end data flows. So a user who just knows SQL, as I mentioned, can get data from some external system by writing something declarative, and data starts showing up in a data lake or a data warehouse. And at that point, you have full access to the SQL of the data warehouse to specify all the transformations.

The first part of the story is I've got 20 different sources and I'm going to flow them together into tables that are connected in various ways, where the data dependencies are clear. Those tables are what I'd call the landed data. And once you've got the landed data there, you can create the modeled data, which is the next step, using views, which are expressions of transformations on the landed data, maybe enhanced with some other transformation assistance.

Yes, exactly. SQL thought about this a while ago. There's a notion of user-defined table functions, which give you full freedom to do all kinds of processing on an entire table's worth of data and then spit out whatever the resulting data set looks like. We have leveraged this notion of table functions to let people specify much more complicated transformations that are not directly representable in SQL.
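
To illustrate the landed-to-modeled step in plain warehouse SQL: the sketch below uses hypothetical landed tables and ordinary views, where each view's FROM clause makes its data dependencies explicit, which is what allows the dependency graph, and therefore the pipeline, to be derived from the SQL itself. More complex logic could be packaged as the table functions Murthy mentions, but views are enough to show the structure.

```sql
-- Hypothetical landed tables populated by the ingest step:
--   landed.salesforce_opportunity(id, account_id, amount, stage, closed_at, updated_at)
--   landed.zendesk_ticket(id, account_id, status, created_at)

-- Modeled layer: ordinary views whose FROM clauses declare their dependencies.
CREATE VIEW modeled.account_open_pipeline AS
SELECT account_id,
       SUM(amount) AS open_amount
  FROM landed.salesforce_opportunity
 WHERE stage NOT IN ('Closed Won', 'Closed Lost')
 GROUP BY account_id;

CREATE VIEW modeled.account_open_tickets AS
SELECT account_id,
       COUNT(*) AS open_tickets
  FROM landed.zendesk_ticket
 WHERE status <> 'closed'
 GROUP BY account_id;

-- A downstream view that depends on the two above; the dependency graph
-- (landed -> modeled -> downstream) is readable straight from the SQL.
CREATE VIEW modeled.account_health AS
SELECT p.account_id, p.open_amount, t.open_tickets
  FROM modeled.account_open_pipeline p
  LEFT JOIN modeled.account_open_tickets t USING (account_id);
```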

That gets you to what I call the modeled data, which is data built for general use in your enterprise. The next step is usually the purpose-built data, which is the data for one specific dashboard or application, or to support an autonomous system or whatever. So how do you go from the modeled data to the purpose-built data?

Cleaning up or prepping the data is one kind of transformation. But then producing purpose-built data is just another kind of transformation, which is just downstream. It’s just views and views on top of other views.

At the end of this, you have a data set that can be accessed. Is that delivered as an extract or is that delivered through a query? 

The last part of the end-to-end data flow is making the data actionable. We allow people to specify that as well: take some data set that's sitting in your data warehouse or data lake and publish it into applications. Or, of course, because the data is living in a database, you could have a visualization tool connect directly to the database and pull the data. What we offer is a way for users to choose what they want to do with the transformed, actionable data and then take those actions directly as well.

I assume you have your own declarative language at some level to allow you to assemble this stuff?

For the extract and load, which is what we call collect, we came up with a SQL-like syntax. But most of the transformations are specified in the SQL dialect of the underlying data warehouse or data lake. And the publish side is the dual of the collect side: you can also say, select from this particular table in the data warehouse into Salesforce, with these parameters, into this object. That results in an egress pipeline that can take the data and convert it into API calls that go into the applications. Users write SQL for the transformations, but once they decide to take that transformation and make it into a pipeline, that requires zero additional code.
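
The publish side Murthy describes might look something like the following pseudo-SQL sketch, the dual of the earlier ingest example. Again, the actual Datacoral syntax is not shown in the interview, so the parameters and object names are hypothetical.

```sql
-- Pseudo-SQL sketch of a declarative publish ("egress"), not actual Datacoral syntax.
-- The declaration names a transformed data set and a destination application object;
-- the system would turn it into an egress pipeline that converts rows into API calls.
SELECT account_id, open_amount, open_tickets
  FROM modeled.account_health
  INTO salesforce.account                 -- assumed destination object
  WITH (credentials = 'salesforce_prod',  -- hypothetical connection parameters
        mode        = 'upsert',
        schedule    = 'hourly');
```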

You specify the transformations, and then instead of having to spell out the how of the pipeline, you just declare that you want a pipeline referencing these data and these transformations, and it starts existing and running.

Yes, exactly. Because it's all data dependency-based rather than job- and task-based, somebody who knows how to write SQL can say, okay, I know that I transform data this way, I'm going to name it, and then I can use this view in other queries. Databases have this notion of materialized views, but there are a whole bunch of constraints around a full implementation of materialized views. What we have done is constrain the problem around materialized views.
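
For reference, standard materialized-view syntax (PostgreSQL and Redshift both support the statements below) shows the baseline Murthy is contrasting with: the query result is stored, but keeping it fresh, and keeping anything downstream of it fresh, is left to the user or an external scheduler. The table names are the same hypothetical ones used above; the interview does not detail how Datacoral's constrained version works.

```sql
-- Standard materialized view: the query result is stored for fast reads,
-- but refreshing it is the user's problem.
CREATE MATERIALIZED VIEW modeled.daily_revenue AS
SELECT DATE_TRUNC('day', closed_at) AS day,
       SUM(amount)                  AS revenue
  FROM landed.salesforce_opportunity
 WHERE stage = 'Closed Won'
 GROUP BY 1;

-- Refreshing is a manual (or externally scheduled) statement.
REFRESH MATERIALIZED VIEW modeled.daily_revenue;
```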

The way that Datacoral works is you have a collect component that brings all the data in. Let's say you wanted a certain subset of your Salesforce data represented in a Datacoral pipeline; you would define that in the collect function, and it would be brought into whatever data infrastructure you use. You can transform it on the way in to its materialized state or its view, and you can transform it on the way out, using SQL and user-defined functions and everything. And you can create as many layers of transformation, as many layers of views, as you want. But then Datacoral handles the change data capture and the data logistics to make sure that at any point, that view is up to date?

Yes, exactly. That's literally what we believe our defensible advantage is. Given that people are specifying something declarative, we capture metadata and we make sure that there's a consistent view of how the data is flowing. That's what allows us to provide a robust pipeline underneath. The second thing we have done is build the whole underlying system using purely serverless technology, which means the underlying infrastructure, the pipeline implementation itself, is really robust. We have gotten away from doing any kind of cluster management because the entire pipeline is serverless.

You support a variety of sources of data. And then what kind of data warehouses do you support? 

We started off with Redshift because we are on AWS and AWS also started offering Athena, which is Presto as a service, so we added support for that. And there is Snowflake, which is becoming super popular nowadays so we have support for that as well. And we plan to add support for more warehouses in other clouds as well.

When somebody tries to access data, they access it either by going to a directory where the data has been published as a file, or by querying the data warehouse using SQL.

There might be data sitting in files, but we are able to provide connectors directly into tools or databases. If you have a production database, we have a serverless ingest that can fetch the changes from your production databases and then apply those changes to your warehouse. The consumption end is similar: you can publish to a production database as well as to tools like Salesforce and other applications.
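
To make the change-capture-and-apply idea concrete, a generic way to apply a batch of captured changes to a warehouse table is a standard MERGE statement, as sketched below for a warehouse that supports it (Snowflake, for example). The staging table of change records and its columns are hypothetical; the interview does not describe Datacoral's internal mechanism.

```sql
-- Hypothetical staging table of change records fetched from a production
-- database, one row per change, with an operation flag:
--   staging.users_changes(id, email, plan, updated_at, op)   -- op in ('I','U','D')

MERGE INTO warehouse.users AS t
USING staging.users_changes AS c
   ON t.id = c.id
WHEN MATCHED AND c.op = 'D' THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET email = c.email, plan = c.plan, updated_at = c.updated_at
WHEN NOT MATCHED AND c.op <> 'D' THEN
    INSERT (id, email, plan, updated_at)
    VALUES (c.id, c.email, c.plan, c.updated_at);
```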

It sounds like what you're doing is taking something that used to require hand-built engineering, productizing it, and putting declarative statements on top so that you don't have to do all the coding; it's all generated?

Yes, absolutely. Each of these different systems requires engineering. You're building these jobs and tasks, you're building all of these scripts, which essentially combine the business logic of the data, the actual transformations you want to do, with the orchestration logic: run this job, then this job, and these are the dependencies, and so on. You're explicitly hand-coding these things, so it almost feels like assembly language. Because you're having to piece these things together manually, you require a lot of engineering. We have taken the data pipeline, which feels like assembly language, something low level, and provided a high-level abstraction layer that allows more people to build really robust pipelines.

What is the sweet spot for the companies that are adopting Datacoral? 

What we are building is actually fairly broad, and we believe we can solve for very different kinds of business use cases. But the common thread is always companies that are trying to up their data game. Either they have nothing in terms of a data warehousing strategy, or they have one or are building one, they're not happy with it, and they're trying to rethink how their data warehouse is actually going to be valuable for them. For now, we are going after companies that are midmarket to early enterprise: they're growing, and they're finding that they either want to rethink whatever systems they have or want to take on new data warehousing projects. So we tell them, okay, you can choose your data warehouse, but we can help you with the rest.

It sounds like the sweet spot is companies that are struggling to build data infrastructure to support a growing number of data pipelines, who realize that this is going to be a bottleneck for them, and who start to do it and realize they don't want to spend so much of their investment money on this type of infrastructure. The leverage they get is simplifying this data integration problem, this data pipeline management problem.

Yes. One of the key ways we differentiate is that we offer our product as a fully managed service, but because we are built in a completely serverless manner, it is fairly straightforward for us to deploy all of our software within our customers' environment, and that is how we do it. So companies that really care about their data privacy, and nowadays I believe every company should care about data privacy, find us attractive because they don't lose control of their data. No data flows through Datacoral's systems. We're finding that companies that care about data privacy really like that.

What are your predictions for how this problem will grow and how data infrastructure will change? And what do you see as the strengths and weaknesses of the large cloud players with respect to data infrastructure?

I think everybody will agree that the amount of data most companies have to manage is only going to grow, and there will always be a need for these levels of abstraction. The people who are your citizen data scientists or citizen data analysts will keep asking for better and better abstractions, and it will be up to vendors like us and others to provide abstractions that allow these people to leverage data without really having to think about the underlying systems.