What Works for a Modern Data Architecture: A Podcast with CTO Dan Seltzer
In this edition of the Designing Enterprise Platforms podcast of Early Adopter Research (EAR), EAR’s Dan Woods speaks with Dan Seltzer, the former CTO of Credit Junction, about modern data architectures. The term modern data architecture has been used by vendors to mean a variety of things but Woods’ conversation focuses on the transition from the data architecture of SQL databases and data warehouses, and Hadoop, to the cloud. Their conversation covered:
*5:15 — What do we mean when we say modern data architectures?
*18:30 — The need to know where your data is without using complex code
*33:00 — The functions of data catalog
Listen to the podcast or read an edited Q&A of their conversation below:
Woods: Your career has been essentially as a very engineering-oriented CTO who’s been involved in many different kinds of activities, from consulting firms to product-based startups to being involved in the belly of the beast of the large enterprise. How would you describe your expertise and your career?
Seltzer: 37 years of building things at increasing levels of scale and complexity, lots of industries, finance, education, health. And you’re right, it’s very much hands-on. I’m wary of using the word engineer because I try to surround myself with the people who do that better than I do. I think I live at the intersection of evolving business needs, new business models, transformations, and emerging technologies and how you use them.
The way I think of what’s going on today is that ultimately people are being asked to deliver what I call data products: data that is well curated and well organized, with a semantic model and a user interface. That data product is what is presented to users and used in the context of a business. Creating that data product is where you need a modern data architecture. The first layer underneath is the landed data, where you take data from a variety of sources and land it somewhere. You can do data engineering before you land it, and once it’s landed, you can do a variety of transformations in whatever repository you have. After the landed data, you do some sort of transformation and create reusable data objects. This is what I call the modeled data. The modeled data is about crafting the data, cleaning it up, and integrating it for a particular purpose. The modeled data may have many layers of models, because you might have a canonical customer record, a canonical product record, and other types of records that are intended to be reused widely. On top of those, you may have reusable records for a certain use case. And on top of those is the last level, which I call the purpose-built data, where you take the modeled data and craft it for a specific purpose. Throughout, you have a variety of different ETL processes. Usually the ETL from the landed data to the modeled data is a little simpler and the ETL from the modeled data to the purpose-built data is even simpler. The more you can make all of this self-service, the better. This has changed from an environment where the whole process happened inside a data warehouse to a new architecture in which the landed data may be an entire repository of object storage, or a relatively unformed repository like a data lake.
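As a minimal illustration of the three layers described above, the following Python sketch moves a few records from landed to modeled to purpose-built form. The source names, field names, and the revenue question are all hypothetical and chosen only for illustration; a real platform would use a repository and ETL tooling rather than in-memory lists.

```python
from collections import defaultdict

# Landed layer: records as they arrived from two hypothetical sources.
landed = [
    {"source": "erp", "cust": " Acme Corp ", "amount": "120.50"},
    {"source": "webshop", "cust": "acme corp", "amount": "79.50"},
    {"source": "erp", "cust": "Globex", "amount": "10.00"},
]


def to_modeled(record):
    """Modeled layer: clean and type the raw record into a reusable shape."""
    return {
        "customer": record["cust"].strip().lower(),
        "amount": float(record["amount"]),
        "source": record["source"],
    }


def revenue_per_customer(modeled_records):
    """Purpose-built layer: shape modeled data for one specific question."""
    totals = defaultdict(float)
    for rec in modeled_records:
        totals[rec["customer"]] += rec["amount"]
    return dict(totals)


modeled = [to_modeled(r) for r in landed]
report = revenue_per_customer(modeled)
```

The modeled records can be reused by other purpose-built outputs without touching the landed data again, which is the point of separating the layers.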
Now, when we talked about this the last time, this architecture brought up a variety of responses from you, and that your idea of a modern data architecture was a little bit different.
I would start by stepping back and talking about why we institute this architecture and why we don’t just keep doing what we’ve always done. One reason is that the number of data sources an enterprise needs to consult to answer business-critical questions is greatly increasing. It used to be that, to answer the question of how many widgets did we make, how many did we sell, and where did we sell them, you had your in-house custom app or you had your ERP system. But increasingly the questions that we want answers to, which is really what we use data for, whether through a predictive model or a straightforward ad hoc analysis, depend on data that’s increasingly distributed. It’s no longer simply the case that you put a bunch of engineers to work on your internal system to push that data into a warehouse. You need to be able to pull from lots of systems. A second issue is that the nature of the questions we’re asking as an enterprise is changing, and the time in which we need to evolve those questions is much shorter. There are more types of questions: can we make a prediction, can we look at a time series, can we spot anomalies. It’s no longer just what’s the total, what’s the total last week. And those questions are evolving quickly because businesses are having to respond to changes in the market quickly, so we need something that can evolve answers faster. Third, the scale of the data is getting much larger because we’re able to get so much more fine-grained data. The kinds of architectures we’re evolving toward have tools to provide relevant, meaningful, timely answers to core enterprise business questions, but they are operating in a new world with far more data and more complex data sources. They are productizing data. It’s a distributed solution. Older warehouses were typically point-to-point. Today a more distributed system means you have far greater variability in your inputs.
So as to the layers, I would say landing zone, initial transformation, productize it and as you said, there’s a progression from data that’s closer to how you found it in the source to data that’s closer to the analysis you want to use. And ultimately, the last step is the analysis, which is what you want to use. One of the most important concepts, I think, in modern data architecture is the transition from what is referred to as schema on write to schema on read. Schema on write is a traditional SQL warehouse where you have decided up front what the schema is in this relational database that is your warehouse.
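To make the contrast concrete, here is a minimal Python sketch of schema on read, in which raw records are stored without any declared structure and a schema is applied only when an analysis reads them. The record contents and the `read_with_schema` helper are hypothetical, invented for this illustration.

```python
import json

# Raw events stored as-is (for example in object storage); no schema was
# declared when they were written, and their shapes differ.
raw_lines = [
    '{"user": "a1", "plays": 3, "device": "phone"}',
    '{"user": "b2", "plays": "5"}',
    '{"user": "c3"}',
]


def read_with_schema(line, schema):
    """Apply a schema at read time: pick fields, coerce types, fill defaults."""
    doc = json.loads(line)
    return {
        field: cast(doc[field]) if field in doc else default
        for field, (cast, default) in schema.items()
    }


# The schema belongs to this analysis, not to the stored data.
plays_schema = {"user": (str, None), "plays": (int, 0)}

rows = [read_with_schema(line, plays_schema) for line in raw_lines]
```

A different analysis could read the same stored lines with a different schema, for instance one that keeps `device`, without any change to what was written.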
And this essentially requires some sort of object storage where you can store a piece of data without having to say anything about its structure. You have developed a variety of methods where you don’t like to be locked into a product. You would rather create that structure in a way that’s more independent of any product, and you use open source tools to do that. That’s one of the most interesting aspects of what you’ve built in your most recent startups that I’d like to explore.
The motivation for that is that it’s a risk factor. And I think it’s one of those risks that people take without understanding that they’re taking it. For instance, there are ETL products and BI products and other things that will allow you to express that model you’re referring to, which is a description of the shape of the data that you want, in their product. If you look at Looker, Looker uses LookML to represent this kind of data organization, this data model. The landed data is actually inside the SQL database. What they’re doing is creating the modeled data using LookML, which is their proprietary language. My concern is that that expression of a model is a critical intellectual property asset of the enterprise. It’s a critical piece of the data product development process and it needs to be free and clear, open, transparent, portable, versioned, governed. I’m very uncomfortable taking the convenience of putting it into a product without fully understanding what you’re giving up down the line.
The way Looker works is that LookML is treated like source code. So it goes in a source code repository where it can be versioned and it can be controlled. But there are a variety of other modeling environments where it’s much less like code.
I don’t mean to pick on Looker. It’s just that that piece is such an important piece in the rest of your data platform that it feels like one of the core assets that you want to own and control.
How do you build that in a way that achieves the goals for it that you mentioned?
We are still trying different things. It represents a significant investment on the part of an enterprise both to do the work to figure out what those model representations should be, to choose an open-structured platform for expressing them, and then to evolve and maintain them over time. In the past we have been expressing them within source code but in a sort of a declarative or functional paradigm using languages like Python and Java in ways that both express it descriptively, so it’s approaching data, but is versioned and managed through the engineering process. It’s not ideal. In the case where we’ve implemented in functional Java, we’ve defined a fluent way of expressing those. The target model, where it comes from, how it’s transformed, is in a way that is a little domain-specific language implemented in Java. So you get the benefits of all the things of it being in the code. It’s processable, once it’s processed it constructs objects in memory. To me, the specific implementation is not as important as the fact that you are describing these transformations.
And they are abstract in that they can be rendered in other different code, that’s the real news here.
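The idea of a small fluent, declarative description that can be rendered into different code can be sketched as follows. This is not the implementation discussed in the conversation, which was in Java; it is a hypothetical Python illustration, and the `Mapping` class, table name, and column names are all assumptions.

```python
class Mapping:
    """A tiny fluent description of how a target model is derived."""

    def __init__(self, target):
        self.target = target
        self.source = None
        self.fields = []  # (target_field, source_column) pairs

    def from_source(self, source):
        self.source = source
        return self

    def field(self, name, column):
        self.fields.append((name, column))
        return self

    def to_sql(self):
        """Render the description as a SQL SELECT statement."""
        cols = ", ".join(f"{col} AS {name}" for name, col in self.fields)
        return f"SELECT {cols} FROM {self.source}"

    def apply(self, rows):
        """Render the same description as an in-memory transformation."""
        return [{name: row[col] for name, col in self.fields} for row in rows]


customer = (
    Mapping("customer")
    .from_source("crm_accounts")
    .field("customer_id", "acct_no")
    .field("name", "acct_name")
)

sql = customer.to_sql()
rows = customer.apply([{"acct_no": 7, "acct_name": "Acme", "region": "EU"}])
```

Because the mapping is plain source code, it can be versioned and reviewed like any other code, while the description itself stays independent of whichever engine ends up executing it.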
Right, because the bad place is when it is like the old thing with stored procedures. If your critical business logic is living in PL/SQL, in an Oracle stored procedure, how are you going to know where it is, who changed it, how it’s versioned, et cetera?
What would your problem be with something like one of these ETL platforms like Talend, Syncsort, or Attunity where you could express the transformations from the landed data to the model data in an abstract way?
I’m arguing for an analytical approach to selecting them. There are too many platforms out there, and they’re evolving too quickly, for us to keep on top of what every one of them does. But ultimately, when we are trying to solve a problem, we do a quick, criteria-based analysis that looks at cost, lock-in, performance, security, governance concerns, whatever the deciding factors are, and for a given context, a given organization may or may not want to go down that path. Ultimately we’re all taking on some lock-in.
Once you’ve built the data product, you have to allow people to find it and also to adjust it if they need to. What have you noticed about what’s going on in the current marketplace that is helpful and what is missing?
You’ve described it exactly: the value of the data lake, or whatever your data platform is called, is directly related to the ability to find what you need in it to construct some new analysis that generates business value. In the most naïve implementation, that finding means you ask other human beings. A data catalog introduces something more structured, more persistent, more shared, more accurate, and more performant than just asking the people you know. We have done some experiments using documentation generators that work from attributes embedded in source code. So I think there are a lot of ways to get there. My instinct says that the hardest part is keeping the data catalog up to date. So generating the data catalog directly from those expressions of the model and the transformations would be ideal, because you have the lowest additional cost and the lowest delta between what’s documented and what actually exists. I think it’s a pretty ripe area for product development. I think where you’ll see the accelerated evolution of those metadata catalogs is when there are powerful tools that leverage them to shorten the time and cost of delivering a new analysis.
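One way to picture a catalog generated from the model definitions themselves is the following Python sketch. The `FieldSpec` and `ModelSpec` classes and the example fields are hypothetical; the point is only that the catalog is derived from the same objects that drive the transformations, so it cannot drift from them.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    name: str
    source: str       # where the value comes from, as "table.column"
    description: str


@dataclass
class ModelSpec:
    name: str
    fields: list


def build_catalog(models):
    """Derive searchable catalog entries directly from model definitions."""
    return [
        {"model": m.name, "field": f.name, "source": f.source,
         "description": f.description}
        for m in models for f in m.fields
    ]


def search(catalog, term):
    """Find entries whose field name or description mentions the term."""
    term = term.lower()
    return [e for e in catalog
            if term in e["field"].lower() or term in e["description"].lower()]


models = [ModelSpec("customer", [
    FieldSpec("customer_id", "crm_accounts.acct_no", "Canonical customer key"),
    FieldSpec("name", "crm_accounts.acct_name", "Legal name of the customer"),
])]
catalog = build_catalog(models)
hits = search(catalog, "legal")
```

Regenerating the catalog whenever the model definitions change keeps the documentation cost close to zero, which addresses the upkeep problem raised above.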
I recently wrote an article called “Are Semantics the Undead of Enterprise Software?” I posited that, like a zombie, the idea of using formal semantics against data to describe the structural meaning of that data is an idea that just never goes away. It seems like one of the things you achieved at your last startup was that you were able to divorce the data from the semantics and have a semantic, descriptive layer that drove the behavior of the system. That allowed you to point both the transformation program and a set of semantics at a data source and get the data you wanted into the canonical model without having to rewrite that program every time. To what extent do you believe that you had a formal semantic model there?
I love this topic. I think it’s helpful to make a distinction between analytical contexts in which your data already is semantically clear. I’m thinking Spotify analytics. Playlists, who played them, how long, how long have they been on. Those are pretty cut-and-dried, single-source. Inside Spotify I’m sure someone somewhere can get an authoritative answer on whatever the question is. But if you had access to playlists across multiple services, Spotify, Apple, whoever else, the semantics of each of those different enterprises would be slightly different. And in order to make valid and valuable analyses across that combined data set, you would need to regularize the meanings of those semantics. In my last startup at the Credit Junction, we had financial data reported from lots of different businesses using different accounting packages. And the meanings of assets and liabilities at a finer-grained level of invoices and inventory differed tremendously from business to business and report to report. So in order for us to be able to take advantage of the kind of data platform ability to make cost-efficient, fast, evolutionary analytics, we had to regularize them. We’re saying to operate on a set of data, it must be consistent or our answers are not going to be consistent. When your data comes from disparate sources, and those sources even can be within the same enterprise, you have to regularize them.
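The regularization step described above can be sketched in Python as a per-source mapping onto one canonical vocabulary. The package names, labels, and figures below are hypothetical and do not reflect the actual Credit Junction system.

```python
# Hypothetical per-source label mappings onto one canonical vocabulary.
CANONICAL_LABELS = {
    "package_a": {"A/R": "accounts_receivable", "Stock": "inventory"},
    "package_b": {"Receivables": "accounts_receivable",
                  "Inventory on hand": "inventory"},
}


def regularize(source, report):
    """Translate one source's report into canonical field names."""
    mapping = CANONICAL_LABELS[source]
    unknown = sorted(set(report) - set(mapping))
    if unknown:
        # Surface unmapped labels instead of silently dropping them.
        raise ValueError(f"unmapped labels from {source}: {unknown}")
    return {mapping[label]: value for label, value in report.items()}


a = regularize("package_a", {"A/R": 1200.0, "Stock": 300.0})
b = regularize("package_b", {"Receivables": 800.0, "Inventory on hand": 50.0})
total_receivables = a["accounts_receivable"] + b["accounts_receivable"]
```

Only after both reports share the same field names does a combined figure such as `total_receivables` mean the same thing for every business in the data set.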
You said it’s much better to have them be abstract, and now it seems like the semantics are at that same level of abstraction. If you can posit against a set of data a set of semantics, you are now separating a lot of that meaning that is normally embedded in code. You can have the code be less bound to the programmer’s understanding of that field name and you can have a description that is at a higher level.
You do have to separate and make trade-offs. To get to a normalized model, you are either choosing the least common denominator or you’re making certain interpolations or adjustments to the data to fit that model. You’re going to change both of those things over time. You’re going to evolve your model, and you’re going to change what you’re extracting from your sources. Modern data platforms, in my mind, expose the economics of each of those changes over time in ways that they are both incremental, so they’re faster, they’re cheaper, and they’re more transparent. But by doing that, it means that if you’re now thinking as an organization, even the subset of your organization that deals with this data platform, you can now respond to change, new requirements, other opportunities in a way where you have a new language. Your language says, “Oh, we don’t have to go deal with that big, scary, custom-built ETL pile of code, we actually understand that there is an engineered process with a discrete unit of work that lets us adjust the transformation to get a little bit more resolution out.” You can answer questions like, “How do you trace back from your last mile model and say, Okay, why should I trust that this is the revenue per customer? How do I know that? Where did this come from?” And obviously this is what I meant when I said there’s always more we can do because of that kind of visibility and traceability.
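The kind of traceability mentioned above, asking where a last-mile figure came from, can be pictured with a small lineage walk. The dataset names and step descriptions in this Python sketch are hypothetical; a real platform would record lineage as a byproduct of running the pipeline rather than in a hand-written dictionary.

```python
# Hypothetical lineage: each dataset lists its direct inputs and the
# transformation step that produced it.
LINEAGE = {
    "revenue_per_customer": {"inputs": ["orders_modeled", "customers_modeled"],
                             "step": "aggregate revenue by customer"},
    "orders_modeled": {"inputs": ["orders_landed"],
                       "step": "clean and type orders"},
    "customers_modeled": {"inputs": ["crm_landed"],
                          "step": "deduplicate customers"},
    "orders_landed": {"inputs": [], "step": "ingest from ERP export"},
    "crm_landed": {"inputs": [], "step": "ingest from CRM export"},
}


def trace(dataset, lineage, depth=0):
    """Walk upstream from a dataset and list every step that fed it."""
    node = lineage[dataset]
    lines = [f"{'  ' * depth}{dataset}: {node['step']}"]
    for parent in node["inputs"]:
        lines.extend(trace(parent, lineage, depth + 1))
    return lines


report = trace("revenue_per_customer", LINEAGE)
```

Printing `report` line by line gives an indented path from the purpose-built figure back to the landed sources, one discrete step at a time.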
What you’re saying is that you want to have a pipeline that has as many different stages in it that can separate the work to natural groups so that if it comes time to change something, you’re changing it within a smaller group or a smaller object. My vision of the semantics is that it’s toward the end of that pipeline that you’re starting to have a model, and then the semantics are describing the relationships between the data in that model so you can start making inferences about if I know you have two children, then I know that two children of the same parent are siblings.
Yes.
That’s not in the data anywhere. But it’s in the semantics.
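The sibling example can be written down as a small Python sketch in which the data holds only parent-child links and the sibling relationship comes from a rule. The names and the `infer_siblings` function are hypothetical.

```python
from itertools import combinations

# Stored facts: only parent-child links exist in the data.
parent_of = [("pat", "ann"), ("pat", "bob"), ("lee", "cy")]


def infer_siblings(parent_child_pairs):
    """Rule: two distinct children of the same parent are siblings."""
    children = {}
    for parent, child in parent_child_pairs:
        children.setdefault(parent, []).append(child)
    siblings = set()
    for kids in children.values():
        for a, b in combinations(sorted(kids), 2):
            siblings.add((a, b))
    return siblings


siblings = infer_siblings(parent_of)
```

No sibling record appears in `parent_of`; the pair is produced entirely by the rule, which is the sense in which the relationship lives in the semantics and not in the data.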
With the caveat that my real-world experience shows over and over again that when you get to the point of having that in your model and you want to make that inference, you discover just where the weak spots, the gaps, and the inconsistencies in how you got to that data live. And then it’s a process of going back and either improving the source or improving the transformations. It’s not just that I want to split them all out. I’m splitting them out because there are now open source and commercial products that commoditize the delivery of that functionality.
So you want productization.
I do and I’m being irresponsible if I’m sitting down and writing something cool when there are cheaper, more reliable, more mature solutions out there. Your ability to cost-effectively scale, reliably scale and apply all the tools that we’ve learned from migration to the cloud to this problem, is extremely compelling.
To what extent have you seen this architecture blossom in practice and provide value? The theory is compelling, but in practice it’s really hard to see it come out.
You’re right, and you and I both earned our gray hairs by learning this lesson the hard way. The promise of technology is always greater than what it can deliver in the time frame everybody thinks it’s going to deliver. The long-term impact is greater than we anticipate, but it takes a lot longer. In this case, I would focus on certain very specific benefits that I’ve observed. The most successful migrations of analyses that I’ve experienced directly, from what was done in a desktop spreadsheet to a cloud data platform, are the ones where that spreadsheet was what ran a chunk of the business. At that point you have a clear specification for what the analysis is. And you also have a compelling business case in terms of human effort. If we can automate things that take a human a couple of hours, that begins to pay for a fair amount of investment.
With Warby Parker, I wrote a story about its transformation, where they were running the entire business on spreadsheets that would extract data from their ERP system and then do a variety of transformations on it using VLOOKUPs and things like that. Eventually it got to the point where the spreadsheets could no longer process the data in the time allotted and they would break often. So it became necessary to move that into a data warehouse, and they used Looker to do that. The problem was that as soon as they started doing that, they found that there was a spreadsheet in customer service that identified the customer one way, a spreadsheet in finance that identified the customer another way, and a spreadsheet in marketing that did it a third way, and each of them had a very rational definition of the customer. They realized they had to create a new semantics so that they could describe very clearly the customers that needed to be used inside the data warehouse, so that when they had a report you could say, “Oh, this is the finance customer,” or “This is the customer service customer.”
I think that’s a good example of when you had such a clear and compelling business case that it justified the considerable investment to do that work. Because you now have different tribes who have their own definitions and you’re going to get them all to either speak the same language or agree to a transformation to a normalized model. A lot of mistakes that we’ve seen are where people try to do that work without the compelling business case, at which point you’re kind of doomed to failure because it’s going to be painful, it’s going to be expensive, and what it produces isn’t going to have clear and compelling business value.
In your consulting practice right now you are focused on every engagement over and over in building this modern data platform. When you come in to a client and they ask, “Dan, what should we be doing?”, how do you explain the value of what you’re building when a modern data platform is the right choice?
Any time you have to convince somebody of the value of making a change, you’re in trouble. What I’m doing is listening. You say, “Do you have business-critical decisions that you need to make that depend on understanding data from a combination of systems, some of which might be yours and some of which might be third parties that you use on a subscription basis? Are you able to pull that data together today? Are you able to run the analyses you need? Is the time required and the cost of doing that acceptable, or is it limiting the business from responding to opportunities and challenges?” I have yet to meet a CEO or a CTO or a COO who says, “Oh, yeah, I can get answers to whatever I want, whenever I want, as often as I want, as cheap as I want.” It’s never going to happen. If you talk about it and they have several other situations that leverage the same data, like in a bank, where some of it could be in their core banking system but it also feeds regulatory reporting and customer profitability calculations, you start to see this concentration of potential business value that you can unlock. If you can make the data available, make the evolution of new analysis cost-effective and fast, and not require PhD-level scientists to do so, then you’ve got a shot at doing it. The truth is that for most, there’s a lot of low-hanging fruit: honestly, just getting the data and being able to run basic, rectangular queries against unified data, with confidence that the data is right, yields tremendous value.
The idea is you don’t need to do an AI/ML data science implementation, you need to finish your data warehouse implementation and do a good job of it and then that provides the foundation?
The job is to find which of your data points really matter.