Why Data Supply Chains Matter: A Q&A with TreeHive’s Donald Farmer

The data supply chain has been a central focus of Early Adopter Research and Dan Woods. Recently, Woods had the chance to speak with TreeHive Strategy’s Donald Farmer for the Early Adopter Podcast. Woods and Farmer have worked together in a variety of capacities in the past, but their discussion for the podcast focused on data and analytic strategy.

Woods’ theory of the data supply chain is that companies are entering a world in which people will use products to create their infrastructure. He has asserted that most enterprises are in the business of creating a product-based platform, and when that vision is applied to analytics and AI, companies are going to select products and become masters of using them. There will be two types of products: those intended to build analytics and data infrastructure, and those intended to do something important for the business. In both areas, companies will have to answer the question of how they are going to differentiate themselves. Likely, they can become better at using a productized AI or a productized analytics and BI infrastructure by mastering their data and building and running a data supply chain. That data supply chain allows companies to find all their data, clean it up, prepare it, move it, combine it into the different forms that will support workloads, and then deliver it to those products. This leads to a better result both from analytics tools and from the products companies buy that rely on data and use AI.

That was the starting point for Woods’ and Farmer’s conversation, and what follows is an edited version of that discussion.

Woods: What is your reaction to my theory that the data supply chain will be a differentiating factor in whether companies succeed?

Farmer: I have a long history in data and analytics. Most recently, I was with Qlik. I was with Microsoft for about 10 years building data and analytics tools there. And now I’m independent, helping companies with their data and analytics strategy. I think the idea of using tools is very important. I think the supply chain is critical. My theory is that every company I talk to is a data company, whether they know it or not. Most of them know that they’re a data company. If you’re a data company, you need to be an analytics company. If you’re an analytics company, you really need to be looking at being an advanced analytics company in order to get an advantage out of the data. So my work involves helping companies move from being a data company to an analytics company to an advanced analytics company.

What’s interesting to me is that I believe you could create a data relevance score for different companies. At the top of the scale are the companies where data can not only inform decisions, but also create models that can control business behavior. Do you have different advice for people based on the idea of a data relevance score?

Yes. But I push back on the idea that there’s one score. What you’d find in a company is that different departments or different divisions may have radically different scores. And that’s important to understand. For example, the creative team in a fashion company may have a pretty low score when it comes to data relevance and data literacy, because they see themselves as primarily not working in that space. On the other hand, consider the supply chain team, who deal with sourcing the materials, with the outsourcers who will actually make the products, and with the shipping and logistics companies that get things into stores on time, because fashion is a very time-sensitive business. Those people may be radically informed by data, and may be deeply engaged with the data and deeply data literate.

So it’s not one score, it’s a mix of scores. There are some parts of the company which are deeply informed and other parts which aren’t. So part of the secret, from my point of view, is finding not where the company is as a whole, but where it is in different sectors and different divisions, and then working out where each of these divisions should be. So it’s not as if there’s a single roadmap for the company. There’s a set of roadmaps, which of course should then form a coherent strategy as a whole.

What do you do when you arrive at a company and what are the phases that you go through in trying to help them?

The very first phase is actually just helping to identify the data. What data do you have, who’s using it just now, who could be using it, and is the data fit for purpose? That’s pretty important. And for me, the critical part of this first stage is enabling a change in mindset, and this relates to your data supply chain concept: a change in mindset for the IT department so that they are no longer data gatekeepers. Instead of just being the people who control access to the data, the IT department has to proactively supply data which is fit for purpose. By doing so, not only do they enable the business users to take full advantage of the data, but to a certain extent they relieve themselves of the burden of having to manage every aspect of this. Because by creating data which is already fit for purpose and has already been through all the various corporate governance controls, they can throw it over the wall to the business users and say, “Have at it, do what you need to do, because we’ve done our job of making sure that it’s clean, making sure that it’s relevant, and making sure that it meets the various security and compliance requirements.”

I think in almost every data pipeline that you see, there are several levels of data. There’s the raw data in the source. Then that data is somehow moved or made accessible to an environment in which it can be used; once it’s either moved or made accessible, I call that the landed data. Then you go from the landed data to creating coherent, useful objects, which I call the model data. Those objects are not intended for a particular purpose. They’re intended to be generic, useful concepts. These become the verbs and nouns, the language that you use to describe the business. There may be multiple objects in that model form. And then you go from that to the various layers of purpose-built data. It seems like what we’ve been trying to do in a lot of these efforts is to pull back and allow the end users, or people close to them, to go deeper into the data and get as close as possible to the modeling process. Sometimes that works and sometimes it doesn’t. What I’ve found is that there’s an inherent complexity to a lot of this data work that really is computer science work. So how do you advise people about this tradeoff of where to put these layers of transformation?
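As an illustration of these layers, here is a minimal sketch in Python with pandas; the file path, column names, and layer boundaries are assumptions made for this example, not details from the conversation.

```python
import pandas as pd

# Raw data: the untouched source extract (hypothetical path and schema).
raw = pd.read_csv("exports/orders_raw.csv")

# Landed data: moved into the analytics environment with minimal change,
# just typed consistently so downstream steps are reproducible.
landed = raw.copy()
landed["order_date"] = pd.to_datetime(landed["order_date"])

# Model data: generic, reusable business objects (the "nouns"), not tied
# to any one report. Here, one row per customer.
customers = (
    landed.groupby("customer_id")
    .agg(
        first_order=("order_date", "min"),
        total_spend=("amount", "sum"),
        order_count=("order_id", "count"),
    )
    .reset_index()
)

# Purpose-built data: shaped for one specific workload, for example a
# retention dashboard that only needs repeat buyers.
repeat_buyers = customers[customers["order_count"] > 1]
```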

Well, this is where the second phase of what I do comes in. The first phase is identifying the data and handling this change in mindset from being gatekeepers to shopkeepers. The second phase is that if you’re a data company, you need to be an analytics company. An important part of being an analytics company is understanding the tools that are available to you and which tools are appropriate for the business. If you’re dealing with people who are business literate but not data literate, then you have to have a tool which enables them to become data literate: a tool which lets them work in business terms while it handles a lot of the data capabilities for them. Luckily, the BI platforms are getting more and more complete. They’re getting more effective for a business user doing analysis. In fact, they can guide the user through some analytics and some capabilities, because you don’t want the user to have to worry about things like the cardinality of joins or complexities like that. The BI tool should handle that.

So the second phase, for me, is very often as you move from being a data company to being an analytics company, what are the tools and platforms that are going to be relevant to your business? Now, sometimes that’s one tool, and for some companies, that could be one tool that covers a wide variety of scenarios. Other times, it’s more than one tool. Sometimes finance needs different tools from a creative team. And that’s fine, so long as that’s done in a governed and managed way and is done with purpose, not just because it randomly happens.

Have you seen that most people who you consult with end up with some sort of cataloging layer or do they end up with multiple cataloging layers?

Increasingly, they’re ending up with a cataloging layer. A few years ago we would have talked about an MDM, a master data management layer. I guess we’d also have talked about having a data warehouse. But increasingly, catalogs are taking on that role. Very often, the data warehouse is there not only to provide access to data and its metadata, but also to impose a model. Catalogs don’t so much impose a model. They don’t necessarily tell you how all this data relates in terms of calculation, but they do provide you with the business metadata that helps you to understand the data and its potential uses. So that’s very, very powerful. Do I see multiple catalogs? Well, yes, but not in companies which have taken governance seriously. One of the issues about governance that people overlook is that governance isn’t about getting the right answers, it’s about using the right processes.
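What a catalog contributes is business metadata rather than an imposed model. Here is a sketch of what one catalog entry might record; the field names and values are hypothetical and not taken from any particular catalog product.

```python
# A hypothetical catalog entry: business metadata that describes a dataset
# and its permitted uses without imposing a calculation model on it.
catalog_entry = {
    "dataset": "sales.orders_landed",
    "description": "One row per order line, landed nightly from the ERP.",
    "owner": "data-engineering",
    "steward": "jane.doe@example.com",
    "refreshed": "daily at 02:00 UTC",
    "quality_checks": ["no_null_order_id", "amounts_non_negative"],
    "allowed_uses": ["finance_reporting", "supply_chain_analytics"],
    "restricted_fields": ["customer_email"],  # a governance rule, not a model
}
```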

So give me an example of the bad governance and the good governance.

Right. Let’s say we want to do a targeted campaign for a particular group of customers. So good governance would be, “Here’s the people we’re allowed to target. They’re permissible, they’re well-chosen, and we’re allowed to contact them and run that campaign built around that.” Bad governance would be, “Hey, here’s a really effective campaign we can build with a great data model, but you know what, we just grabbed that data out of the system and nobody really governed whether it was allowed or not.”

On the one hand, you may have a very accurate model precisely because it’s got information that you’re not supposed to be using for marketing: you’ve added more attributes, maybe attributes about gender or race, or all sorts of other attributes that are not permissible for you to use. You’ve built a better model, but you’re not allowed to use it. And if it turns out that the model makes some mistakes, or somebody discovers you’re using it, it could be trouble for your company. So the governance is poor, even if the model is good. On the other hand, you may have a model that is only reasonably good but well-governed. In business terms, it would be better for you to use that.

Got it. One of the things that I’ve been trying to understand is the level of productization of some of the tools and analytics that you get. I created a framework for productized analytics in which the most productized form of an analytic is the “value meal,” like at McDonald’s. You go and you say, “I want this,” and you get it. The second level is the artisanal brew. This is going to Starbucks and saying, “Hey, I want my skinny decaf latte with extra foam and high temperature.” Below that, you have dinner in a box. This is like Blue Apron: you get the ingredients, you get guidance about what to do with them, but you assemble them and finish the job. And the final level is the custom kitchen, where you’re given a platform with raw materials and could do anything. It seems that in most organizations the highest value comes from really good value meals that reach a lot of people who were previously not using data. But product development and product management are hard. So how do you advise people about where to spend their time on these different levels of productization, and how to gradually grow competency in understanding where each level applies to their problems?

That’s great. This really comes into the third phase of what I do, which is when you start looking at advanced analytics. One of the key elements of advanced analytics is to understand the difference between tools and techniques. What I’m typically looking at there is your data teams, especially your advanced analytics teams: what are they going to be focused on? Do they want tools or techniques? Many data scientists are focused on techniques. They’re focused on the algorithms, and they don’t particularly care about the tools. I’ve met people at companies that are highly competitive with Google who use Google’s TensorFlow, because it’s a great algorithmic set for particular forms of learning. They will use it even though they see Google as a competitor, because it’s the best tool for the job. So they’re focused on the technique. Rather than saying, “I want to choose a tool,” they say, “I want to choose a technique and whatever supports it.” But at different levels of skill, having a tool which encapsulates a set of techniques can be very valuable. You may know the techniques, or you may know of the techniques, so you understand regression and you understand clustering, but that doesn’t necessarily mean you are fully able to use them in a freeform way.
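For readers who want to see the distinction, the two techniques Farmer names, regression and clustering, take only a few lines with a library like scikit-learn; the synthetic data and parameters below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for real observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Regression: fit a linear relationship between features and a target.
reg = LinearRegression().fit(X, y)
print("coefficients:", reg.coef_)

# Clustering: group the same observations into three unlabeled segments.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```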

It seems to me that in these evaluations, you have to be sophisticated enough to go, “Hey, I’m not going to have a coffee with half-and-half and lemon juice,” and know when it’s not going to work out.

One of the issues with products is that whenever you build a product, you’re always enabling some scenarios and closing off others. In fact, that’s almost the nature of productization: finding where the user is going to get the best value and focusing on that path. Very often, that requires you to close off other paths so the user can’t go there. One of the key things for defining a product is the negative space: what are you not going to do? Because you can do anything, your strategy often has to be built around defining the negative space where you’re not going to go.

Let’s put some meat on the bones of this idea. What’s a good positive and negative example of choosing and analyzing negative space?

For example, take a business user tool focused on analytics. Increasingly, these have some basic data preparation capabilities built in. But that’s not the same as a full ETL tool: a tool which is capable of running multiple parallel streams, of optimizing how data is integrated across multiple servers, of performing very high levels of aggregation and scaling that out across multiple servers, and of scheduling that process, restarting it if it breaks, and handling all sorts of error conditions. That’s a very sophisticated tool. There’s a really important need for it in enterprise data management, but it’s not the tool you put on a desktop for somebody who just wants to pull a few data sources together and get analyzing. On the other hand, you might be looking at ETL capabilities for your enterprise servers which are really sophisticated. Do they have an easy-to-use business user interface? No. Well, that’s fine. You don’t need to go there because it’s not intended for a business user, and that’s okay.

When you get to the point where you’re trying to introduce advanced analytics to a wider group in a company, you have this sense of magic: you’ve created a system that now gives me a prediction or a recommendation, some sort of important signal that somebody has to have confidence in and act on. And I find that many of these systems do not have much transparency. This is a problem that’s being worked on quite aggressively in AI, but even if you have a system that is theoretically transparent, the actual mechanisms are really hard to explain. How do you build trust in these advanced analytics so that they actually become deployed and used?

So the trust question is a very important one, and I think there are three ways in which trust can be built. A friend of mine, Scott Davis, once said, very insightfully, that people don’t trust data, they trust other people. What he meant by that is that if you get data in a report, you don’t analyze every calculation in there. You don’t track its lineage back to the original SAP system or whatever. You trust that the person who built that report did a good job. Or you don’t, in which case you use another report. But it’s ultimately a human trust. Now, if you apply that to machine learning, how do you trust it? To my mind, there are three ways. The first is if you actually get the technical explanation, which in some cases is super difficult to do. How do you explain how TensorFlow came to a specific prediction? Very, very difficult. But in some cases algorithms can be explained, and there are various techniques for explaining them.

The second is what I call a relative explanation: given a set of similar circumstances, this is where you fit in. For example, somebody complains that the computer has said, “No, we’re not giving you the loan you applied for.” How would you explain that? You might not be able to explain the algorithm, but you could run a Monte Carlo simulation, randomly varying all the inputs, all the variables that were assessed, and say, “In 40% of cases you got the loan, but in 60% you didn’t.” That doesn’t tell you why you didn’t get it, but it does show that, generally speaking, you were not eligible for this loan.
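A relative explanation of this kind is straightforward to sketch. Assuming a hypothetical loan model exposed as a predict function that returns 1 for approve and 0 for decline, and assumed plausible ranges for each input, the simulation might look like this:

```python
import numpy as np

def relative_explanation(predict, feature_ranges, n=10_000, seed=0):
    """Approval rate over random perturbations of all assessed inputs.

    `predict` is a hypothetical model function taking a dict of features
    and returning 1 (approve) or 0 (decline); `feature_ranges` maps each
    feature name to a (low, high) range to sample from.
    """
    rng = np.random.default_rng(seed)
    approvals = 0
    for _ in range(n):
        # Randomly vary every variable the model assessed.
        sample = {
            name: rng.uniform(low, high)
            for name, (low, high) in feature_ranges.items()
        }
        approvals += predict(sample)
    return approvals / n

# Illustrative use: "in 40% of similar cases you got the loan."
# rate = relative_explanation(loan_model, {"income": (20_000, 150_000),
#                                          "debt_ratio": (0.0, 0.6)})
```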

The third way of trusting something is actually more related to human trust. You effectively develop a theory of mind about how the machine is working, in the same way that I have a theory of mind about how you, Dan, work: I understand the way you think, and therefore I’m likely to trust or not trust various kinds of predictions you might make, because I know how your mind works. In the same way, just through experience, we’ll trust some of these systems because we learn how they work under different circumstances. I don’t understand, for example, how Google Maps’ algorithms work. I have no idea. But I know my neighborhood pretty well, and I kind of trust it because I trust the results it gives.

I’ve heard Marketo say about their lead scoring, “We don’t have to worry about explaining how the lead score was derived. All we have to do is put the score on each lead and let the salespeople see whether it matters or not.” And frankly, most of the time they realize that it does matter, and they start using it really quickly because they get better results with the higher-scored leads.

Right. And they learn to trust it in a sort of intuitive way. They understand how it works without understanding all the details.

So let’s just end with a speculative advice sort of question, and that is when you encounter a new client and you start talking to them, what do you find are the myths about data and analytics that you have to dispel quickly in order to start making progress?

The very first myth is that their data is good enough, or that they have the data they need. Very often, you dig into it and their data is poor quality, it’s not well-governed, and it doesn’t hold the insight they think they’re going to get from it. The second problem I come across is that very often they have a misconception about the role of data in their organization. They don’t think of themselves as an analytics company, or sometimes they don’t think of themselves as a data company. I was talking to a paint manufacturer recently who said, “We have data, but we’re not really a data company, we’re a paint company.” I said, “Yes, but you realize there’s a tremendous amount of value you can get from your data. In fact, data could almost be another line of business for you.” And they said, “How would that go?” Well, selling data back to your suppliers, for example. Or building a supply chain where you actually inform your suppliers with data about what you need, and therefore improve their efficiency, and then selling, monetizing, or in some way productizing the data you use with your supply chain as another line of business. That actually turned out to be a pretty successful project.

The other myth is about artificial intelligence. People see artificial intelligence as something that may happen in the future: “Oh yeah, we’re not quite there yet.” I keep having to tell people that you’ve got artificial intelligence in your pocket and you use it every day. It’s there on your cell phone. The idea that artificial intelligence is out there in the future and out of scope for you is simply not true. It’s a daily part of your life. It’s a daily part of your business, and the very fact that you don’t recognize that is getting in the way of optimizing your use of technology.