Third-Generation BI: A Q&A with Nick Caldwell and Frank Bien
On a CTO Club list that I’m on, a question came up. “What are you using for BI?” About ten or fifteen people said, “I’m using Looker with Redshift” or “I’m using Looker with Snowflake.” Looker has become popular in the cloud data warehouse space. I’ve been following Looker for a long time now, and recently have been considering the combination of Looker with new paradigms for accessing data, including Dremio.
It’s interesting that all of these new paradigms for accessing data are being connected to Looker.
I spoke with Looker’s CEO Frank Bien and Chief Product Officer Nick Caldwell recently and we delved into the topic of third-generation business intelligence.
Dan Woods: Nick, what is your role at Looker?
Nick Caldwell: I started as Chief Product Officer about a month ago. With that title comes responsibility for charting our strategic roadmap and building an organization that can execute on it effectively.
Prior to that, I was VP of engineering at Reddit, and before that, general manager at Microsoft running the Power BI organization. I was one of the founding members of that team and helped with most of the products in that family.
You’ve got a good 360 view of the BI space, both as a user and as a vendor.
Caldwell: At Microsoft, I had a very inward-looking view of the BI market. The answer to many of the questions we would get from customers was, “Buy more Microsoft products.”
When I left Microsoft and went to Reddit, I was responsible for building their BI practice from the ground up. I was exposed to many of the technologies that you mentioned: Snowflake, BigQuery, Dremio, etc. This changed my perspective dramatically.
I think what you’re seeing in the marketplace right now is rapid advancement at the data layer, which is becoming incredibly powerful. And one exciting thing about Looker is that it’s well positioned to take advantage of that change. It is ecosystem agnostic. It’s flexible. The LookML modeling layer does a phenomenal job of not just creating a governed data model, but also ensuring that the resulting queries are performant. Many other vendors haven’t made that investment.
Frank, why don’t you explain third-generation BI analysis?
Frank Bien: I came at this problem from the big database world. I was at a company called Greenplum, which was one of the big MPP databases, and I had played in other big database worlds before that. And when I met Lloyd Tabb, the founder of Looker, and saw what he built and the modeling language, which is where the core IP started from, I said, “This is it. This can unlock the value of these big, fast databases.”
I think unlocking the value is what we’ve seen with our customers over the last few years, from early ecommerce companies like Deliveroo and BuzzFeed to bigger companies like Amazon, Cigna and Verizon.
Why have we seen this success and why is it interesting? It’s the idea of the third wave. BI and analytics companies started with companies like Cognos and BusinessObjects. Those were awesome monolithic platforms that allowed us to collect data and present it in the form of reports to users at a time when databases were very expensive.
You had to be careful about the kind of load you put on your databases, so you used business intelligence tools. They offered a platform that could suck data out of the database and put it in a cube or a cache and run analytics on top of it. And you were careful as you moved that data around, because data has weight, and systems and networks were slow back then. And then those platforms started to become a little more onerous, and although they were very accurate, they weren’t empowering the masses to operate on data.
So we saw the second wave, the proliferation of self-service point solutions, including tools to do visualization and tools to do data wrangling and data cataloguing. And we put those components in the hands of end users, and as long as they had their data and could put it in these kinds of systems and understood the information, they could get a lot of value out of it. That really helped people become a bit more data-driven.
But I would argue what happened from that second wave was a lot of pieces of a solution but not a solution itself. We really ended up in a bit of a mess, and when you went into an organization you saw data brawls. A company would have three or four departments all using data, and they would describe the same data metrics differently. “What’s the lifetime value of a customer?” One person would say $14,000, and the other would say $52,000. They were calculating the business logic completely differently and that was a problem.
What I saw with the early databases was that we were moving out of a formal data warehousing world into a world where any size company could collect and store all of the business data that they had in their organization in a much more efficient, effective way. And that came with cloud databases and with Hadoop, and the big data craze that was happening.
What we’ve seen in the last couple of years is large organizations being able to collect and store vast amounts of information, but they didn’t have tools to put on top of that. The tools were still operating as point solutions.
So wave one was the data warehouse. Wave two brought special purpose tools for different parts of the data supply chain, with tools for data discovery like Qlik or Tableau as well as tools for data wrangling and a variety of ETL processes. What are the characteristics and enablers of the third wave?
Bien: The third wave is about data infrastructure that is inexpensive and easier to deploy. Now I can collect and store petabytes of structured information really easily. I saw it first with Hadoop, but what really blew the doors off was the cloud database push. First it was Amazon Redshift and Snowflake and Google BigQuery and Microsoft. But we really started to see the ability for companies to start to mirror all of their Salesforce.com tables and the data from your marketing automation systems like Marketo and support systems like Zendesk. All of the data from these internal systems could be landed in these big, fast databases. But the nuance was, to really get the value out of it, these systems needed a different tool on top.
What I had seen early at places like Greenplum was transformation at query time. So you could leave the data in a mess in the data lake, and then right before you wanted to get value from it, you could make sense of it and create the data model on the fly to transform that data late in the process. And that was much more agile, because the first and second wave were about heavily manipulating data like plumbing and the supply chain was really difficult. But the promise of big data was a much more agile world.
Looker is a platform to go on top of these big fast databases and make sense of all that data, adding agility, so companies can do that late binding transformation.
That started to solve two key problems. People could rely on the information because it was a single source of truth. We had a big data model not on a silo of data or a bunch of different data models, but one big cohesive model to describe all the data in this data lake.
And then what we provided on top, through our visualizations and our BI application, was access so people could get at the data. By solving the underlying problems of access and reliability, we now saw people using data in a much different way.
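The query-time transformation Bien describes can be illustrated with a small sketch. This is not Looker or Greenplum code. It is a minimal Python analogy in which raw records are landed untouched and a model is applied only when a question is asked. The record layout, the field names, and the `MODEL` mapping are all hypothetical.

```python
import json

# Hypothetical raw events, landed as-is: inconsistent field names and types.
RAW_EVENTS = [
    '{"user": "a1", "amount": "20.00", "ts": "2019-01-03"}',
    '{"user_id": "b2", "amount": 5, "ts": "2019-01-04"}',
    '{"user": "a1", "amt": "10.00", "ts": "2019-02-01"}',
]

# The "model" is defined late and applied at read time. Changing it does
# not require reloading or rewriting the raw data.
MODEL = {
    "user_id": lambda r: r.get("user_id") or r.get("user"),
    "amount": lambda r: float(r.get("amount", r.get("amt", 0))),
    "month": lambda r: r["ts"][:7],
}

def query(raw_lines, model):
    """Parse raw records and apply the model only when a question is asked."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: extract(record) for name, extract in model.items()}

def revenue_by_user(raw_lines, model):
    """Answer one question (revenue per user) on top of the modeled rows."""
    totals = {}
    for row in query(raw_lines, model):
        totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + row["amount"]
    return totals

totals = revenue_by_user(RAW_EVENTS, MODEL)
```

If the business later decides that `amt` should be treated differently, only the `MODEL` entry changes; the landed data stays where it is, which is the agility argument made above.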
Let me just sum up what you said. First of all, in the run up from the wave one, the data warehouse, and the wave two, the data discovery world, you had the ability to do things like schema on read. But a lot of it was about creating a purpose-built extract for a specific dashboard or a purpose-built extract for an analytic. That purpose-built extract really only had the scope you needed for that specific dashboard. It didn’t have the ambition or the ability to actually solve the problem and create a reusable model. It was very similar to what happened in the era of the data warehouse, where you would extract into spreadsheets and work there.
The difference in the Looker solution is that now you can create several different layers of a model. You can land the data and create landed data in tables. You can, on top of that, create reusable objects of various types, and then on top of that, you can create purpose-built specific models. But all of those become reusable and they’re not tightly linked to one dashboard. Thus, when you start entering the model, you can explore it. You can go wide, you can go deep, and you can go far deeper and far wider than you could in a world of extracts. And of course that changes how you use data.
Bien: Exactly. I think it’s a nuance, but when I’m operating in silos, every time I want to change the subject of what I’m asking a question about, I’m resurfacing and going to a different world. And it becomes problematic, and those worlds don’t connect very well.
Think about how you work when you do your job. If you’re a purchaser at Walmart, responsible for a certain type of product, you want to understand supply chain information. In Looker, I’m looking at that kind of information. But then I want to start to see sales data about that, so then I’m looking at sales data and bouncing around, like Google Earth, where I can zoom in and zoom out and move over and zoom over on a different geography.
And I can do that when I have one large data model looking at lots of information. The whole world of BI in the first and second wave was built around silos. But what happened with this advent of these big fast databases is I don’t have to be locked into that thinking anymore. That’s really where Looker is unlocking a lot of value: in the ability to look at much larger, broader sets of information and make use out of them in a more meaningful way.
What’s been most surprising when customers start understanding this?
Bien: It’s been fairly consistent, and it’s one of the biggest surprises that I had personally at Looker.
We were building data models, pretty technically deep stuff. And almost universally our customers come back talking about how Looker has changed their culture or made them more data-driven. It’s surprising that a fairly technical product is solving issues of access and reliability and getting data into the hands of regular people in the business — not the sophisticated technologist.
Our customers come back and describe how we got them over the hump of creating a data culture where people can answer any question and make better, more informed decisions in real time.
I’ve always thought that these technologies, when they’re successful, lower the curiosity tax. You don’t have to pay such a high tax to answer a question and so your curiosity can be rewarded more often.
You’ve seen a lot of technology implemented, and Looker obviously grows at some companies, and stalls at some others. What mistakes do you see people making? Because we’ve never had so much choice about different ways of creating what I call a data supply chain.
Bien: One mistake is they’re making the data supply chain too complex. Forget about Looker for a second. The promise of big data and cloud databases is we can approach the problem in a different way. The companies that are successful are internally approaching it much in the same way that a Google or a Facebook or an Uber has been approaching these data problems internally for a long time, which is to put a lot of data into systems that are big and fast and cheap, and then open them up broadly to people.
What I still see is lots of companies who mess with the plumbing and manipulate data heavily, so that when you change the question, you really have to go back and engage plumbers to change everything around again. They’re putting wave two tools on top of these big fast databases and they’re not seeing the value out of it.
Where we start to see value unlocked is when we take this broader approach and provide access across entire data lakes. Pretty consistently we see these old world/new world franken-stacks being created that are far too complex. The promise of big data and big fast databases with Looker on top is a much simpler stack and a much more efficient supply chain.
Another cultural issue a lot of companies aren’t ready for is the level of transparency and data access at places like Google or Netflix. A recent article in The Wall Street Journal about Netflix culture said that almost everybody can see almost all the data. There are very few companies where that’s true. Now, obviously, some of those companies are regulated, but even in those companies, there can be much more transparency. That’s a cultural value that is hard to change.
Bien: That’s the value that Looker started from. I don’t know if you know, but one of the founders of Looker, Marc Randolph, was the founder and CEO of Netflix. So he and Reed started Netflix. And what Marc brought to Looker early on was the idea that data should affect the culture.
I’ve always liked LookML. It’s a real computer language with concepts equivalent to object-oriented programming. You can create a data object, subclass it, and reuse it. You store it as source code in a repository with version control.
I had a great time talking to Warby Parker about their Looker implementation. They rounded up all of their different concepts including customer definitions and different ways of looking at the product. They coded all that up in LookML, documented it, and it became the foundation of their company.
Now, on top of that, you’ve put visualization and exploration. Underneath LookML, you’ve put a SQL query generation system that allows you to access data quickly and responsively as people navigate through it.
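To make the idea of a reusable, extendable data object that generates SQL concrete, here is a rough Python analogy. It is not LookML and does not reflect how Looker generates queries internally. The `View` class, its `extend` and `to_sql` methods, and the `orders` table and columns are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class View:
    """A reusable description of a table: named dimensions and measures."""
    table: str
    dimensions: dict = field(default_factory=dict)  # name -> SQL expression
    measures: dict = field(default_factory=dict)    # name -> SQL aggregate

    def extend(self, dimensions=None, measures=None):
        """Return a new view that inherits every field and adds or overrides some."""
        return View(
            table=self.table,
            dimensions={**self.dimensions, **(dimensions or {})},
            measures={**self.measures, **(measures or {})},
        )

    def to_sql(self, dimensions, measures):
        """Generate one SQL query from the field names a user selected."""
        select = [f"{self.dimensions[d]} AS {d}" for d in dimensions]
        select += [f"{self.measures[m]} AS {m}" for m in measures]
        sql = f"SELECT {', '.join(select)} FROM {self.table}"
        if dimensions and measures:
            # Group by the positions of the selected dimensions.
            group_by = ", ".join(str(i + 1) for i in range(len(dimensions)))
            sql += f" GROUP BY {group_by}"
        return sql

# Base object: defined once, stored in version control like any source file.
orders = View(
    table="orders",
    dimensions={"status": "status"},
    measures={"order_count": "COUNT(*)"},
)

# "Subclass" of the base object: the revenue definition lives in one place.
orders_with_revenue = orders.extend(measures={"total_revenue": "SUM(amount)"})

sql = orders_with_revenue.to_sql(["status"], ["order_count", "total_revenue"])
```

Because every query is generated from the same definitions, two departments asking for `total_revenue` get the same business logic rather than two hand-written variants.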
Nick, how would you describe the portfolio in the same sort of layers that I’ve talked about? What’s on top of the model, what’s the model like, and what’s on the bottom of the model?
Caldwell: Looker has classically been targeted at people writing LookML. That is the most powerful layer and everything is built on that. Moving up a layer, we’re increasingly trying to invest in data exploration, either ad hoc SQL queries or ad hoc visualization exploration. Increasingly we see even higher, more abstracted levels of use, with people who aren’t necessarily familiar with the data model, people who may not even be using dashboards. They’re just using data in their day-to-day lives. Very recently we’ve started to invest in what we call Looker applications, and these are purpose-built point solutions for people who in some cases have no familiarity with the guts of the system, but they are just trying to get something done. This could be experiences for people working in a factory who need to do an inventory lookup, or DevOps engineers who need to monitor real-time systems.
We’re also increasingly seeing data used inside other business productivity applications. One of our favorite examples is Slack, with a Slackbot where salespeople can look up, with a quick Slack command, statistics on sales for a particular customer.
Those are the layers I think about, and each one is targeted at a particular persona. The lowest level, LookML, is for folks really familiar with the underlying data, and then up a level are people who want to explore and slice and dice data, and then at the very highest levels, which constitutes the majority of users for the future, are people just trying to get work done.
This is very similar to a framework I came up with a year ago called the productized analytics framework where at the top you have a value meal, which is the most productized form — like going to McDonald’s and ordering something. Underneath it you have the dinner-in-a-box where you can make choices like, “I want my skinny half-caf, almond milk latte.” And then underneath that you have a custom kitchen where you are writing LookML. And so what’s interesting to me, and the thing that always bothered me a bit about Looker, is that LookML could generate SQL, but that SQL usually was linked to a certain database that had in it a range of tables. I often thought, how can we get beyond that? How can we get to a federated query? The problem is that the sophistication that you use to generate that SQL is really high, and you relied on a good engine to be able to execute that in the database.
It wasn’t until I started studying Dremio that I realized that by going against a system like Dremio, which allows you to put SQL on top of a very wide range of repositories and be able to JOIN across those repositories, the range of what you can model in LookML becomes vastly wider.
That’s true to some extent when you look at how things work in Presto, where you can have data that is in S3 objects of arbitrary format included in queries, or in Snowflake, which does the same trick and allows you to reach out to data that’s stored in JSON. So what is below the LookML layer to allow the range of data that can be included in a Looker model to be increased?
Caldwell: It’s funny you mention Dremio. They were at our JOIN conference a few weeks ago and I got some time with them and talked through their solution. That whole class of technology is pretty exciting. I think from a market perspective, people are increasingly throwing data into lakes. Additionally they may be purchasing SaaS apps, and behind those apps, data’s sitting in S3 buckets or in BigQuery. The idea that you could have a single unified query layer that would sit on top of those underlying lakes and just query in one place is very powerful. We are very interested in this and we have internal teams looking into technologies that will allow us to accomplish similar things. The point is it allows us to simplify how we treat the underlying data. We let the data sit where it is and then one level up, we’re clever about querying it.
On top of that, it allows you to insert an intelligent caching layer. In the future, as we develop these technologies, the layers above LookML all get a little bit more intelligent and a little bit more performant, without you having to do anything. From a customer perspective, it’s simpler. You let your data sit where it is. From an end user perspective, it’s faster. Your data is cached based on what you use, resulting in faster dashboards and faster exploration. All of those things are pretty exciting to me.
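The combination Caldwell describes, a single query layer over several stores plus a cache above it, can be sketched in miniature. This is an analogy only: it uses SQLite's `ATTACH DATABASE` to stand in for separate repositories and a plain dictionary as the cache. It is not how Dremio or Looker work, and the `QueryLayer` class, the `crm` and `support` stores, and the sample rows are hypothetical.

```python
import sqlite3

class QueryLayer:
    """One SQL entry point over several stores, with a naive result cache."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.cache = {}       # SQL text -> result rows
        self.cache_hits = 0

    def attach(self, name):
        # Each attached in-memory database stands in for a separate repository.
        self.conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")

    def run(self, sql):
        """Return cached rows when the same SQL was seen before."""
        if sql in self.cache:
            self.cache_hits += 1
            return self.cache[sql]
        rows = self.conn.execute(sql).fetchall()
        self.cache[sql] = rows
        return rows

layer = QueryLayer()
layer.attach("crm")
layer.attach("support")
layer.conn.execute("CREATE TABLE crm.accounts (id INTEGER, name TEXT)")
layer.conn.execute("CREATE TABLE support.tickets (account_id INTEGER, subject TEXT)")
layer.conn.executemany(
    "INSERT INTO crm.accounts VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
layer.conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?)",
    [(1, "Login issue"), (1, "Billing"), (2, "Export")],
)

# One query joins across both "repositories"; the data stays where it is.
TICKETS_PER_ACCOUNT = """
SELECT a.name, COUNT(*) AS tickets
FROM crm.accounts AS a
JOIN support.tickets AS t ON t.account_id = a.id
GROUP BY a.name
ORDER BY a.name
"""
first = layer.run(TICKETS_PER_ACCOUNT)   # executes the join
second = layer.run(TICKETS_PER_ACCOUNT)  # served from the cache
```

A real caching layer would also need invalidation when the underlying data changes and would likely key on more than the raw SQL text; the sketch leaves both out to keep the layering visible.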
Do you see any other priorities that are important?
Caldwell: I can give you an overview for every one of the levels we talked about. At the LookML level, we briefly talked about performance and the ability to query across multiple stores. One other thing I would call out is that as Looker gets adopted in the enterprise, people are coming up with very complex LookML models. We’re investing heavily in a model validator, so as your model gets more complicated, you can ensure that changes to the model don’t negatively impact any dashboards or reports that you’ve created on top of it. And the validator code is something we’re focused on optimizing so that larger enterprise companies with complex models don’t have to wait a long time to validate the work that they’ve done.
What kind of an error would a model validator catch?
Caldwell: When you change the underlying model and a dimension changes, a model validator might let you know which dashboards are impacted by the change.
You want to be warned about that. Looker provides this governed data layer, and one advantage of having that central layer is visibility into all of the dashboards and reports that any change will affect. If you’re using an extract solution, like Tableau or Excel workbooks, and you make a change in the definition of data in one place, someone has to manually track down all those changes. Looker lets you do all that in one place. But increasingly we also need to do that in a performant fashion because we’re seeing so much adoption in the enterprise.
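The kind of check described here can be sketched as a simple dependency lookup. This is not Looker's validator; it is a minimal Python illustration under the assumption that each dashboard lists the model fields it references. The field names, dashboard names, and both functions are hypothetical.

```python
# Hypothetical model and content, for illustration only.
MODEL_FIELDS = {"orders.status", "orders.total_revenue", "users.signup_month"}

DASHBOARDS = {
    "Exec Summary": ["orders.total_revenue", "users.signup_month"],
    "Ops Queue": ["orders.status"],
    "Cohorts": ["users.signup_month", "users.lifetime_value"],
}

def validate(model_fields, dashboards):
    """Return {dashboard: [missing fields]} for content that references
    fields the model does not define."""
    broken = {}
    for name, fields in dashboards.items():
        missing = [f for f in fields if f not in model_fields]
        if missing:
            broken[name] = missing
    return broken

def impact_of_removing(field_name, model_fields, dashboards):
    """List dashboards that would newly break (or break further) if one
    field were renamed or removed from the model."""
    before = validate(model_fields, dashboards)
    after = validate(model_fields - {field_name}, dashboards)
    return sorted(name for name in after if after[name] != before.get(name))

already_broken = validate(MODEL_FIELDS, DASHBOARDS)
impacted = impact_of_removing("users.signup_month", MODEL_FIELDS, DASHBOARDS)
```

With extracts scattered across workbooks there is no single `DASHBOARDS` registry to scan, which is why the same check has to be done by hand in that world.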
One other thing that we have to do at the LookML level is improve time to value. There’s power that LookML gives you, but the challenge is understanding that value quickly. It’s a steep hurdle to learn a new language to get that value. One thing I’m thinking about is how to shorten the time it takes to understand the value that LookML gives you from hours or days to minutes.
It really hit me when Lloyd, our CTO, sat down and in about three minutes wrote some LookML that replaced work that took my data engineering team at Reddit three weeks. It was a lightning bolt of insight. And if we can package that insight into a nice five-minute onboarding experience, I would love that.
Pushing up the stack a little bit, into exploratory experiences, I am of the opinion that we do great exploratory work for people who are familiar with the model, but I would like to make that even simpler, so that people who are coming into Looker through a dashboard or a shared link can also have a really pleasant and beautiful exploration experience.
And we’ll hit that on two angles. One is performance. I’m really pushing the team hard to triple down on how fast we load dashboards, and once they’re loaded, to have dashboards feel faster so you can slice and dice and cross-filter in place, rather than having to go to the explore views.
The other angle is beauty. We invested early on in the LookML layer, and now we’re adding in the BI layer. It’s very important that we offer dashboards and reports that look pleasing and are fun to share. So we’re going to also be investing in the quality of our visuals and the quality of our layout.
One of the things that’s most exciting about using LookML is that you’re never trapped in the visual. Some BI tools do a great job of summarizing data and presenting it visually, but then you go, “Hmm, let me look at that.” With Looker you can always get to the underlying table.
Bien: We built Looker with all of these components on top for governance and visualization and cataloguing and exploring, on top of these big fast databases. We address the full spectrum. If you think about data having a past, a present, and a future, BI has been about the past. It’s about looking in the rearview mirror and doing the analysis to see what happened last quarter and how we can do a little bit better.
More recently, people have been focusing on the future. How do we bring AI and predictive analytics into the toolchain? Looker’s done a lot of work with companies like DataRobot and BigQuery Machine Learning at Google as well as with others, seeing how that fits into the toolchain.
More and more, what we are finally seeing is the value being unlocked so that we use data in the present. Now that we’ve solved problems of access and reliability, we’re seeing people actually build lightweight applications that help them in how they do their jobs every day. That’s the trick: to actually put data into the hands of non-technical users so they can get value out of it, and at the same time service the data experts who want to do deeper analysis.
How are you supporting use of AI and ML outside the product? I saw a lecture by Peter Norvig where he talked about the importance of debugging data. In AI and ML systems, data is no longer just descriptive; it’s the source code that’s going to determine the behavior of the system and as such it needs to be debugged. If it has biases or errors that affect the outcome, you’re going to want to figure out what they are and how to remove them.
Bien: The issue with AI and machine learning is always source data. Data scientists routinely complain that they spend 90% of their time coming up with repeatable sets of data to feed into their models. And that, at the very core, is what we address. It’s this idea of coming up with information that can be fed into models, and then taking the results of those predictive analytics and feeding them back into the application that the business user is consuming. We’re trying to do the heavy lifting that data scientists are having difficulty with and give them tools to feed data into their models.
Also, there’s a big trend in data that people aren’t talking about much. For a long time we’ve talked about business intelligence and this horizontal, generic use case of data and of dashboards. But the world has changed, and there’s a new class of application emerging in the enterprise, and it’s the data application.
If you look at the SaaS companies that have grown quickly, more quickly than a lot of the BI applications, consider Mixpanel doing event and web analytics, or GA 360 at Google letting you look at ad information. Or New Relic and Splunk looking at IT and compliance data in big environments. Those are also data applications, and I see a blending of these worlds. I don’t know that the horizontal use case of business intelligence is a long-term thing. People will operate in these applications around data that feel more like how they’re working. Looker is the platform to build those applications. Our customers routinely build lightweight applications that wrap data around a specific problem like customer success or digital marketing. We focus a lot on simple use cases that people can get going with quickly.
In other words, you want to support advanced productization so it’s not just a simple value meal, but a whole environment, like Google Analytics, for a specific area of business activity.
Bien: Exactly. We want those users to be able to break out and go into another data application, similar to exploring data on the Web, where you can really quickly change your perspective.
Woods: Nick, what would you like to add?
Caldwell: BI is not just for data geeks anymore. In the past data engineers and analysts were geeking out over BI tools. Now it’s everybody. Everyone needs data to get their job done, and SaaS apps and point solutions are becoming the way to do that. For us that means meeting people where they’re at, and trying to break away from the idea that you have to go into a BI tool. We no longer have to convince people to use data in their day-to-day. It is just happening, regardless. That wave has crested. And we have to build the tools to meet people where they’re at.
Simple tools that get specific jobs done, as well as simpler and easier to use interfaces to put on top of data.
What do you have that enables voice interfaces?
Caldwell: We’re one of the few BI vendors that already has a semantic modeling layer, and that means that applying things like NL is very easy for us, relative to other vendors.
When you say NL, you mean natural language?
Caldwell: Yes, natural language. I use the acronym because for the last five years I have run teams that only used NL interfaces for data. And we did that without the power of a governed data model. That’s one more direction I think we could go in the realm of making it simpler and more accessible for people.
Nick, how do you see AI and ML being used inside Looker? You mentioned the idea of the guided experience, which is a very natural way to use AI and ML to understand what people looked at and then predict what they might look at next. And eventually, that can be used to make suggestions. A lot of the tools had this idea that they would be able to suggest data to people. But in practice, when I’ve talked to users, it really hasn’t worked out that way. What do you see happening inside the system with AI and ML to improve the product?
Caldwell: There are two ways to look at this. One is that LookML and Looker can be a more standardized way to bring together data from ML systems. That is to say, give your data scientists and machine learning developers one view of your business data to build against, and that can be used to build new models which feed back into Looker. That’s a low level use to build cleaner models and cleaner datasets.
But there’s a higher level use, which I think you’re getting at. If you were to read Gartner or any of the industry reports, they talk breathlessly about using ML to automatically look at data, understand it, and make suggestions. What those vendors fail to understand is that you must have some semantic understanding, like the purpose behind the data, before you can make reasonable suggestions. Otherwise, the best you can do is just say things like, “This data’s got seasonality” or “Here are some interesting outliers.”
Many vendors do that. But with Looker, because we are increasingly pushing in the direction of semantics and purpose-built applications, there’s a potential to provide valuable inbox analytics. If we know that the underlying dataset is HR, or that the underlying dataset is event-driven, then we can sit on top of that and provide valuable suggestions to help you get your job done.
So that’s a direction I would like to head going forward.
What you’re going to do is implement the equivalent of semantic standards like OWL to manage and declare ontologies on top of the data models that you have so that you can then use those for reasoning and power them with usage-based data to make them even better.
Bien: It’s not so much about the future as what we’ve already done. We’re one large environment with the equivalent of thousands of workbooks or spreadsheets in one heterogeneous environment. And we understand how users are interacting with the data as it sits inside of Looker, so today a company might implement a cataloguing tool to look at lots of different things.
In our world, we don’t have to do that to get the same benefit. We understand how people are using that data, so a lot of what we’ve done over the last 18 months is build a lot of this content organization and content suggestions into our product. If I’m in marketing I understand what the other people in marketing are using, and the most valuable metrics that they’re looking at. I understand the dashboards that the exec team is looking at and what key metrics they’re driving the business to, so I follow those. We see this ability to suggest data content, and get people using what is most relevant and really solve that cataloguing issue. That’s what we’ve already done, and you’ll see us do even more.