The Power of Spreadsheets: A Q&A with Sigma Computing’s Rob Woollen
In a recent episode of the Early Adopter Research (EAR) Podcast, EAR’s Dan Woods spoke with Rob Woollen, the CEO and cofounder of Sigma Computing, about designing enterprise platforms. Sigma Computing’s products can be a component inside an enterprise platform, as the company was created to bring the spreadsheet paradigm to the world of business intelligence and data analytics. This is an idea that has been explored by companies like 1010data, DataMirror, and to some extent Looker. In this edited version of the podcast conversation, Woods and Woollen talk about how Sigma Computing came to be and how the company can play a role in the modern data landscape.
Woods: First of all, why don’t you give us the elevator pitch for what Sigma Computing is doing.
Woollen: Sigma is a spreadsheet interface for your cloud data warehouse. Our goal is that you have an interface that you’re already familiar with and what’s novel about it is that everything you do in that spreadsheet interface is translated by Sigma into SQL. And so it runs against your existing cloud data warehouse. And our goal is that you can build even the most complicated queries in that interface. I believe we’re different from anything on the market today. And so while we do translate to SQL, our interface requires no modeling, it requires no knowledge of SQL, it requires no coding. And so our goal is that even a novice spreadsheet user can sit down and build a query that even the best SQL programmer in the world would be challenged to build by hand.
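To make the idea of translating spreadsheet actions into SQL concrete, here is a minimal sketch. It is not Sigma's implementation: the `Sheet` class, its fields, and the `orders` table are all hypothetical names chosen for illustration. The sketch only shows that column formulas, filters, and groupings picked in a grid-like interface can be assembled mechanically into a query.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:
    """A hypothetical spreadsheet definition (an assumption, not Sigma's data model)."""
    source: str                                       # warehouse table the sheet reads from
    columns: dict = field(default_factory=dict)       # column name -> SQL expression
    filters: list = field(default_factory=list)       # SQL boolean expressions
    group_by: list = field(default_factory=list)      # names of grouping columns

    def to_sql(self) -> str:
        """Assemble one SELECT statement from the sheet's columns, filters, and groupings."""
        select = ", ".join(f"{expr} AS {name}" for name, expr in self.columns.items())
        sql = f"SELECT {select} FROM {self.source}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group_by:
            sql += " GROUP BY " + ", ".join(self.group_by)
        return sql


# A user adds two columns, one filter, and one grouping; no SQL is typed by hand.
sheet = Sheet(
    source="orders",
    columns={"region": "region", "total_amount": "SUM(amount)"},
    filters=["amount > 0"],
    group_by=["region"],
)
print(sheet.to_sql())
```

A real translator would also have to handle quoting, nested formulas, and joins; the sketch leaves those out on purpose.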
How did you get the idea for Sigma?
I had spent a little over six years at Salesforce.com. And among my roles there, I was one of the CTOs and I led the platform product management team. And as you can imagine, of course Salesforce is a cloud company and has all the data in the world about customer relations, and service and sales activities of all of its customers. I remember thinking during my time at Salesforce that despite having all of that data, it was still so hard to make business decisions without resorting to spreadsheets and tiny bits of data on someone’s PC. And so after I left Salesforce, I spent some time as an entrepreneur in residence at Sutter Hill Ventures. While I was at Sutter Hill Ventures, I met my cofounder, Jason Frantz. He came from the systems side. He was an advisor at the time at Snowflake Computing, which was building a cloud data warehouse. And it was at that point that we really saw the mixture of these two worlds. We saw this amazing bit of cloud data infrastructure that Snowflake and BigQuery and Redshift were building and then we saw the set of business users who were having all of these questions they wanted to answer and were still using spreadsheets for it. And so at a very high level, the idea was really to try to combine those two worlds. And that’s really the problem that we’ve been trying to solve at Sigma: leveraging that cloud data warehouse but marrying it to that spreadsheet interface.
What I find interesting about that idea is that everybody’s attracted to the immediacy of the spreadsheet. The reason spreadsheets are so popular is that analysts can get their hands on the data and manipulate it using formulas and massaging it into just the right format they need to then do the analysis. But the problem is, it gets very complex very quickly. The Salesforce data model has lots of tables but then the Salesforce product brings them together into a very pleasing user interface. But if you looked at the Salesforce data model, it would be dozens of tables that would have to be connected to make sense. They’re designed that way for efficiency but it’s very complicated to create queries—if you have the raw data. You said that you’d really like to empower a user to be able to make a query that could be very powerful without having to know a query language. But how are you solving the problem of query generation without letting people know they’re actually doing it?
I think this hits on a critical point in the challenges that face a lot of the existing BI tools today. We’ve had tools that have gone all the way to the self-service side, where people extract data and play on their PC and that fails in the modern world because the data is out of date or is too big to fit on a PC. And then the reaction to that is to go all the way to the other extreme, where the only data is data that has been centrally modeled and you can’t change anything unless it’s already been built into that model. With Sigma, we’re really trying to say the right place is to be in the middle. We want that interface to be accessible, so the interface has to be something you already know how to use and you already recognize the power of that. And I believe the only interface for that is the spreadsheet. But it also needs to be trusted, centrally modeled and have governance and security. You have to balance those two worlds. And that’s one of our core philosophical beliefs that’s different from what I’ve seen on the market today.
What is your method for allowing people to deal with the complex combinations of data and bring them into this spreadsheet paradigm?
Our answer is no second class interfaces. Let me explain what I mean by that. In many other products, you have one interface where you can do all the powerful things and then you have a second interface where they let users explore or do little tweaks on top. We think that is fundamentally broken. Our goal is that we have one interface where you can do everything, meaning you can build out reusable spreadsheets that other people can leverage and build across. We build out reusable spreadsheets that you can compose on top of. So just like you can reuse a view in SQL, you can reuse a spreadsheet in Sigma as a join or as a source. To take your Salesforce example, I can build out a set of sheets that build out, say, common entities that people would want to use on my Salesforce data in that Sigma spreadsheet and then other people can leverage and build on top of that.
How would you do a two-spreadsheet join in a way that allows all the power and flexibility of the kind of things you get with SQL but not confusing the user and asking them, “Oh, do you want this in an outer join? Do you want an inner join?”
There are really two answers to that question. The first is that as an administrator, as a data person, you can actually set up joins like that and so on. For people that want to leverage your choice, they can just use that and go. If I have spreadsheet A and spreadsheet B, I could have spreadsheet C, which is a join of those two. The other side of this, again to go back to playing both sides of this spectrum, is that you want to enable those end users to experiment with other choices. What if they want to join it to data set D that you’ve not modeled? Or what if they want to combine the data in a way that they’re not sure is right yet but they want to experiment with it? And that is a lot of the magic of a spreadsheet: people being able to ask “what if?” questions. And I think a lot of the failing of BI to date comes down to this: how can you ask a “what if?” question with live data in a governed world? Sigma brings all the things that are great about a spreadsheet to the data warehouse world.
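The comparison to SQL views, with spreadsheet C defined as a join of spreadsheets A and B, can be sketched with plain SQL views. This is an illustration of the composition idea only, assuming an in-memory SQLite database in place of a cloud data warehouse; the table names, view names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER, name TEXT);
    CREATE TABLE opportunities (opp_id INTEGER, account_id INTEGER, amount REAL);
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO opportunities VALUES (10, 1, 500.0), (11, 1, 250.0), (12, 2, 100.0);

    -- "Spreadsheet A" and "spreadsheet B": reusable definitions over raw tables.
    CREATE VIEW sheet_a AS SELECT account_id, name FROM accounts;
    CREATE VIEW sheet_b AS
        SELECT account_id, SUM(amount) AS pipeline
        FROM opportunities
        GROUP BY account_id;

    -- "Spreadsheet C": a join set up once by a data team for others to reuse.
    CREATE VIEW sheet_c AS
        SELECT a.name, b.pipeline
        FROM sheet_a AS a
        JOIN sheet_b AS b ON a.account_id = b.account_id;
""")

# An end user builds on sheet_c without needing to know how it was joined.
rows = conn.execute("SELECT name, pipeline FROM sheet_c ORDER BY name").fetchall()
print(rows)
```

Because the views only read from the underlying tables, a user who experiments with a different join (say, against an unmodeled data set D) gets a separate definition and leaves `sheet_c` untouched.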
If I wanted to create a new join, what does the user see?
It’s giving the user a visual interface and giving them feedback. For everything we do in our platform, we look at, how could the system assist me? Say, for instance, I’m trying to do a join or a filter. For all of these types of things, we want to show you feedback on what’s going to happen when you do this. What will the data look like? What will happen? And so all these interactive steps give a much richer experience than just writing code and running the result and having it come back. It’s more of an interactive experience. Just like in a spreadsheet where you might iteratively work through things and play with the data, we’re trying to bring that same type of playful, iterative experience.
There are two types of companies I want to compare you to now. One is the data prep companies and the other is Dremio, which is a catalog and query and ETL delivery sort of company. Let’s start with the data prep companies. Paxata and Trifacta have done a lot of work to allow people to bring data in and their interfaces are all about cleaning data. Looking at columns, understanding bad values, replacing them, having a guided experience for that kind of cleaning so that you can then apply a set of rules to a bunch of data sets and then create a clean version of it. Now, with Paxata, they bring all the data in and clean that data right there. With Trifacta, they allow you to create a definition of the cleaning rules and then apply that as a batch job. I’m getting the impression that you are working on the data itself in place in Snowflake or in any of the cloud data warehouses. You’re not bringing all that data into a repository to work with.
Correct. In our environment, the data never leaves your data warehouse. We’re always running against the data in that single place. Now, I think the other industry trend that you’re sort of poking at here is that we’ve seen a lot of transition from ETL, where people would do a transform as part of moving the data into a data warehouse, to ELT, where the data is extracted and loaded first and the transform happens much later, or as the data gets loaded in. Often that data could be semi-structured, and there are lots of variations that we can explore on that path. But being able to do late-bound transformations and being able to iteratively work with that data is really a key part of this. The data cleansing and wrangling vendors have created a lot of interesting technology but I think the biggest challenge is that data’s not in a state of either being dirty or clean. You need to manipulate the data to meet your needs and every time you join in a new source or continue to work with the data, you might discover another transformation that you need.
As for Dremio, what they’ve done is they’ve created a technology that allows you to assert a table paradigm over a wide variety of data sources, whether they’re APIs, flat files, traditional data warehouses or data sources and they make them all look like a table. Then on top of that, they have a query engine where they bring everything in using Apache Arrow and a variety of other technologies so that they can create a very high-quality query experience out of data that may be in all sorts of different repositories with all sorts of different operational characteristics. In addition, they then have a cataloging layer and a transformation layer which is similar, where they go through transformations based on the tables. And it is grid-like. The strength of their approach is that they now no longer have to be limited to the data in the data warehouse. Now, the difference in their approach from yours is they are not able to do a read-write trip and I assume you can do that at any point in your process.
From a technology perspective, we’re largely agnostic to something like Dremio. From a pure perspective of what Sigma is, we’re translating that spreadsheet interface right now into SQL. So we could certainly leverage something like Dremio. We could use them as a data source. But I think the need for Dremio and products like that brings up an interesting trend in the industry, which is this drive towards the semi-structured data and that’s something that we’ve really been very focused on. Some of the earliest use cases we’ve seen around cloud data infrastructure include people sending essentially raw event data or tracking data—events, logs, all that—straight to that data warehouse and then every row might have a slightly different shape. It doesn’t fit necessarily in a traditional relational schema. And one of the problems that we’ve focused a lot on is, how you make that type of data accessible to a user, where it can’t be pre-modeled because it varies so much. How do you take semi-structured data like that and actually make it accessible to someone in marketing or finance or any of these types of departments?
But you do rely on the data warehouse’s ability to handle that data to do that?
Correct. So we’re leveraging the data that’s in the warehouse and leveraging its support of JSON but I think, getting back to your point on Dremio and sort of the transition here, being able to do those transformations very late in the process, when you actually are looking specifically at the data that you want and giving you a visual interface to interactively pick which fields, what structure do you want to extract out of this and play with—that is something I’ve only seen in our interface.
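As a rough illustration of what a late-bound transformation on semi-structured data means, the sketch below parses raw JSON events only at read time and extracts just the fields a user picks. It runs in plain Python on made-up sample events; in the setup described above, the warehouse's own JSON support would do this work, and the helper names `extract` and `project` are hypothetical.

```python
import json

# Raw events as they might land in a warehouse: every row has a slightly different shape.
raw_events = [
    '{"type": "click", "user": {"id": 1}, "page": "/home"}',
    '{"type": "purchase", "user": {"id": 2}, "amount": 19.99}',
    '{"type": "click", "user": {"id": 1}}',
]


def extract(event: dict, path: str):
    """Follow a dotted path such as 'user.id'; return None if any step is missing."""
    value = event
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def project(raw_rows, paths):
    """Parse each raw JSON row at read time and keep only the requested fields."""
    return [[extract(json.loads(row), p) for p in paths] for row in raw_rows]


# The user chooses the fields interactively; nothing was modeled in advance.
table = project(raw_events, ["type", "user.id", "amount"])
for row in table:
    print(row)
```

Rows that lack a chosen field simply get an empty cell (`None`), which is why this kind of data does not need a fixed relational schema up front.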
What’s your fit into the data supply chain? In general, most data supply chains have the following layers. They have the data sources, wherever the data is coming from. They have the landed data, which is the data that comes into the environment that’s being part of the data supply chain. The next level is the modeled data, where you’re taking all the different data and you’re creating the model entities, like user, customer, product, geographies, etc., and you’re creating either copies or views of that data so that then you can look at that and have it be coherent with how you want to think about your business or the problem at hand. Now, the trip between the landed data and the modeled data may require either heavyweight or lightweight ETL. Lightweight ETL is the type of thing that you see people suggesting that is done in the data catalog type companies or the Dremios of the world, where it’s smaller, easier transformations done on well-behaved data. But you may have some lightweight ETL and some heavyweight ETL on the way to your modeled data. From your modeled data to your purpose-built data, which is the next layer, that’s mostly using lightweight ETL. The modeled data is well-behaved and you’re using lightweight ETL to create new versions of data that are exactly what you want for that analysis. And then that purpose-built data is used by applications or analysts and everything.
The impression I’m getting from you is that Sigma is really going to shine in that layer between the modeled data and the purpose-built data. That’s where you’re going to create the most value. You’re not really about creating that modeled data—although you could—really, you’re more about taking the modeled data and making it into purpose-built data and allowing people to do that for themselves, using a very familiar paradigm.
I might speak to it in a slightly different way in that I agree there are many layers but traditionally, when I’ve seen this drawn by any existing BI vendor, they draw it as a pyramid and at the top, it’s going to end in a dashboard or a chart or some sort of visualization. And I think that’s been essentially the biggest challenge and, in my opinion, failing of a lot of the existing BI tools, that they really focus it on this very centralized, very limited model. What they ignore is that almost every user in the company is trying to download data and get into a spreadsheet and then play with it. So I would think of it as, you have data that comes in from the source world and you are going to have many different use cases for that data. It can be that you’re modeling it, it can be that you’re writing direct SQL queries, it can be that you’re building dashboards, it can be that you’re building ad hoc reports and explorations. All of that work is going to work together and that’s really the goal that I have for Sigma.
Would you say that in your current customer implementations, people are working more at the beginning of this pipeline and these layers or toward the end?
We really see everything across the board. Let me give you one example. So Lime, the scooter and bike rental company, is one of our customers. Lime is a rapidly growing company; they have people in essentially every city or municipality around the world. As you can imagine, the metrics that, say, Paris wants on scooter usage will be different from what San Francisco wants or what Sydney wants. The next part of this is that every scooter and every bike they have has a GPS on it. You have massive amounts of data and essentially an unlimited number of questions that you may want to ask about it. And so what they’re doing with Sigma is that yes, they have a relatively small central data team that does some centralized reports and essentially is a resource for people in how to build and model some of that data but their goal is to really enable everybody throughout the company to use Sigma and that’s how you end up with hundreds of people doing the analysis as opposed to five or six in a central data team. The idea is that you’re going to take advantage of anybody who can use a spreadsheet to get what they want and make it work.
Now let’s go to the next challenge, which is really a difficult one. Most companies have spreadsheets all over the place. And sometimes the spreadsheets become artifacts that are important and become the canonical version of something. I can understand that all of these spreadsheets start growing and growing and growing and then because you are connected to a data warehouse, you are able to access these spreadsheets. But now, you have an explosion, usually, of instances of spreadsheets. And the popularity of the spreadsheet paradigm brings with it this massive governance problem and so after a while, the question is, how do you find and how do you catalog and how do you understand which spreadsheet to use and which spreadsheet is authoritative? How do you deal with that?
That’s a great question and very much to the point of wanting to on one hand enable people, on the other hand balance that with governance, centralized management, and accountability. In our system, because it’s all in a cloud environment, you get things like search. I can go find, “Have other people already done this analysis and can I reuse it?” You get things like administration, where I can actually say, as an administrator, I want to annotate this, I want to badge it, that is, mark it as a trusted source, the one that I want to recommend you use. You can even do the exact opposite. “This is done wrong and I want you not to use this anymore, I want to replace all occurrences of this.” You can understand things like lineage. Where did this data come from? What data is it connected to? Who’s using it? All these types of things that you get in a cloud-first world that you don’t get in a traditional spreadsheet world because you’re all running around on your PC. Those core ideas of never move the data, run it in a connected environment, always with the live data, in an environment where you get version control, the kinds of things that traditionally have been only available to software developers, we’ve brought into this world so that you can get that balance of governance with self-service.
Now let’s talk about some of the hard questions about what you’ve been able to do now and what you’ll be able to do in the future. One of the things that’s really cool about Dremio’s model is that they can do cross-repository joins very easily. What’s your approach to multiple data warehouses?
In general, our advice for people is to pick an underlying bit of data warehouse infrastructure and essentially move all of your data there. And so whether it was Snowflake or BigQuery or Redshift or in the future it could be Dremio, we leverage that technology and its capabilities. So we’re not trying to build a separate place for your data. We want your data to at least logically live in one place and we will leverage that.
You aren’t a read-write system, are you?
We are read-only against the underlying warehouse but because you can write essentially arbitrary formulas, you can compute whatever you’d like. And so one of the advantages for a lot of users in the data team is that they know if they expose this to a lot of people, they can’t actually break anything or damage anything because they’re playing essentially with their own view of it.
Going back to the cross-repository join, do you think that that’s an important feature to add in the future?
I think it’s an interesting question, but it’s a challenging one to predict. Federated databases have certainly been an idea that has been around for a long time. I think where I’ve seen traction is people having, say, semi-structured data in something like S3 and wanting to mix that with structured data in a traditional warehouse. Now, how you actually go about solving that, whether it’s an additional feature in the data warehouse or a separate platform like Dremio, I think the jury’s out on that.
In this productized analytics framework that I created, at the bottom of it you have a custom kitchen where you can do anything. You know, that’s the most powerful layer but it requires the most skill. Then up from that, you have dinner in a box, like where you get a bunch of components and you have to put them together yourself, like Blue Apron. And above that, you have something like Starbucks or an artisanal brew, where you have lots of options and you can decide that you want a half-caf soy milk latte with extra squirts of mocha or whatever and somebody who knows those options can do it. And then finally, you have what provides the most scalable value, which is like the McDonald’s value meal, which is just a simple dashboard that answers a question in the context of a business process. And so, have you noticed in your implementations that a hierarchy of users who are supporting each other has developed so that each user is doing the self-service they are able to do and moving forward that way?
I think we break your model a little bit in that we would argue we have a simple interface but have virtually all the power of SQL. But what we see is generally a partnership and it’s an iterative partnership with our customers. Just like in the Lime example, you will see a central data team that is still managing the governance, that is still a resource for the company, that is still often the place where they do some of this centralized modeling in Sigma and then you see a much larger number of users that are working on top of that data or even sometimes with the raw data itself but with the full knowledge of the data team, under the security guise of the data team so you can still choose what data you want to expose and how you want to manage permissions. It’s a peer-to-peer model as well and so we really feel very strongly that we don’t want to go to the full extreme of it’s only centralized modeling data and we don’t want to go to the full extreme of it’s just the Wild West and you can just play around on your own. We want the best of both worlds. We never want you to feel that you are limited. I don’t want you to feel that, “I can’t do this because I have to ask someone else to build the report for me.” If someone else has already built a report, I want to be able to leverage that.
How is your catalog run so that people can find the data? Do you have a catalog hierarchy?
We have a few organizing concepts in Sigma. We have the notion of an organization, which is typically a company, similar to an organization in Salesforce, and then we have groups that we call teams, where you can have essentially arbitrary collections of people and those teams and those companies can actually own and host content. It works like a folder system where you might structure some data under sales or under marketing and manage both permissions and organizing your content just as you would organize content in, say, Dropbox or Google Drive.
Is there anything else about Sigma that we haven’t discussed that you’d like to bring up?
We’ve touched a little on the fact that, because we’re a cloud system, we have this underlying intelligence and knowledge about what’s going on in the system: how the data’s related, how it’s used, and what’s important. That is really unique to cloud-first systems. I think if you’ve built for the cloud, you can build this type of thing in. If you’ve built for a world where people might have PCs or might run in an offline environment, you really can’t do that. If you look in our product, we have a feed right now that will show you, from a timeline perspective, what people are doing in your company. What analysis are they doing? How can I discover that in a push model, where it’s actually shown to me as opposed to my having to go search for it? It hits on a lot of the problems that you’ve been pointing out: How do I get to the most important information? How do I know what other people are doing? How do I leverage someone else’s work? I want to make all of those seamless, even for business users.