Not All Cloud Data Warehouses Are Created The Same: A Podcast with Actian
On this edition of the earlyadopter.com podcast, Early Adopter’s Dan Woods dove into the question of which cloud data warehouse best fits your needs. Joining him for the conversation were Raghu Chakravarthi, Actian’s chief product officer, and Bill Westfall, Actian’s VP of customer success and professional services. Early Adopter recently completed two papers with Actian, one on replacing the concept of total cost of ownership for cloud data warehouses with total cost of usage, and the other on the five myths of cloud data warehouses. Those papers tie into this conversation, which covered:
- 3:00 — Why new cloud data warehouses are needed
- 10:30 — The new frameworks needed for cloud data warehouses
- 30:00 — Why TCO is outdated and the lessons Actian has learned about implementation
Listen to the full podcast or read an edited version below:
Woods: We’re going to talk about a really interesting topic, which is how do you figure out which cloud data warehouse is best for you? What we need to do is figure out what’s the real difference between the cloud data warehouse opportunity and the typical data warehouse opportunity because people think that when you approach a cloud data warehouse, you’re lifting and shifting a data warehouse use case from on premise into the cloud. That’s possible, but that ignores a huge amount of potential that you have in the cloud for doing things differently, creating new patterns of work, creating new ways of analyzing and using data, and also taking advantage of the new architectures in the cloud that allow you to perform a lot of the tasks that you do in the on premise world a lot more efficiently. What do you guys see as the important potential of moving the data warehouse and moving analytics to the cloud?
Chakravarthi: When customers think about data warehousing and specifically in the cloud, they’re looking for a few new architectures as well as enablers for what they want to do. Instead of the traditional lift and shift, what they are looking for is dynamic elasticity. They’re looking for performance at scale. And they’re also looking for a hybrid design. Traditional data warehouses, for example, typically live on prem. And they require dedicated hardware and run sometimes on customized hardware. The cost became truly unattainable at some point. And the hybrid approach, as well as the cloud data warehouses, came into play at that point. From a performance and scale standpoint, it’s very important for customers to scale to what they need during different parts of the day and during different workloads that are executing. They have to have the capability to contain costs and scale down. And typically the legacy data warehouses or the on prem data warehouses are sized for peak performance or peak usage. The second thing is hybrid by design. A cloud data warehouse supports data that runs in the cloud or data that’s residing in the cloud, and also data that’s primarily living on prem due to security constraints as well as other constraints that don’t let the customer move the data to the cloud. Hybrid design and a hybrid cloud data warehouse is an important concept to address. When you have these two, you’re really looking at a better cost for performance. Those three factors, performance at scale, hybrid design, and getting a better cost for performance, are what customers are looking for these days in a cloud.
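The contrast between sizing for peak usage and scaling with the workload can be sketched with a little arithmetic. This is a minimal illustration only: the hourly demand profile and the price per node-hour below are made-up assumptions, not figures from Actian or any cloud provider.

```python
# Hypothetical number of compute nodes needed in each hour of one day.
# Every figure here is an illustrative assumption, not vendor pricing.
HOURLY_DEMAND = [2, 2, 2, 2, 2, 2, 4, 8, 12, 16, 16, 16,
                 12, 12, 16, 16, 12, 8, 4, 4, 2, 2, 2, 2]
NODE_HOUR_PRICE = 3.0  # assumed price per node per hour


def peak_sized_cost(demand, price):
    """Cost when capacity is fixed at the daily peak, as with on-prem sizing."""
    return max(demand) * len(demand) * price


def elastic_cost(demand, price):
    """Cost when capacity follows demand hour by hour and scales back down."""
    return sum(demand) * price


peak = peak_sized_cost(HOURLY_DEMAND, NODE_HOUR_PRICE)
elastic = elastic_cost(HOURLY_DEMAND, NODE_HOUR_PRICE)
print(f"peak-sized: {peak:.2f}, elastic: {elastic:.2f}")
```

With this assumed profile, the peak-sized system pays for 16 nodes around the clock, while the elastic one pays only for the node-hours actually used. Real savings depend entirely on how spiky the workload is.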
The evolution of analytics can be explained in terms of two dimensions. One is how IT-driven, user-driven, or automated the analytics are, and the other is the user base, whether it’s business users, data analysts, data scientists, or intelligent applications. What are the demands made by this evolution? And why are cloud data warehouses able to meet them?
Westfall: I had some early relationships with Bill Inmon, the forefather of data warehousing. Data warehousing has come a long way, and so have the demands on data and analytics. Beyond the technical frameworks, data warehousing created a whole new way of accessing information. But over the last 10 to 15 years, the introduction of the Internet of Things, machine learning, and AI has placed huge demands on the architectural frameworks companies have to consider. It’s not just data warehousing in the cloud; a lot of companies need to access information that can’t be or won’t be moved into the cloud, at least for the foreseeable future. So it is becoming critical for analytics to bring together all that information, regardless of where it lies, and use it in new and creative ways: to look at business-to-business and business-to-consumer opportunities, and to create new services and products that help companies strategically.
I think what you’re essentially saying is that in a modern data management and analytics landscape, you’re going to have some data repositories and analytical capabilities that are going to be on premise because the data has certain properties that need to stay there. You’re going to have other data warehouse and analytical capabilities that are going to be in the cloud, but you’re going to want to create unified pipelines so that you can treat that landscape as if it’s one unified repository. And then, if you do that, you’ll be able to create pipelines that can support what you talk about in the slide—the statistical modeling, the machine learning—to achieve the predictive, descriptive, and prescriptive use cases by creating a pipeline that can put work wherever it needs to be in the cloud or on premise, but then treat the whole thing as a unified system.
Westfall: Nicely summarized. With the right infrastructure and the right architectural framework, like what Actian has, that access can be mostly transparent to the business users, because the last thing the business (the C-suite and some of the business decision-makers) wants to do is worry about how that all gets tied together.
Why are cloud data warehouses different from on premise enterprise data warehouses?
Westfall: It starts with some of the early foundations of what data warehousing was intended to address. It was really about providing a simple framework, a different way of looking at relational database technology, so that you could focus it on marketing or finance or various business channels within an organization. Data in and of itself is not very intelligent. It’s left to IT to build out frameworks so that end users can access the data in a creative, informational way. I’m making the distinction here between data and information: data being basically just raw data, information being the intelligent compilation of that data. What’s happened now in the evolving information paradigm is that there are many more repositories of information that have come into that framework, as mentioned already: machine learning, Internet of Things, artificial intelligence. And there are companies with terabytes, if not petabytes, of information just sitting there, mostly unused. It’s critical for companies to spend the time to look at all of those data sources, whether the data is ingested in real time to drive discovery or whether it resides on premise, in the cloud, or a combination of the two. The evolving information paradigm is to look at adapting and expanding the current field of vision based on all those data sources and having the tools, the technology, to tie it together. Companies are having to create some very complex views and infrastructure to do that. Actian has a way to do that behind the scenes, again, so that the business decision-makers don’t have to think so much about the architectural framework.
The idea here is that you go from a simpler world, in which enterprise applications are being consolidated into a data warehouse. Then you put different tools on top of that to create analytical systems out of it. In the modern world, you have many more sources of data and many more repositories. You want to be able to unify those in data pipelines that bring data together and shape it into purpose-built forms that then support different types of applications. This seems to be the impetus for the data ops movement: the recognition that you’ve got so many more data sources and so many more data pipelines to support that if we do this with handcrafted, super low level sorts of systems, we’re never going to be able to support it at scale. And we need to productize the management of this complexity somehow.
Chakravarthi: Yes. Architecturally, cloud allows companies to bring compute to where the data lives so that they can do faster processing and produce insights faster. The way you do that is federation. Federation is enabled through the data pipelines that you’re talking about, which expose data in more of an API format. Architecturally speaking, the cloud data warehouses are now building in this whole notion as the data fabric. So in Actian Avalanche, we have a data fabric based on Spark that’s built in. And voila, right on top of Spark you can start building data pipelines and leave data where it is, in your best-of-breed locations and best-of-breed platforms. And you bring it all together in this new enterprise or cloud data warehouse format through pipelines.
Westfall: When we move away from the concept of data (just unintelligent data) to this concept of information, more and more companies are realizing that information is key to building new strategies, products, and services for their own end customers. What that’s led to is a huge upsurge in the number of end users. That’s why, as companies start to look at these data warehousing platforms, whether on premise, in the cloud, or the hybrid solutions that Actian supports, concurrency becomes critical. It has to be a scalable architectural framework.
In other words, just being able to get the right answer isn’t enough. You have to be able to get the right answer in a snappy way in order to really keep all this going.
If you have this ability to extend the data fabric across a much larger data landscape, if you have the ability to then query it in a unified manner and then get that answer quickly, you now are able to do different kinds of work based on that paradigm and capabilities. What patterns are made possible to handle this with cloud data warehouses?
Chakravarthi: One of the main things is that a cloud data warehouse is easy to use. It’s easy to operate, it’s easy to consume, and it’s easy to integrate with. And that’s all brought about by this whole notion of a managed service. These days, in a cloud data warehouse, the emergence of database automation has really been spurred by this offering of a managed service.
The other thing that seems interesting about these capabilities is that you create functions that do really complex things. One of the capabilities in your product and in others is running machine learning algorithms inside the database. How does that work?
Chakravarthi: Advanced analytics is such a complex exercise that only a few data scientists in a company can afford to learn about it and be knowledgeable in it. But for every data scientist, there are maybe 100 business analysts who are data-savvy but who are not algorithm-savvy. So in-database UDF-based advanced analytics implementations are really making a huge play here. These kinds of advanced analytics algorithms can be implemented directly in the database and called via SQL as a function. You pass it inputs and get outputs back, and that enables a business analyst, and not just a data scientist, to use advanced analytics algorithms and build them right into an application, a set of insights, or a dashboard, for example. That speeds up getting advanced analytics operationalized.
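The pattern of calling an algorithm from SQL as a user-defined function can be sketched in a few lines. This is only an illustration of the general idea: SQLite from the Python standard library stands in for a cloud data warehouse, and the `churn_score` function, its weights, and the `customers` table are all hypothetical, not part of any Actian product.

```python
import math
import sqlite3


def churn_score(monthly_spend, support_tickets):
    """Toy logistic score. The weights are invented for illustration."""
    z = -1.0 - 0.02 * monthly_spend + 0.8 * support_tickets
    return 1.0 / (1.0 + math.exp(-z))


# SQLite is an assumed stand-in for the warehouse's SQL engine.
conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it by name with two arguments.
conn.create_function("churn_score", 2, churn_score)

conn.execute(
    "CREATE TABLE customers (id INTEGER, monthly_spend REAL, support_tickets INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 120.0, 0), (2, 40.0, 3), (3, 75.0, 1)],
)

# An analyst only writes SQL; the model logic stays behind the function name.
rows = conn.execute(
    "SELECT id, churn_score(monthly_spend, support_tickets) AS score "
    "FROM customers ORDER BY score DESC"
).fetchall()
for customer_id, score in rows:
    print(customer_id, round(score, 3))
```

The design point is the division of labor: whoever understands the algorithm writes and registers the function once, and anyone who can write a `SELECT` can then apply it to their own data.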
The databases themselves are becoming more powerful in certain cases by adding GPU processing to them and also—especially in the machine learning capabilities—distributing the processing of those over huge gangs of GPUs. Is that the kind of thing that’s happening under the hood automatically in a lot of cloud data warehouses?
Chakravarthi: Yes, absolutely. And cloud data warehouses like Actian’s are embedding technologies like TensorFlow and H2O.ai directly into the database and have in-database calls to these APIs. These libraries leverage GPUs. They also leverage ASICs, tensors that are built right on top of ASICs, which are obviously more cost-effective. And all of these libraries are now supported natively in cloud infrastructures like AWS, based on instances that these cloud service providers offer to support them, whether the algorithms are GPU-based, ASIC-based, or even FPGA-based. All these cloud data warehouses have user-defined functions where the call to TensorFlow or H2O is built. And secondly, the call to H2O or TensorFlow has to be accompanied by a certain amount of data, whether it’s a training dataset or some kind of a scoring dataset. That is enabled via the data fabric that is built into the architecture.
Cloud data warehouses by definition use more cloud computing, and the best ones support hybrid operational and analytic distribution. You can have a cloud version, an on premise version, and you can treat them all as if they’re one unified fabric. But most of these cloud data warehouses are able to reach out to other similar repositories or processing capabilities and the integration is pretty tight. You can create pipelines that then can be managed without lots of complexity. So where are we going with these distributed data processing frameworks?
Chakravarthi: You just described this whole notion of data federation. Data needs to be distributed. Some data has to live in the cloud. On the other hand, the behavioral data that I’m collecting about a customer, to which I’m applying machine learning to understand customer patterns, values, and attributes, is going to be primarily on prem because I want to protect that data. I don’t want to expose it. There has to be a federated query in a cloud data warehouse environment that joins data from the cloud with on prem data and produces business insights that help me decide which campaigns work and which don’t. All of this is getting built on top of pipelines such as Spark and Presto; whether it’s SQL-based or Python-based or R-based doesn’t matter. Every insight that I produce and learn from is going to be pushed back into the next campaign that I’m running tonight or tomorrow night, making that campaign better and more targeted.
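A federated query of this kind, one statement joining cloud data with on prem data, can be mimicked on a small scale. The sketch below is an assumption-laden stand-in: it uses two separate SQLite databases attached to one connection rather than Spark or Presto, and the table names, columns, and rows are invented for illustration.

```python
import sqlite3

# One connection plays the role of the warehouse's query engine ("cloud" side).
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database stands in for the on prem store.
conn.execute("ATTACH DATABASE ':memory:' AS onprem")

# Campaign results live in the "cloud" database (schema name: main).
conn.execute(
    "CREATE TABLE main.campaign_results "
    "(customer_id INTEGER, campaign TEXT, converted INTEGER)"
)
conn.executemany(
    "INSERT INTO main.campaign_results VALUES (?, ?, ?)",
    [(1, "spring", 1), (2, "spring", 0), (3, "spring", 1), (4, "spring", 0)],
)

# Sensitive behavioral data stays in the "on prem" database.
conn.execute(
    "CREATE TABLE onprem.customer_behavior (customer_id INTEGER, segment TEXT)"
)
conn.executemany(
    "INSERT INTO onprem.customer_behavior VALUES (?, ?)",
    [(1, "loyal"), (2, "new"), (3, "loyal"), (4, "new")],
)

# A single query joins across both stores without copying either table.
rows = conn.execute(
    "SELECT b.segment, c.campaign, AVG(c.converted) AS conversion_rate "
    "FROM main.campaign_results AS c "
    "JOIN onprem.customer_behavior AS b ON b.customer_id = c.customer_id "
    "GROUP BY b.segment, c.campaign "
    "ORDER BY b.segment, c.campaign"
).fetchall()
for segment, campaign, rate in rows:
    print(segment, campaign, rate)
```

In a real deployment the two sides would sit on different networks and the engine would decide how much work to push down to each source. The sketch only shows the shape of the query the analyst writes.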
Why is TCO an outmoded metric for cloud data warehouses?
Westfall: The general principle of total cost of ownership is still somewhat applicable. But in the cloud, with infrastructure as a service, platform as a service, and now more and more data warehousing as a service coming online, it’s more of a managed services offering. Now you pay as you go. You don’t have the complex infrastructure that you have to manage yourself with network engineers and database engineers. And there are different pricing models for that. There is this concept Actian supports where the compute is separate from the storage, so you pay for storage and you pay for compute in different ways. So this new model focuses on total cost of usage.
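When storage and compute are billed separately, a usage-based cost can be computed directly from what was stored and what actually ran. The following is a minimal sketch under stated assumptions: the `UsageMonth` type, both unit prices, and the sample month are hypothetical and do not reflect any vendor’s real pricing model.

```python
from dataclasses import dataclass

# Assumed unit prices, for illustration only.
STORAGE_PRICE_PER_TB = 25.0    # per terabyte stored per month
COMPUTE_PRICE_PER_HOUR = 4.0   # per hour the compute cluster is running


@dataclass
class UsageMonth:
    storage_tb: float      # average terabytes stored during the month
    compute_hours: float   # hours the compute cluster actually ran
    queries: int           # queries executed during the month


def total_cost_of_usage(month: UsageMonth) -> float:
    """Storage and compute are metered and billed independently."""
    storage = month.storage_tb * STORAGE_PRICE_PER_TB
    compute = month.compute_hours * COMPUTE_PRICE_PER_HOUR
    return storage + compute


def cost_per_query(month: UsageMonth) -> float:
    """A usage-based unit metric derived from the monthly bill."""
    return total_cost_of_usage(month) / month.queries


month = UsageMonth(storage_tb=10, compute_hours=200, queries=50_000)
print(f"total: {total_cost_of_usage(month):.2f}")
print(f"per query: {cost_per_query(month):.4f}")
```

Because compute hours drop to zero when nothing is running, the bill tracks activity rather than installed capacity, which is the shift from ownership to usage described above.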
What do you see as the key differences in the way that cloud data warehouses are constructed? Why are they not all the same?
Chakravarthi: There are multiple choices here. Let’s go back to the legacy data warehouses. There was Oracle Exadata, which, architecturally speaking, is an SMP, which means I can scale it vertically to however big a situation I need. And then there was MPP, and Teradata pioneered MPP, so I can scale horizontally and keep going that way. The same kinds of concepts apply in cloud data warehouses as well. Snowflake is going after the use case of business units within a large enterprise, which is the long tail use case in a sense. Concurrency is not that important. I have maybe four or five users hitting a few terabytes of data that I want to quickly analyze, but I want it to be easy. I don’t want to go manage and maintain and monitor all of these things. Within a business unit, I can produce a dashboard and show some insights and make my boss happy. Whereas when you really want to productionize this at an enterprise level, you really need to think about concurrency. You need to think about scale. And you must also have elasticity so that you can manage cost at scale and not just keep ballooning up the cost like a legacy data warehouse would. So not all cloud data warehouses are the same. You have to really think about your particular use case and choose the warehouse that fits it. Cost is important, but like we said in the previous topic, cost per query, cost per execution, and the opportunity cost are a lot more important than just measuring everything by total cost of ownership.
What are the learnings that you have had in terms of cloud data warehouse strategy and analytics strategy?
Westfall: Just to be clear to the audience, I’m not part of sales. I’ve been managing, running, and operating professional service organizations for 25 plus years. In those years, I’ve seen the evolution of people, process, and technology and how each of those elements comes together to support business needs. The information that’s contained in data, the usage, and the architectural framework all serve one purpose, and that is to help customers become more competitive and introduce new products and services. So Actian’s lessons learned are really foundational lessons based on those experiences and my experiences over those many years. Customers need to be careful about chasing the next bright shiny object. It’s easy to have a vendor come in and say, “We’re the fastest. We’re the smartest. We’re the most cost-effective.” Ultimately, you need to take the time to evaluate what your needs are. You have to be careful with the scaling, the concurrency, and the number of users because it may seem bright and shiny when you first get it out of the box, but then as your organization starts to leverage it and it grows, it may become more costly than you would expect and it may become more complex. The other element of these lessons learned is that there’s no simple answer to this architecture modernization. Companies need to take the time to understand their business needs tactically and architecturally, and then what is going to help them evolve to the next stage of their own evolution. That takes time.
Give me an example of a future-proofed aspect of an architecture.
Chakravarthi: Let’s take the concrete example of migrating from Teradata. It’s one of the hardest data warehouses to migrate from. What we are doing is taking a phased approach. This is the biggest lesson that we have learned as a company: you shouldn’t mix migration with modernization. That’s how these modernization exercises fail. The first step is to do a lift and shift, and to scope out that project so that you have the right set of hardware and the right target to migrate to, whether it’s a hybrid cloud solution or a completely public cloud-based solution. You do the migration first. In this process, you need two things: a good technology partner, which the cloud data warehouse vendor is, and, secondly, a good systems integrator (SI) partner. Then comes the notion of modernization, which is all about modernizing your pipelines, using the infrastructure that the new cloud data warehouse provides to you, whether it’s pipelines based on Spark or Presto or other mechanisms, and creating new opportunities and new insights based on the data that’s already been migrated over. You modernize by writing API-based access to data. Now you can do real-time access where the legacy systems supported only batch-based access. Then you can really get into the next level and scale your operation with a controllable cost.
Westfall: One thing I wanted to add before we leave migration is that I’ve had many, many customers over the years underestimate the time and the cost of migration. Don’t underestimate that. It’s always going to take longer and cost more than you expect.
What have you seen as the reality of the way your staff changes when you adopt a cloud data warehouse?
Westfall: One of the biggest challenges for companies that have the ability to move everything into the cloud is what to do with existing staff. Some forethought has to be given to how you reallocate these resources, or whether you do. One of the bigger issues is that a lot of the true data-warehousing-as-a-service cloud providers will tell you that they offer zero-administration cloud platforms. There’s no such thing because, in the end, you will still need staff to monitor and understand those environments, even in the cloud, even under a fully managed services platform. There are always going to be reasons to have people monitoring, through your own dashboards, what’s happening: the growth, the queries, and how they’re impacted by concurrency.
What are the lessons related to governance and oversight?
Westfall: This is sometimes overlooked. In the 20-some years I’ve been helping companies bring together people, process, and technology, the most successful implementations have happened when you can bring together an oversight and governance team composed of IT, the business, executives, and leadership, even if IT initiated the move to the cloud or hybrid solutions. A lot of these companies forget to bring in some of the vendors. You might have an offshoring leadership team that’s providing resources. You might have a tool vendor that you’re using (a Tableau or Looker, for example) that provides some of your key executive or leadership reporting. Just keep in mind that you should have a team overseeing it, with the right stakeholders all committed to the success of the project and an eye toward the future state.
And the vision of the data warehouse was set by the idea of the product. And in the cloud, you’ve got a new vision. Each cloud data warehouse supports different kinds of capabilities. You’ve got a much larger vision white space to define yourself into. This has been a hugely interesting podcast for me. Thank you.