The Power of Backup: A Podcast with Clumio’s Poojan Kumar
On this episode of the Designing Enterprise Platforms Podcast from Early Adopter Research (EAR), EAR’s Dan Woods speaks with Poojan Kumar of Clumio. Their conversation focused on why backup isn’t yet a key source for data mining, analytics, and applications. Kumar is one of the co-founders of Clumio, a modern cloud backup solution. Their conversation covers why, ideally, when you create a backup repository, you have a repository of all the data in a company. Some modern solutions make that data accessible and organized: rather than just taking images of the backups or storing data in the rawest possible form, the solution has actually understood the data and stored it with a schema. Once that occurs, companies can access the data independently of the original system. The question, then, is why this data isn’t being used more often for data mining, analytics, and applications.
This is an edited Q&A of their conversation that can be heard in full on the podcast below.
Woods: Poojan, would you introduce yourself and just tell me a little bit about how you came to start Clumio?
Kumar: Absolutely. Thanks, Dan, for the opportunity, really excited to be doing this. My name is Poojan Kumar. I’m one of the co-founders of Clumio. I’ve spent the last 20-plus years in the enterprise industry at Oracle, at VMware, then at a startup of my own called PernixData, which was acquired by Nutanix, and then Clumio. So it’s been 20 years of learning enterprise and witnessing all the transitions that have happened. One of the biggest transitions we are witnessing right now, the cloud, was the big inspiration in jumpstarting Clumio, while taking all the learnings of the last 20-plus years of building enterprise products.
You’re a fully realized enterprise software entrepreneur, in other words.
Absolutely. I wrote a lot of code in my life and have done pretty much every aspect of enterprise company building. So I feel ready to be able to do something like this with the 20 years of experience I have.
Why was it that you decided that backup was the place that needed the most help?
You could say companies fall into three categories. There are companies that do something with data. There are companies that watch companies do something with data. And then there are companies that build tools for the first two. If you look at the history of enterprise computing, the first category contains the biggest companies, because those are the companies doing something meaningful with the data. So it all starts with the data. Data is the new oil, or whatever metaphor you want to use.
It’s interesting that you say the data is a new oil. I heard another interpretation of that saying a better way to say it is data is the new sun. Because oil is used up when you use it. But data is not. It keeps radiating value and it doesn’t burn out the way oil does. But it’s interesting that you make that distinction, because I think you’re right in that if you look at the companies that were earliest in becoming the largest enterprise software companies, they’re the ones who created this authoritative set of data describing business operations.
Exactly, right. It all starts with whether you have the data, and then you can do something with it. Obviously Salesforce could do something with all the CRM data once they had it, and build an app ecosystem on top of it. Snowflake, a sister company of ours, just went public a couple of days ago. You start with the data, then the analytics use case, and then you go and build more applications on top of the data. That was the fundamental reason for us to start with this use case: it gives you access to the data, and we also saw a big void in the industry in terms of what was not being done, especially in the context of this transition to the public cloud. Customers needed to be taken on a journey into the cloud the right way, and nobody else was doing it.
The idea is that Clumio is all about solving the data backup problem, given the realities of the public cloud.
Correct. Long term, it’s about going and figuring out can you do more with this data that you’re backing up? But yes, today it’s about backing it up.
There’s a long history of companies that have been very successful, who’ve done enterprise backup since the beginning of enterprise software, and now you’re saying that there’s a fundamental change going on and that there’s opportunity for new companies. So what has changed and what is that opportunity?
If you looked at the world before the public cloud, it was a data center world where I would buy a piece of software or hardware and essentially build the application and its infrastructure myself, and then buy a backup solution that would sit on premises with my data center and my application, and maybe replicate to another data center. But really, I would build all of these solutions myself. Even up until the public cloud arrived, with companies like Oracle and VMware, you pretty much did some version of the same thing. But if you look at what has happened over the last 13 or so years, enterprise computing has changed drastically. No longer are you just doing this. This is not a world that’s going to transition overnight, but now data also sits in applications natively built in the public cloud, where you’re not managing a bunch of the infrastructure yourself. Gone are the days where you would get CRM software from Oracle, buy a bunch of hardware, and build your own CRM system. You just go to Salesforce now. And something similar has happened to email. Every vertical has seen this transition: whatever can be SaaS-ified today gets SaaS-ified, and for everything else, if you can leverage infrastructure as a service or platform as a service, you’re going to build the application using IaaS and PaaS in the public cloud. But the thing that’s constant about this journey is that for any typical enterprise today, the data is spread out. It’s no longer sitting in one place like it was before the public cloud and SaaS solutions arrived. Now it’s spread out between SaaS solutions, applications built in the public cloud, and anything still sitting in your on-premises data center.
And so that requires you to think about protecting all of this in a very different way. It can’t just be a solution that protects my on-premises data center, because that’s not enough. I have data that’s everywhere, so I have to think of a solution that, firstly, works across all of these data sources. And secondly, you don’t want to be managing some of these things yourself. You’re going to use Salesforce for CRM or Snowflake for analytics or Microsoft 365 for email, and I don’t want to be in the business of employing people, and maybe I can’t even find people who want to do this job anymore, to manage this infrastructure and deal with backup on my own. I want to consume a service that can back up all of my data sources across on prem, cloud, and SaaS with a single solution, a service that I know is backing everything up at all times and is there when I need it for a restore or after a ransomware attack. So that requires you to think about it differently.
What new problems are there to solve and why can’t the existing enterprise backup vendors solve them?
If you look at all the existing enterprise backup vendors before we came into the picture, they were really built for a constrained world: a fixed amount of compute and storage running in a box, or maybe a cluster of boxes, running software that the vendor gives to the customer, and the customer runs and manages it. Number one, it was constrained to doing just that.
You’re saying that those systems essentially were also on-premises systems.
Correct. It was a combination of software and hardware, but running on premises. And if you tried to take that software and run it in the cloud, you still are essentially delivering a piece of software to the customers to manage themselves. And by the way, it can be cost-prohibitive because the cloud has to be taken advantage of in a very different way. It can’t be the way you did it before. You have to essentially take advantage of cloud native architectures and build a solution that is amenable to the cloud economics. So that is one thing that comes in the way of the legacy solutions. The other thing that comes in the way is none of these solutions were services. It’s like saying I couldn’t take Oracle CRM and make it Salesforce. It didn’t work like that. Salesforce had to be built from the ground up. It had to be built as a multi-tenant solution. But each customer, each tenant is essentially thinking of consuming Salesforce as the only thing, but the reality behind the scenes, it’s a completely multi-tenant system. Which, again, the legacy products never had to do because that was never the use case, right? You never were looking to deliver a multitenant service, you were giving a piece of software and hardware to an enterprise to run it themselves in a data center.
Basically, in the new world you’re dealing with a software-as-a-service backup solution. Second of all, you’re dealing with the fact that it’s got to run in the cloud in a new way. Salesforce and most of the other large SaaS companies actually had to invent their own computing infrastructure. Salesforce created its own language that was used to build Salesforce so it could work in that multi-tenant way. Becoming that allows you to get the efficiencies of the public cloud and the elastic scalability, so you can deliver your service in an economic way. Then the idea is that this has to run in the cloud in a different, more automated way. What you’re presenting to the user is not a complex airplane dashboard they have to manage. You’re presenting a much simpler, more automated service that they can of course configure when they need to, but it’s a whole different ballgame. So you can’t just take an enterprise backup product and shoehorn it into that; you essentially have to start from scratch.
Exactly. And then add that the same customers have applications running in the public cloud, potentially across multiple regions, and across multiple public clouds in some cases. Maybe because I want to be a multi-cloud company; I don’t want to get locked into a single cloud. As you do all of these things, and add the complexity that I’m definitely using multiple SaaS services anyway, you want a single platform to protect all of this data across all of these public clouds. So the platform has to span and work seamlessly across every region and every public cloud, and also be able to back up SaaS data sources. All of this requires somebody to come in and rethink the whole backup story, and deliver what the customer needs based on where the customer is going, not where the customer was.
Do you back up on premises as well, or just the cloud?
It’s about taking the customer on a journey. You can’t just rely on where they want to go; you also have to meet them where they are. So today, our platform does support some use cases on premises, like VMware. We support the AWS cloud today, and we’re working on the other public cloud vendors. And we support Microsoft 365 as our first SaaS application, with other SaaS applications in the works. So we service all three legs of the stool, so to speak, wherever the customer has their data today.
What is now newly possible based on this data you’ve collected from SaaS and the public clouds into this new cloud backup?
The cloud provides you not just scale; it also provides access to a lot of services you can leverage to do things that in the previous world would require custom software you had to install and manage yourself. Take Google Photos, for instance. The fact that it’s sitting in the cloud allows it to run image recognition over those photos, so now you can search for a dog, or classify so you can find all the pictures your son or daughter is in. That is the power of the cloud. If I tried to do that with the processing power I had on a phone, it would be very, very hard. And that is what we’re talking about when you take this backup data into the cloud. Before, you were stuck with the equivalent of your phone’s computing power in the infrastructure that ran the backup. And by the way, that infrastructure was designed to do the backup and nothing more, because it did not have the storage or compute to do anything more in the first place. With the cloud, on the other hand, you start off with the data, and as long as you have the data there, you can kick off an army of Lambdas and do a small bit of computing on the data sitting in the public cloud. That is where opportunities now emerge. If I want to see where, in all my backup copies, I have any database or VM with credit card numbers stored, I can leverage a service in the cloud, kick off a bunch of Lambdas, and very quickly determine if I had any data with credit card numbers in it.
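The scan Kumar describes, fanning work out across serverless functions that each inspect one chunk of backup data, can be sketched in Python. This is a hypothetical illustration, not Clumio’s implementation: the chunking and fan-out machinery are assumed, and only the per-chunk detector is shown. A Luhn checksum filters out most random digit runs that merely look like card numbers.

```python
import re

# Loose pattern for 13-16 digit runs, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number: str) -> bool:
    """Luhn checksum: true for well-formed card numbers, false for most noise."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    digits.reverse()
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_chunk(text: str) -> list[str]:
    """Per-Lambda unit of work: report card-like numbers found in one chunk."""
    return [m.group().strip() for m in CARD_RE.finditer(text) if luhn_ok(m.group())]
```

Each invocation handles one chunk of the backup, so a coordinator can dispatch thousands of these in parallel and merge the results, which is exactly why this kind of sweep is cheap in the cloud and impractical on a fixed backup appliance.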
Another important use case for this centralized repository is compliance with data-oriented and privacy-oriented regulations. For example, you could have a Lambda function, and when Poojan talks about Lambda, he means serverless functions that run in the cloud and are automatically triggered when new data arrives or when something happens, so they can be constantly watching for whatever you want them to watch for. One of the things they could be watching for is personally identifiable information. That information is what’s mostly subject to the regulations: you have to know where it is, and you have to know how to forget it. Another type of use case is building a good customer 360 application: any time something new comes in about a customer from any other source, you can incorporate it. And then there’s other data mining, where you’re trying to find relationships between customers and products and services, especially across many channels and many products. So this all sounds great, doesn’t it? But I have looked and looked at the advanced backup companies that have something similar to what Clumio has, and I haven’t found anybody who can point me to a customer that’s really taking full advantage of this. Why are people not taking advantage of the incredible potential of a centralized backup repository?
You have to simplify things for a customer. You cannot solve a problem and hand over a bunch of steps and complexity along with it. So the reality is that you’ve got to do this and make it very, very simple. If the backup is sitting, in our case, in the cloud, and you deliver a service, then one fine day the customer logs into the service and sees something new show up. It’s a walled garden, and I am exploiting my walled garden. You see a new tab on the service telling you that you can enable, say, ransomware detection. I click it, enable that application, so to speak, and start seeing what kind of change rates are happening and whether the change rates deviate from normal. The service is informing me that some ransomware is going on in my account. Things like this you have to deliver with simplicity. You can’t deliver it by saying, oh, now take this piece of software, do this and do that, and by the way, you might not have enough computing power, so we can only do it once a month. If you do all of those things with complexity and restrictions, it will not work. So the reason all of these things you talked about, Dan, don’t really exist is that the platform and the solution have to be built to inherently support them. If you don’t have that, you can only talk about it on a marketing slide; you cannot really deliver it with simplicity to the customer, and hence the customer will not use it.
You’re saying that there’s got to be a layer of services built on top of the central repository to make sense of it, and then maybe another layer of services on top of that to do the analytics, and then you can start presenting things that actually solve problems for customers?
Absolutely. It has to be a platform that firstly has the data. Then think of it as no different than the iOS operating system: there’s the platform that has the data, and then there are the applications that Apple builds, or that ecosystem partners build. Same thing with Android. That stuff has to be built, and that’s the journey we are on at Clumio: starting with the backup use case, then building the equivalent of the email and calendar apps, the first couple of apps, and then opening up so that other applications can be built on top of the data, based on the customer’s agreement. If they agree for an application to be active, then the application can do those things on top of that customer’s data.
Clumio began because you noticed this shift going on with backups. You’ve been seeing a lot of differences between people who have naively implemented cloud backup systems, either for themselves or using products, and people who are intelligently taking advantage of the cloud. Could you go over a couple of the reasons you see so many cloud backup solutions, whether custom or productized, being done badly?
The problem is that before we came into the market, people were either building backup themselves or outside vendors were building it on top of the primitives that the AWSs of the world provide. Those primitives are around snapshotting: I can snapshot my data and keep that snapshot. But that snapshot is not really a backup because, firstly, it sits in the same account. And I have hundreds of accounts, so I have to do this in a piecemeal fashion. The reality is that even after I’ve paid through the nose for these snapshots, which were never meant to be backups, the snapshots still sit in the same account. So if there’s a fat finger or ransomware or something happens, somebody can delete my primary and my snapshot, which was supposed to be my backup, and I’m back to square one. I didn’t really protect myself. Those are the half-baked solutions that exist today: they either take advantage of snapshots or build just control-plane analytics on top of the snapshots, without fundamentally solving the true backup problem of protecting the customer. With Clumio, we separate ourselves from any of that because we don’t rely on snapshots. We have built our own data platform, which does all the deduplication, compression, and encryption. And the data is air gapped in our solution. So the customer is sure that even if their account gets hacked, or a fat finger or something happens and everything in their account goes away, their backups sitting with Clumio are not touched.
When you access the data, you also are accessing and understanding more about its schema so that when it is in the centralized cloud repository, you can actually make sense of it.
Absolutely. We already provide some very cool capabilities here. Say you have database data in RDS sitting in the cloud, and you want to see whether there was a record with "Dan" in my backup from two years ago. In the previous world, you’d have to take that backup from two years ago, figure out whether that database version is even still supported, and if not, migrate the data into a schema supported by a later version, then mount it, attach it to a compute instance, and run a query to see whether "Dan" exists in that database. Whereas with Clumio, you can just write that query today and it’ll go into all of your past backups and tell you whether that row exists. We call it the granular record recovery feature: you can retrieve a single record from anywhere in the past.
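The before-and-after Kumar walks through can be made concrete with a toy sketch. The in-memory `backups` list, its table layout, and the `find_record` helper are all hypothetical stand-ins for a schema-aware backup index; the point is simply that once backups are stored with their schema understood, a record search becomes a single query over every point in time, with no restore, mount, or version migration involved.

```python
from datetime import date

# Hypothetical stand-in for a schema-aware backup index:
# each entry pairs a backup date with the rows of one table as of that date.
backups = [
    (date(2019, 3, 1), [{"id": 1, "name": "Dan"}, {"id": 2, "name": "Ana"}]),
    (date(2021, 3, 1), [{"id": 2, "name": "Ana"}]),  # "Dan" deleted by now
]

def find_record(column, value):
    """Granular record search: return the dates of every backup whose
    copy of the table contains a row matching column == value."""
    return [taken for taken, rows in backups
            if any(row.get(column) == value for row in rows)]
```

The legacy workflow, by contrast, would be one restore-mount-query cycle per backup per database version, which is why this kind of lookup historically never happened.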
Do you think data protection is really more naturally a backup function than a cybersecurity function?
Cybersecurity is a pretty wide area by itself. But there are some adjacent areas that rely on the fact that you have access to the data because you’re backing it up. I also have access to a bunch of controls around the data: I know who is accessing the data, what kind of networking setups have been done in the firewall, and all of that. So all of the metadata around the data is also something you come to know as you back the data up.
What you’re saying is that there’s a set of data protection capabilities that are more naturally implemented once you have this centralized cloud backup repository. Could you explain one function for data protection that is better implemented using your public cloud centralized repository?
Ransomware. The way ransomware typically happens in any environment is that it’s a journey. A certain thing gets infected, and over time it spreads and starts encrypting more and more data. You’ll see anomalies in how the data changes. You might typically change 2, 3, 5% of your data in a day, but suddenly you’re changing 100% of the data, because the whole drive is being encrypted within the same 24 hours. As a backup solution, you can detect some of these things: because you’re backing the data up, you know how much of it has changed, and you can spot these anomalies using machine learning techniques. You can’t prevent the attack, but you can flag it and say, you are probably seeing some kind of ransomware attack festering in your environment.
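The anomaly Kumar describes, a daily change rate that suddenly jumps far above its baseline, can be sketched with a simple statistical test. This is a deliberately crude stand-in for the machine learning techniques he mentions; the z-score threshold and the flat-history guard are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's backup if its change rate (fraction of data changed,
    0.0 to 1.0) is a statistical outlier versus past daily change rates."""
    mu = mean(history)
    sigma = stdev(history) or 0.01  # guard against a perfectly flat history
    return (today - mu) / sigma > threshold
```

A run of 2-5% daily change followed by a near-100% day trips the flag. The backup service cannot stop the encryption, but it can surface the alert, which is exactly the division of labor described above.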
This is your second startup. You’ve worked a lot in larger companies. What have you learned building Clumio, as an engineer and as an entrepreneur?
This is at least my second time around as co-founder and CEO of a startup. At my previous startup, you’re learning on the job a lot more. This time, it’s a little different for me; it’s a lot more comfortable in some aspects, in terms of the learning part. If you look at my own career, it was built on everything that happened over the last 30, 40 years on the on-premises side. I worked at Oracle and VMware, and my previous company was also building very much an on-premises product. Going from that to building something in the cloud and delivering it as a service, with all the CI/CD behind the scenes so that we can upgrade the service the way Netflix or Facebook or any of the big consumer companies do, essentially without causing impact, has been a lot of learning for us over the last three years or so. Firstly, there are the economics of the public cloud: how do you build something in the cloud the right way without breaking the bank, both for yourself and for the customer, while delivering the solution at the right economics? There’s a lot of financial engineering that needs to happen in the public cloud. That has been a learning for us. It’s no different for a customer, if you think about it: as they move an application from on prem to the cloud, if they literally lift and shift, they will get sticker shock from how much the cloud costs. But if they refactor it the right way, or have the opportunity to build a new application in the cloud itself, then they take real advantage of the cloud, both in terms of services and economics. We did something similar: even though we didn’t have to move an application from on prem, we took an application that was historically built for the on-prem world, a backup application, and built it in the cloud.
And so there was a lot of learning involved for us in terms of how to leverage the cloud the right way.
How do you go about defining the surface area of the product? One of the things that needs to happen for a SaaS solution to work is commonality of requirements. The reason Salesforce has become so big is that, if you think of 100% of possible CRM functionality, most people want the same 20%. If you look around, there are all sorts of specialty CRMs for real estate agents, for country clubs, for all sorts of others. But they’re not nearly as big as Salesforce, because a huge number of people want that same 20%, and Salesforce has become really big delivering that common set of requirements. Most enterprise software in the past tried to encompass the whole 100% and let you configure and customize your way to what you needed. In the SaaS world, that’s not true. You have a much smaller set of configurability, and that’s how you can deliver the solution in a multi-tenant fashion, on the web. What have the struggles been to provide something that is not too narrow, so people can still configure it to their own use case, but also not too broad, so it doesn’t become too complex and fail to deliver on the SaaS promise?
The surface area of the problem we are after is also pretty wide. If you try to be all things to everyone, you’re not going to deliver in time, or you’re probably going to die before you deliver, or you’re not going to deliver with the right simplicity, which is very important for a SaaS solution like ours. But there’s a bunch of standardization you can count on in this world: my data sits in repositories that are pretty well known. If I’m on prem, I probably have a lot of VMware, and maybe I’m using VMware in the cloud too, with VMC on AWS. If I’m on AWS, I’m probably using some combination of S3, RDS, and EBS. For the most part, 99% of my data is probably in these data sources. So there are a bunch of rocks out there that you can focus on, and then there’s a lot of sand out there too. We are obviously focused on the rocks, especially the rocks that emerge as you move into the cloud, and also on the big SaaS applications. That helps in terms of focusing and, at least in the first few years, going and building stuff that customers will pay for and that will help them.