Activating Data Management: A Q&A with Commvault’s Patrick McGrath

Early Adopter Research’s Dan Woods is very familiar with Commvault, having followed and written about the company over the past few years. For a recent episode of the Early Adopter Research Podcast, he spoke with Patrick McGrath, Commvault’s Director of Product Management, about Activate, a new product that lets customers get a different kind of value from the data they have under management in Commvault. Commvault is an established backup and recovery company that has developed a deep and broad offering in the backup space.

Commvault recently reorganized its products to highlight the power of the platform. Complete provides all backup and recovery facilities, Orchestrate is about moving data, and HyperScale stores large amounts of data. The new Activate offering can be purchased as a standalone product or used with all of the other products. Commvault has a 4D index, which is a way of capturing metadata about the data that is protected. The 4D index allows companies to collect huge amounts of metadata about where data is stored, who has access to it, and what other data it is related to. Commvault abstracts the data from the storage mechanism. This is very useful for backup and recovery because it means that companies can recover data to a different format than the one it was backed up in.

This is an edited Q&A of the conversation that Woods and McGrath had for the podcast.  

Woods: What is the rationale for creating Activate?

McGrath: I used to manage a data center and storage teams and look a lot at infrastructure, where I thought about data as blobs: moving those blobs around, optimizing those blobs, and trying to save money on those blobs where I could. But as we start taking a look at Activate, the thing that really excites me is that we are moving from dealing with those abstract blobs of data to applying intelligence to the workflows we support, to do something that is very outcome focused.

With Activate this year, we repackaged much of the Commvault platform to make it a lot easier to buy and understand. Commvault Activate actually has a number of different components inside it. The first, foundational component is the Activate platform itself, which is abstracted from the Commvault data platform. That allows us to do things like indexing data, sourcing it from backups and archives that may already be under Commvault management, but also directing that content indexing at data sources that we currently don’t manage.

Commvault is a data management, data protection, backup and recovery company. Activate is now going to be used to help with data management use cases, and some of the first coming along involve file optimization: looking at where all your files are, seeing what’s duplicated, and then determining how you can make better use of the storage you have. Another is managing sensitive data and finding where personal information is stored, so that you can comply with regulations. But what I want to know is how we can get business signals from the backup repository or, if not just from the backup repository, from Activate itself. It is interesting to me that a lot of the goals we have been trying to achieve with other technologies, such as data warehouses or data lakes, we now seem able to reach with a lot less work using an approach like the one Activate takes.

Commvault Complete, which is our backup and recovery offering, uses our 4D index to capture details about the files themselves: what kind of files they are, what their names are, where they are stored, what controls are placed on them. With Activate we build a few more dimensions on top of that. The first dimension is the content of the file. That’s really important because it lets us do the next piece, which is asking whether there is information about the file that we can use for other purposes. This is really where we get into entity extraction and classification, so that we can start defining classes of data that we can treat differently.

And then the last dimension of the 4D index is one where we leverage AI technologies, including machine learning and NLP, to enrich that classification so that the information about the data itself is much more accurate and can be extended to automation use cases and so on.
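To make those four dimensions concrete, here is a minimal sketch of what a single index entry might carry. The field names and Python structure are invented for illustration; they are not Commvault’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    # Dimension 1: technical metadata already captured by backup and recovery
    name: str
    location: str
    file_type: str
    access_controls: list[str]
    # Dimension 2: the content of the file (e.g. extracted full-text terms)
    content_terms: list[str] = field(default_factory=list)
    # Dimension 3: extracted entities and the data classes assigned to them
    entities: dict[str, list[str]] = field(default_factory=dict)  # e.g. {"email": [...]}
    classifications: list[str] = field(default_factory=list)      # e.g. ["personal-data"]
    # Dimension 4: ML/NLP enrichment that refines the classification
    enrichment: dict[str, float] = field(default_factory=dict)    # e.g. {"pii_confidence": 0.92}
```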

To make good decisions about what data should be kept, what its performance characteristics should be, how it should be architected, and what implications that has for availability, you actually need a lot more information about the files themselves and their contents. An example might be that you have a bunch of data in your file storage system and you find out there is a lot of duplicated data, and those extra copies aren’t being accessed.
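That duplicated-data example can be illustrated with a rough sketch like the one below, which groups files by content hash to surface extra copies. It is a generic illustration, not Commvault’s implementation, and the scanned path is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by content hash; any group with more than one
    entry is a set of duplicate copies that can be reviewed or tiered out."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # For very large files you would hash in chunks rather than one read.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

for digest, copies in find_duplicates("/data/fileshare").items():  # hypothetical path
    print(f"{len(copies)} copies:", *map(str, copies))
```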

Or companies might have different tiers of data where they have data that is accessed a lot, data that is accessed less, and data that is hardly accessed at all, and they can now move those into the appropriate tier with the appropriate cost, correct?

Yes. I think a lot of that information can actually be determined based on basic metadata that, for example, Commvault Complete would capture for us. But imagine your IT organization coming to you and saying, we want to save two million dollars a year by putting a two-year policy on all your emails; anything over two years, we want to ditch. That is all well and good until you look at it from a legal or tax perspective. Those departments would say you want to save two million dollars, but you want to open up the possibility of a $20 billion compliance and legal exposure by getting rid of the wrong stuff. What we often find with unstructured data especially, and unstructured data is the vast bulk of data under management, is that people often lump valuable or sensitive information in with stuff that’s not so valuable or sensitive.
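The two-year email example comes down to a deletion decision that has to consult classification, not just age. A minimal sketch of that logic, with invented class names and thresholds:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=2 * 365)
# Classes an age-based policy must never delete (illustrative, not exhaustive)
HOLD_CLASSES = {"legal-hold", "tax-record", "contract", "personal-data"}

def eligible_for_deletion(last_modified: datetime, classifications: set[str]) -> bool:
    """Age-based retention that defers to legal and compliance classifications."""
    old_enough = datetime.now(timezone.utc) - last_modified > RETENTION
    return old_enough and not (classifications & HOLD_CLASSES)

# An old email carrying a tax-record classification survives the two-year purge:
print(eligible_for_deletion(datetime(2015, 1, 1, tzinfo=timezone.utc), {"tax-record"}))
```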

Activate can be used in two ways. One is with the Commvault backup system, in which case you actually have access to the data itself. Or you can use Commvault Activate on its own, as a standalone application, in which case you only have access to the metadata index. What’s the difference in your ability to extract business signals between those two ways of using Activate?

If a customer is using Commvault Complete backup and recovery on a section of their data, they will have a whole bunch of the technical metadata associated with those files already in the index. To really understand what’s in there, that data has to be content indexed; in other words, we have to go through and actually understand what’s really in those files. But if we are dealing with an organization that, for example, may use us for backup on their servers but not on their endpoints, we’ve got the ability to do the crawl, the indexing, and the content indexing from those unmanaged data sources. We’re finding that a number of companies naturally assume they have to have our backup software to access the abilities of Activate, but more and more we are talking with companies, particularly as issues like GDPR have come up, that say, look, we know you are not managing backups for our endpoints at the moment, but endpoints are a significant threat vector for us, so let’s do some representative analysis on a section of our endpoints, we’ll provide you with some sample data, and let’s see what we end up with.
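The “representative analysis on sample data” could look roughly like the sketch below: crawl an unmanaged path, take a reproducible sample of the files, and push only that sample through content indexing. The sampling rate, the mount point, and the content_index call are assumptions for illustration.

```python
import random
from pathlib import Path

def sample_endpoint_files(root: str, rate: float = 0.05, seed: int = 7) -> list[Path]:
    """Take a reproducible random sample of files from an unmanaged source."""
    rng = random.Random(seed)
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    k = min(len(files), max(1, int(len(files) * rate)))
    return rng.sample(files, k)

# Each sampled file would then go through the same content-indexing pipeline
# used for managed sources, e.g.:
#   for path in sample_endpoint_files("/mnt/endpoint-share"):  # hypothetical mount
#       content_index(path)                                    # hypothetical function
```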

But from an analysis point of view, there are times when I’m going to want to extract a signal and I’m going to need to actually access that data itself. Is there a difference between my ability to do that when I’m using Activate in a standalone way and when I’m using Activate with Commvault Complete?

Once you get past the data sourcing question of where to apply these analytics, you’ve got the ability to start building additional dimensions into your indexing, so you can re-index the data. The typical workflow for Activate would be to define what kind of entities to start with, put those entities into a data classification plan, define what the data universe looks like, which could comprise backed-up data as well as live data, and then run the analytics. It’ll run through the search and do all of the entity extraction and classification from that point. Now, at any stage you could decide to add additional entities to look for. Some of those entities could be powered by regular expressions, AI, or external APIs, and you’ve got the ability to reapply content indexing, which will invoke those operations.
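Read as pseudocode, that workflow (define entities, group them into a classification plan, define the data universe, then run the analytics) might be sketched as follows. The names, URIs, and regex-based entities are illustrative, not Activate’s actual API, and the crawler is a stand-in.

```python
import re
from dataclasses import dataclass

@dataclass
class EntityDef:
    """An entity to look for; regex-based here, but it could be ML- or API-powered."""
    name: str
    pattern: re.Pattern

# 1. Define the entities to start with
entities = [
    EntityDef("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    EntityDef("email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
]

# 2. Group them into a data classification plan
classification_plan = {"personal-data": entities}

# 3. Define the data universe: backed-up data plus live sources (illustrative URIs)
data_universe = ["backup://fileserver-01", "live://finance-share"]

def crawl(source: str):
    """Stand-in for a real crawler; yields the text of documents found at a source."""
    yield from ["contact: jane@example.com", "nothing sensitive here"]

# 4. Run the analytics: crawl each source, extract entities, classify
def run_analytics(universe, plan):
    for source in universe:
        for text in crawl(source):
            for label, defs in plan.items():
                hits = {e.name: e.pattern.findall(text) for e in defs}
                if any(hits.values()):
                    yield source, label, hits

for source, label, hits in run_analytics(data_universe, classification_plan):
    print(source, label, hits)
```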

So if I have Commvault Complete, the data is already there, and the crawling and indexing might be a little easier because you don’t have to go to the source, but if you’re using Activate standalone, then you have to go to the source and access the data there?

Right. Part of it is what level you are looking at it from. If we’ve got data that’s already in our platform as the result of an archive or a copy, we can work from that copy. When we do indexing for data that is on a live device, we’re not collecting the raw files and putting them into a repository. What we are doing is taking the full index of that asset and putting it into our index. We’re scanning the data, applying whatever algorithms you want, and then the data doesn’t get stored. The difference is that with Commvault Complete you may be scanning the backed-up copy, so the impact on the environment might be a little less, and you can rescan as many times as you want without any operational impact.
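The key point, that scanning a live device adds entries to the index without retaining the raw files, can be sketched like this; the derived fields are illustrative stand-ins for the real analysis.

```python
import hashlib
from pathlib import Path

def index_live_file(path: Path) -> dict:
    """Scan a file on a live device and return only an index entry.

    The raw bytes are read once for analysis and never written anywhere;
    only the derived metadata goes into the index."""
    data = path.read_bytes()
    return {
        "path": str(path),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "term_count": len(data.split()),  # stand-in for real content analysis
    }
```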

It seems like with this process, whether you use Commvault Complete or not, you have essentially what many people have been looking for from the idea of a data lake. You have access to a large repository, the ability to apply different algorithms, and the ability to extract different entities from it and then use them. So what does it look like from a developer point of view when you are looking at Activate as a way to collect all of these signals and start using them to create an application?

We have a built-in toolkit with Activate, and we’ve exposed some of it as usable tools right now, so we can create those workflows with that toolset. But take scenarios where we’re building in a lot of customized analytics: if we had documents with embedded pictures in them, we might do full-text search as the first content indexing pass, but we may not have run a deep learning algorithm to understand what’s in the pictures embedded in those documents. That would require a second pass. For those sorts of scenarios, we’re working with our development team so that we can extend the platform much further for those kinds of development use cases.

If I’ve got Commvault Complete there, I have access to the data itself. And then as the developer, I am going to be adding those entity extractions, those advanced algorithms that may get used during the crawl. And once I have all that back, I now have it in a repository, the four-dimensional index, which becomes the foundation for my application?

Exactly, Dan. And we’ve got a really good example of that at the moment with Lucidworks, where we’re doing a lot of joint development of the AI within the Commvault platform. And we’ve got a great example of the approach that we want to take with our ecosystem with a company called Brainspace, which has created human-augmented intelligence methodologies combined with some beautiful data visualizations that support the early parts of the e-discovery cycle. So out of a 20-million-email set you may say you want to refine down to a set of maybe two million emails. We can use our tools to create that set, and that set is passed across to Brainspace through the APIs in that integration.
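Mechanically, that handoff amounts to selecting a refined subset from the index and passing document identifiers to the downstream tool. The sketch below is a hypothetical illustration, not the documented Brainspace integration; the matcher and export hook are invented.

```python
def matches(doc: dict, query: dict) -> bool:
    """Naive field-equality matcher; a real refinement would query the 4D index."""
    return all(doc.get(k) == v for k, v in query.items())

def refine_and_handoff(index: list[dict], query: dict, export) -> int:
    """Select a refined subset of documents and hand its IDs to a downstream tool."""
    subset_ids = [doc["id"] for doc in index if matches(doc, query)]
    export(subset_ids)  # e.g. a connector that feeds the e-discovery tool
    return len(subset_ids)

# Example: refine an email index by custodian, then hand off the resulting IDs
emails = [{"id": 1, "custodian": "alice", "year": 2016},
          {"id": 2, "custodian": "bob", "year": 2017}]
refine_and_handoff(emails, {"custodian": "alice"}, export=print)
```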

So the idea is that it is a two-way street.  You can use the advanced technology of Brainspace or Lucidworks to extract and learn from the index.  You can then do things inside that advanced technology, to get more signal, and then you can store that signal back into the 4D index?

That’s absolutely correct, and the concept there is to create an intelligence hub around your data. Of course, that intelligence is going to be relative to the specific use cases you’re applying it to.

The signals get stored inside the 4D index and that becomes the master repository and then that can be moved to wherever it needs to be moved to do whatever you want to do?

Absolutely. We’ll be using other kinds of signals too. For elements that are frequently searched or frequently accessed, we use those usage signals, not signals derived from the asset itself but signals about how that data is used, to help do things like refine relevancy against the data gathered from the rest of the organization.
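One way to picture that is a relevance score that blends the content match with a usage signal such as access frequency. The weighting below is an invented illustration, not Commvault’s ranking logic.

```python
import math

def relevance(content_score: float, access_count: int, usage_weight: float = 0.3) -> float:
    """Blend a content-match score with a log-scaled usage signal."""
    usage_score = math.log1p(access_count)
    return (1 - usage_weight) * content_score + usage_weight * usage_score

# A frequently accessed document edges ahead of an otherwise identical one:
print(relevance(0.8, access_count=50), relevance(0.8, access_count=0))
```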

What is motivating the partnerships that you are developing?  Why are you working with Lucidworks, with Brainspace?

What’s in it for us is we gain access to new use cases, new conversations, new personae inside an organization.  We can have discussions about the use of the data that we’ve never really had before, because often we get pigeonholed into a backup/recovery type of discussion.  That doesn’t mean a lot to somebody who is actually trying to upgrade their customer experience program, for example. So if you take a look at it from our perspective, it opens the doors to a lot of additional applications for the data that we are managing.  

From a Lucidworks or Brainspace perspective, data environments are heterogeneous, with a lot of different data sources that have proprietary APIs and so on. Coming from the backup and recovery pedigree, we provide a way in: we’ve already gathered huge amounts of data for a process an organization should be doing anyway, which is protecting and safeguarding its data. If we’ve already got a huge amount of that information, we can augment it with analyses of data that isn’t held within those data protection strategies, and that becomes a broad source of data for these conversations about getting value out of an organization’s data.

All these products that are based on AI or ML algorithms can work much better because the fuel they need is more easily accessible?

Yes. And the fuel’s a really interesting thing. If you take a look at GDPR and the proliferation of those kinds of data privacy regulations, when you start looking at what is defined as personal data and where in your organization that could be, it’s all over the place. Maybe not credit card information, which hopefully is siloed in a particular area, but when you start talking about information about people, our organizations are tuned to deal with people. So you’re going to have stuff in your marketing area and your sales area, and finance could be a place where you’ve got information about partners and billing and all those kinds of things. You’ve got information that’s on laptops and information that’s in SaaS systems for CRM and HR. This personal data is all over the place. So if you’re taking a look at a use case like that, where your threat is so broad, then of course you need really broad access to the data. You need to be able to figure out where those pockets are. You need to be able to take action on them proactively to try to ensure that the bad side of the risk doesn’t happen. But you also need to be able to take action on those elements fairly quickly, which means you need to know what you’re looking for, you need to know where it is, and you need to know who owns it so that decisions can get made.
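Acting quickly on findings like that means organizing them around what was found, where it lives, and who owns it. A minimal roll-up sketch, with invented field names:

```python
from collections import defaultdict

def pii_rollup(findings: list[dict]) -> dict:
    """Group PII findings by data source and owner so remediation can be assigned."""
    grouped: dict[tuple, list[str]] = defaultdict(list)
    for f in findings:  # each finding: {"entity": ..., "source": ..., "owner": ...}
        grouped[(f["source"], f["owner"])].append(f["entity"])
    return dict(grouped)

print(pii_rollup([
    {"entity": "us_ssn", "source": "laptop-042", "owner": "hr@example.com"},
    {"entity": "email", "source": "crm-saas", "owner": "sales@example.com"},
]))
```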

How will you know when Activate has succeeded?  

The obvious metric is adoption. We’ve got an interest in exposing existing customers to it so they can really extend the use of their current investments in Commvault. But I think there are a number of customers that are really interested in it based on short-term, high-priority initiatives that they have. We’re finding that there are competing products on the market that are quite specialized in their use case. And once that itch is scratched for the short-term value they are looking for, people start saying, hey, we could actually be doing this other operation with this kind of broad access to the data. We’ve already seen some of that: customers that have taken a look at Activate before a cloud migration to clean up their data so they aren’t taking on huge costs, and customers looking at hardware upgrade avoidance so they can put those savings toward more strategic work. I think one of the things we will be doing a lot more over the next year is talking to our customers about this progression from short-term wins, whether in the efficiency space or the risk and compliance space, to seeing where else they could apply these kinds of tools.