The Gist of the Mission

I first heard the term data lake in 2011 at a meetup where James Dixon, CTO and founder of Pentaho, which is now a part of Hitachi Vantara, explained the concept. At that point, Dixon used the term data lake to explain what was needed to allow management and analysis of big data now that Hadoop had arrived and made it affordable (see “Big Data Requires a Big, New Architecture”).

Since then, Hadoop has come and gone as the star of the big data show, but the idea of the data lake has persisted for the reasons Dixon pointed out. We need a way to store large amounts of data, manage it and make it accessible, process it in new ways, and deliver the results in production.

Since the start of this journey, hard-to-code MapReduce batch jobs have given way to the superior YARN framework, Spark has risen as an important new processing platform that is much more application friendly than Hadoop, and the cloud providers are all hard at work building platforms explicitly designed to support what I call data supply chains.

As I pointed out in this article in 2013, Hadoop was on its way to becoming an embedded system. That process is now complete. The Hadoop vendors, Cloudera, Hortonworks, and MapR, don't talk about Hadoop anymore. They are all busy attempting to build larger product visions around core capabilities for big data processing. Hadoop is in there somewhere, but the elephant is nowhere to be seen.

That’s where we are now stuck. The latest wave of data lake implementations has not been broadly successful at making big data an asset that informs a wide swath of business decision-making. There have been major victories with respect to using Hadoop to build data supply chains to provide high value data to applications and to create certain types of data marts. And many new applications of AI are continuing to provide value using special purpose data supply chains.

But being able to build, with great effort, advanced data pipelines is not what we want or need from a data lake. What we need is a way to store and manage big data and allow as many people as possible to access it, distill it, analyze it, and find important signals in it.

A clear vision of a data lake that is as well defined and operationalized as the data warehouse has eluded us.

This research mission is about defining a complete vision for a data lake, one that includes the pieces missing from the first wave of implementation.

Our goal is to find a way to take a data lake from a stalled or unproductive state to playing a central role in making big data useful to support high value use cases that have a major impact on business results.

The sustainable advantage that this research mission seeks to describe is the full implementation of the data lake, which performs as originally intended and overcomes some of the challenges encountered by businesses during the first wave of data lake implementation.

In our view, the data lake remains an excellent idea for solving enterprise data needs, but only if it is more fully conceived. It must be able to provide a larger and more comprehensive environment for data ingestion and management. Additionally, data lakes need to offer support for self-service, in which a broad range of users, including those without high-level technical skills, have the tools to solve their own problems.

This research mission focuses on how companies can craft a vision that moves beyond the origins of most data lakes to what they now must do in order to be successful.

The Original Vision for the Data Lake

The original energy behind the data lake sprang up because Hadoop showed that it was possible to process and store big data in nearly limitless amounts in an affordable way, using new computing paradigms. The general idea was that companies could:

Later in the development of Hadoop and the data lake concept, another goal was added:

The problem is that these general concepts about data lakes never turned into mission-critical computing infrastructure for the majority of enterprises. And data lakes have not radiated value the way data warehouses, despite their flaws, do.

As pointed out above, this does not mean that valuable data pipelines weren't created. In AdTech and at many web-scale companies, for example, MapReduce programs were created to deliver specific data sets that were used for reporting and analytics and to power apps.

The process of creating data pipelines was made simpler by technologies such as Netflix's Genie, Cascading, and many other systems. Almost every ETL vendor, such as Syncsort and Talend, along with related vendors like Attunity, figured out how to run its technology on a Hadoop cluster.

But the whole picture has not come together as a product in the sense that CRM, ERP, or data warehouses have become products.

Future installments of this research mission will focus more on what specific technologies succeeded in the first data lake era and what we learned about how the open source ecosystem works in a multi-vendor world with lots of innovation in related projects. But for now, we want to look at what went wrong and how to fix it.

Common Failure Modes of Data Lakes

Some common problems occurred during the first wave of data lake implementations. We need a name for this wave, and it seems natural, if a bit tired, to call it Data Lake 1.0.

By understanding what went wrong in Data Lake 1.0, we can better put forth a complete vision for Data Lake 2.0.

According to our research, including conversations with vendors and practitioners, there are a variety of failure modes that have been encountered so far.

Our distillation of the most common failure modes is presented below. (If you know of new ones or have comments on these, please reach out to us.)

Failure Mode: The Data Swamp

In Data Lake 1.0, the data swamp occurred when more and more data was added to the data lake and the methods for tracking it broke down, so only tribal knowledge guided people to the data they needed.

The data swamp failure mode occurs when a large amount of data is stored in the data lake but the naming conventions, folders, and other informal ways of keeping track of the files containing the data and where they came from break down.

The data lake becomes a messy file system: the wrong data is used, data cannot be found, or data is stored many times over.

The remedy is a data catalog that:

Failure Mode: Dirty and Unorganized Data

In Data Lake 1.0, all sorts of new data arrived in widely varying states of organization and quality. The lack of self-service tools for inspecting, cleaning, and reorganizing the data slowed down the process of creating value.

It is always possible to figure out what is in a data set using advanced tools like R or Python, and then to use the same tools to clean the data up and extract important signals. When this process is too hard, you get a severely bottlenecked data lake that may have a huge amount of information that is trapped because too few people possess the skill set to understand it.
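
To make the bottleneck concrete, the kind of specialist work involved often looks like the following minimal pandas sketch. The file name, column names, and cleaning rules are purely hypothetical; the point is that each step currently requires someone fluent in these tools.

    import pandas as pd

    # Load a raw extract from the lake (path and schema are hypothetical).
    df = pd.read_csv("raw/customer_events.csv")

    # Inspect what is actually in the data set.
    print(df.dtypes)
    print(df.isna().mean())   # share of missing values per column
    print(df.nunique())       # cardinality of each column

    # Clean it up: parse timestamps, drop exact duplicates,
    # and standardize an inconsistently coded column.
    df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
    df = df.drop_duplicates()
    df["country"] = df["country"].str.strip().str.upper()

    # Keep only rows that survived timestamp parsing and publish the result.
    clean = df.dropna(subset=["event_time"])
    clean.to_parquet("curated/customer_events.parquet")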

To make a data lake work, you must scale the team that can understand the data, clean it up, find important signals, and then package the data so that the data itself becomes a product.

The remedy is to use one of the latest generation of data prep tools, which address this need in the following ways:

With such data prep tools, as more data lands, more people can work on making it useful.

Failure Mode: Lack of Self-Service Analytics

In Data Lake 1.0, data was often exploited only after data engineers, data scientists, or developers created a useful data set that was then delivered to a broader set of users through familiar tools. The lack of self-service analytics tools to allow a wide range of people to explore big data added to the bottleneck.

Solving the data prep problem is part of taking away a bottleneck, but a second related challenge involves allowing the prepared data to be put to use in ways that can provide useful signals to business processes.

This is a tricky issue because there are a lot of moving parts. Here’s how it often gets solved.

Stage 1: Delivering Data Packages to Existing BI and Data Discovery Tools

It takes a long time for an organization to adopt and learn how to use a BI or data discovery tool, whether Tableau, Qlik, TIBCO Spotfire, or any other.

So it certainly makes sense that in the first stage of a data lake, packages of data with valuable signals are delivered to BI tools, which are already in front of users.

This model allows the data lake to become operationalized and to provide new value, with new signals that help a business increase awareness and make better decisions.

This model can work even better when users can use data prep tools to create their own data packages. When this happens, the real value of self-service starts to be unlocked.

Stage 2: Enabling Data Discovery inside the Data Lake

But a problem exists with the model in which data packages are delivered. It is a two-step dance that interrupts the ideal user-driven innovation process. As we will examine later, it also obscures some of the value of big data.

The most productive way to use the power of the data in a data lake is to use an analytics tool that allows direct access to that data. In this way, the detail and richness of the big data is still available during the exploration and discovery process. In the model in which data packages are sent to existing BI tools, the high resolution detail of the big data is not available.
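
As a hedged illustration of the difference, consider a hypothetical clickstream stored in the lake. With direct access, an analyst can ask a question of the raw, event-level detail rather than of a pre-aggregated extract (the path, table, and column names below are assumptions for the sketch):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("direct-access-sketch").getOrCreate()

    # Raw, event-level data in the lake (path and schema are hypothetical).
    events = spark.read.parquet("s3://lake/clickstream/")

    # A "big data question" asked directly of the detail: how does
    # session depth differ by device type for a single campaign?
    answer = (
        events
        .filter(F.col("campaign_id") == "spring_launch")
        .groupBy("device_type", "session_id")
        .agg(F.count("*").alias("events_per_session"))
        .groupBy("device_type")
        .agg(F.avg("events_per_session").alias("avg_session_depth"))
    )
    answer.show()

A pre-aggregated package shipped to a BI tool would typically have collapsed session_id away already, making a question like this unanswerable.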

As we will cover later, this model allows analysts to make more use of the big data, that is, to ask and answer big data questions.

In addition, the types of systems that allow analysts to have direct access to big data usually offer a much wider set of analytics, AI, and machine learning algorithms for making sense of the data, so that new types of signals can be harvested.

Alternative Approaches to Preserving Direct Access to High Resolution Big Data

It is also possible to preserve the ability to use the high resolution data from a data lake in other ways.

Failure Mode: Custom-made Analytics with No Path to Operations

In Data Lake 1.0, great stuff was created to process data and find signals using advanced analytics. Often, however, these jobs needed to be curated and run by the data science teams and could not go into production. When they did, there was no clear model for maintenance and retraining of the algorithms.

Once you have created an environment for self-service, predictive models and other ways of finding important signals should ideally be packaged and made available for use.

Without such packaging:

The best situation is one in which:

Failure Mode: Lack of Tools for Packaging and Modeling to Create Data Products

In Data Lake 1.0, it often seemed that every effort started from scratch because previous projects had few ways to package data for reuse. The ability to create data models and publish data sets that reflected them was also lacking. Both capabilities support the productization of data, which accelerates progress.

Data warehouses are often derided because they are slow to evolve and hard to change to meet new needs. But one of the reasons that data warehouses are this way is that often a huge amount of time is spent modeling the data.

When a data model is successful, it creates a commonly understood language for talking about the business.

Data lakes need data models for the same reason: to create a common language. When a data lake has no way to express a model, or to explain how a data set was created so that it can become commonly understood, it is as if each analyst is starting from scratch each time.

When a model can be created and gradually expanded, the result is a set of reusable data that can be the foundation for a common understanding and productivity.

Data lakes need ways to allow data to be packaged into reusable components, to have the model described, and eventually to have the data become a way of understanding the business or of meeting an important need. I call this level of quality, understanding, and usefulness a data product.

Failure Mode: Lack of Data Governance and Security

In Data Lake 1.0, often the idea was that everyone would have access to all the data. This imitated wide data access policies at companies like Google. But it didn’t take long for this model to break down and for governance and security to be needed. Often, Data Lake 1.0 capabilities were not sufficient.

Data lakes were often conceived as catch-all repositories. Lots of data started being stored there and not a lot of thought was given to data governance, security, and appropriate access. This sort of problem often happens in the early days of a data warehouse as well.

To really become part of an enterprise data infrastructure, a data lake must have a capability for controlling access and auditing usage of data.

In its most advanced form, such support includes distributed control of data sets so that the owners can define the terms of use through data sharing agreements that can be executed between parts of the organization when one group wants to use data belonging to another group.

Without this capability, unpleasant surprises often occur in which sensitive information is widely exposed. Alternatively, certain important classes of data cannot become part of the data lake.
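
To make the idea of a data sharing agreement concrete, here is a minimal, purely illustrative sketch in Python. It is not any vendor's API; it simply shows access being checked against terms that a data set's owning group has declared, with all names hypothetical.

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class DataSharingAgreement:
        dataset: str
        owner_group: str
        consumer_group: str
        allowed_purposes: set = field(default_factory=set)
        expires: date = date.max

    def access_allowed(agreements, dataset, group, purpose, today=None):
        """Return True if some unexpired agreement covers this request."""
        today = today or date.today()
        return any(
            a.dataset == dataset
            and a.consumer_group == group
            and purpose in a.allowed_purposes
            and today <= a.expires
            for a in agreements
        )

    # Example: marketing may use the orders data set for churn analysis only.
    agreements = [
        DataSharingAgreement(
            dataset="orders",
            owner_group="finance",
            consumer_group="marketing",
            allowed_purposes={"churn_analysis"},
            expires=date(2026, 12, 31),
        )
    ]
    print(access_allowed(agreements, "orders", "marketing", "churn_analysis"))  # True
    print(access_allowed(agreements, "orders", "marketing", "ad_targeting"))    # False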

Failure Mode: Sticking with Small Data Thinking and Practices

In Data Lake 1.0, the data lake was sometimes used as if it were a data warehouse, because people did not understand the power of big data and did not have tools to explore the data directly.

When everyone using a data lake is constrained by small data thinking, a data lake becomes far less useful than it should be.

In essence, small data thinking means using a data lake as if it were a data warehouse. In such a scenario, many of the following things may occur:

Failure Mode: Perpetual Proof of Concept

In Data Lake 1.0, for many of the reasons cited, valuable data pipelines often never found their way to reliable, supported production maintained by an operations staff.

Too often, data pipelines created in a data lake are not easily moved into production and supported:

Failure Mode: The “One Data Lake to Rule Them All” Vision

In Data Lake 1.0, the legitimate excitement about having a huge, affordable repository led to the idea that all needed data could be stored in one place. Even if that could be done, there will always be new repositories outside the data lake that need to be included. The vision sometimes reduced focus on providing federated access to many repositories, a needed capability in most companies.

The ability of the data lake to store all sorts of data at volume led to a renewed interest in one of the holy grail goals of the data warehouse: To have one repository to rule them all.

While the idea of having all data available in one repository is alluring, as a practical matter it is almost impossible.

So, the goal of having a data lake with all available data becomes a failure mode when it prevents, or distracts focus from, implementing some form of federated access to the wide variety of repositories that haven't yet found their way into the data lake.

Data Lake 2.0: A Vision for Building the Best Possible Data Lake

Since the data lake was first conceived, the thinking about what it could become has gotten more precise and detailed. I believe that enough new ideas have emerged since the initial excitement about putting big data to work using Hadoop that we can legitimately talk about a new, more complete vision for a data lake. We might as well call it Data Lake 2.0 because everyone will know what we mean.

The vision we present for Data Lake 2.0 is a catchall of new capabilities that are being implemented to extend the power of the data lake. It would never make sense to implement all the capabilities we describe; indeed, some of them are contradictory or serve only special needs. But reviewing the list should give even the most experienced student of the data lake concept some new ideas.

Data Lake 2.0 Capabilities That Break New Ground

The capabilities we are going to describe all break new ground in that they were generally not considered in a mature form in the initial vision and implementations of the data lake, Data Lake 1.0.

Data Lake 2.0 Capability: Tracking Data Usage to Support an Agile Data Productization Process

Data Lake 2.0 should include a way to monitor usage so that “market” signals can be used to solve problems and guide investment.

When data lakes become an important center of activity, it is vital to track usage so that it is possible to understand what data is becoming important. With usage data, many things are possible such as:
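
Whatever specific uses an organization chooses, the mechanism underneath can be simple: record who touched which data set and when, then summarize. A minimal sketch, with all names hypothetical:

    from collections import Counter
    from datetime import datetime, timezone

    usage_log = []  # in practice this would be a table or an event stream

    def record_usage(dataset, user, action):
        """Append one usage event; a real system would persist this."""
        usage_log.append({
            "dataset": dataset,
            "user": user,
            "action": action,  # e.g. "query", "download", "join"
            "at": datetime.now(timezone.utc),
        })

    record_usage("orders", "ana", "query")
    record_usage("orders", "raj", "query")
    record_usage("clickstream", "ana", "download")

    # "Market" signals: which data sets are attracting attention?
    popularity = Counter(event["dataset"] for event in usage_log)
    print(popularity.most_common())   # [('orders', 2), ('clickstream', 1)]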

Data Lake 2.0 Capability: Federated Tracking and Access to Repositories Outside the Data Lake

Data Lake 2.0 should include the ability to track and access data in repositories outside the data lake.

Even if you want “one repository to rule them all,” it will take a while to get there. And there will always be data in repositories that you probably don’t want to move into a data lake. So, it is vital to have the ability to:
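
One common way to provide that ability, sketched here under the assumption of a Spark-based lake and an external PostgreSQL system (connection details and names are placeholders), is to read the outside repository in place and join it with data already in the lake:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("federated-sketch").getOrCreate()

    # Data already in the lake.
    customers = spark.read.parquet("s3://lake/customers/")

    # A repository outside the lake, read in place rather than copied in.
    invoices = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://erp-host:5432/erp")   # placeholder
        .option("dbtable", "billing.invoices")                  # placeholder
        .option("user", "reporting")
        .option("password", "***")
        .load()
    )

    # A federated question spanning both systems.
    customers.join(invoices, "customer_id").groupBy("region").count().show()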

Data Lake 2.0 Capability: Automatic Metadata Capture (Including Usage Data) for Data Catalog

Data Lake 2.0 should be able to capture as much metadata as possible about data sets loaded into it or data sets that are being tracked.

Data Lake 2.0 Capability: Automatic Data Profiling for Data Catalog

Data Lake 2.0 should have data profiling information automatically added to the data catalog.
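
A rough sketch of what automatic profiling might compute for each newly landed data set, shown here with pandas and hypothetical names, before the result is written into the catalog:

    import pandas as pd

    def profile(df: pd.DataFrame) -> dict:
        """Compute simple per-column profile statistics for a catalog entry."""
        return {
            "row_count": len(df),
            "columns": {
                col: {
                    "dtype": str(df[col].dtype),
                    "null_fraction": float(df[col].isna().mean()),
                    "distinct_values": int(df[col].nunique()),
                    "sample_values": df[col].dropna().head(3).tolist(),
                }
                for col in df.columns
            },
        }

    # Example: profile a data set as it lands and attach the result
    # to its (hypothetical) catalog entry.
    df = pd.read_parquet("curated/customer_events.parquet")
    catalog_entry = {"name": "customer_events", "profile": profile(df)}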

Data Lake 2.0 Capability: Automated Semantic Linking Based on AI and ML Analysis

Data Lake 2.0 should have the ability to analyze the content and semantics of data sets in order to find relationships between them.
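
Real implementations typically rely on ML techniques such as learned embeddings over names and values. The sketch below is deliberately simpler, using only column-name overlap between two hypothetical data sets to suggest a possible relationship, but it illustrates the kind of signal such analysis produces:

    def name_similarity(cols_a, cols_b):
        """Jaccard similarity of two column-name sets, a crude linking signal."""
        a, b = set(c.lower() for c in cols_a), set(c.lower() for c in cols_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    orders = ["order_id", "customer_id", "order_date", "total"]
    shipments = ["shipment_id", "order_id", "customer_id", "shipped_date"]

    score = name_similarity(orders, shipments)
    if score > 0.3:
        print(f"possible link between data sets (score={score:.2f})")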

Data Lake 2.0 Capability: Collaborative, Curated Data Catalog

Data Lake 2.0 should have a data catalog that allows human curation and supports collaboration.

Data Lake 2.0 Capability: Support for Streams, NoSQL, Graph, and Other Repositories

Data Lake 2.0 should support a wide variety of repositories to deliver and operationalize the data.

Data Lake 2.0 Capability: Support for Data Fabric Access

Data Lake 2.0 should allow a portion of the data in the data lake to be used through a data fabric as an operational data set to support applications and use of data, and not just exploration and data science.

Data Lake 2.0 Capability: Support for Advanced Big Data SQL

Data Lake 2.0 should provide access to the most advanced big data SQL engines and the extended capabilities they offer.
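
A hedged example of the kind of extended capability these engines offer, using Spark SQL over a hypothetical events table: window functions and approximate distinct counts that would be painful or impossible in many traditional BI extracts.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("big-data-sql-sketch").getOrCreate()
    spark.read.parquet("s3://lake/clickstream/").createOrReplaceTempView("events")

    spark.sql("""
        SELECT
            device_type,
            event_date,
            approx_count_distinct(user_id) AS approx_users,
            SUM(COUNT(*)) OVER (
                PARTITION BY device_type ORDER BY event_date
            ) AS running_events
        FROM events
        GROUP BY device_type, event_date
    """).show()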

Data Lake 2.0 Capability: Support for Interactive Big Data Analytics

Data Lake 2.0 should provide analysts with direct access to the big data through a set of high-powered self-service tools.

Data Lake 2.0 Capability: Support for Packaged, Interactive ML and AI

Data Lake 2.0 should allow self-service users, as well as data scientists and analysts, to employ packaged ML and AI capabilities.

Data Lake 2.0 Capability: Support for OLAP-style Data Cubes

Data Lake 2.0 should allow companies with a large investment in this technology to migrate existing data cubes into the data lake and to develop new ones there.
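
For teams with OLAP habits, big data engines can reproduce cube-style rollups directly over lake data. A minimal sketch with PySpark's cube operator and hypothetical path and column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cube-sketch").getOrCreate()
    sales = spark.read.parquet("s3://lake/sales/")   # hypothetical path

    # Compute aggregates for every combination of region and product,
    # including subtotals and grand totals, much like a classic OLAP cube.
    sales.cube("region", "product").agg(
        F.sum("revenue").alias("total_revenue"),
        F.count("*").alias("order_count"),
    ).show()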

Data Lake 2.0 Capability: Support for Replication, CDC, and Other Data Supply Chain Capabilities

Data Lake 2.0 should recognize that it is part of a data supply chain and support capabilities for data logistics and data engineering so that data can be easily, reliably, and securely moved around a data supply chain at high speed.
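
The core of change data capture, independent of any particular tool, is applying a stream of insert, update, and delete events to a target copy. A generic, illustrative sketch of that logic (the event format is an assumption, not any product's wire format):

    # Apply a batch of change events, as a CDC tool might emit them,
    # to an in-memory copy of a target table keyed by primary key.
    target = {}  # primary key -> row

    changes = [
        {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
        {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
        {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
        {"op": "delete", "key": 2, "row": None},
    ]

    for change in changes:
        if change["op"] in ("insert", "update"):
            target[change["key"]] = change["row"]     # upsert
        elif change["op"] == "delete":
            target.pop(change["key"], None)

    print(target)   # {1: {'id': 1, 'status': 'shipped'}}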

Data Warehouse Capabilities for Data Lakes to Inherit

These capabilities represent a partial list of those that have been developed to a high degree of maturity in the data warehouse. It would be useful to replicate them in Data Lake 2.0. Please suggest others.

Capabilities Best Implemented in a Data Warehouse

These capabilities are best implemented in a data warehouse:

New Capabilities to Address Data Lake 1.0 Failure Modes

It is not clear what would fall into this category, but we suspect that some distillation of Data Lake 1.0 capabilities would represent a new vision.