Saving Your Data Lake

The Gist of the Mission

I first heard the term data lake in 2011 at a meetup where James Dixon, CTO and founder of Pentaho, which is now a part of Hitachi Vantara, explained the concept. At that point, Dixon used the term data lake to explain what was needed to allow management and analysis of big data now that Hadoop had arrived and made it affordable (see “Big Data Requires a Big, New Architecture”).

Since then, Hadoop has come and gone as the star of the big data show, but the idea of the data lake has persisted for the reasons Dixon pointed out. We need a way to store large amounts of data, manage it and make it accessible, process it in new ways, and deliver the results in production.

Since the start of this journey, hard-to-code MapReduce batch jobs have given way to the superior YARN framework, Spark has risen as an important new processing platform that is much more application friendly than Hadoop, and the cloud providers are all busy explicitly creating platforms to support what I call data supply chains.

As I pointed out in this article in 2013, Hadoop was on its way to becoming an embedded system. That process is now complete. The Hadoop vendors, Cloudera, Hortonworks, and MapR, don’t talk about Hadoop any more. They are all busy attempting to build larger product visions around core capabilities for big data processing. Hadoop is in there somewhere, but the elephant is nowhere to be seen.

That’s where we are now stuck. The latest wave of data lake implementations has not been broadly successful at making big data an asset that informs a wide swath of business decision-making. There have been major victories with respect to using Hadoop to build data supply chains to provide high value data to applications and to create certain types of data marts. And many new applications of AI are continuing to provide value using special purpose data supply chains.

But being able to build, with great effort, advanced data pipelines is not what we want or need from a data lake. What we need is a way to store and manage big data and allow as many people as possible to access it, distill it, analyze it, and find important signals in it.

A clear vision of a data lake that is as well defined and operationalized as the data warehouse has eluded us.

This research mission is about defining a complete vision for a data lake, one that has all the missing pieces that were not included in the first wave of implementation.

Our goal is to find a way to take a data lake from a stalled or unproductive state to playing a central role in making big data useful to support high value use cases that have a major impact on business results.

The sustainable advantage that this research mission seeks to describe is the full implementation of the data lake, which performs as originally intended and overcomes some of the challenges encountered by businesses during the first wave of data lake implementation.

In our view, the data lake remains an excellent idea to solve enterprise data needs, but only if it is more fully conceived. It must be able to provide a larger and comprehensive environment for data ingestion and management. Additionally, data lakes need to be able to offer support for self-service, in which a broad range of users, including those without high-level technical skills, have the tools to solve their own problems.

This research mission focuses on how companies can craft a vision that moves beyond the origins of most data lakes to what they now must do in order to be successful.

The Original Vision For The Data Lake

The original energy behind the data lake sprang up because Hadoop showed that it was possible to process and store big data in nearly limitless amounts in an affordable way, using new computing paradigms. The general idea was that companies could:

Affordably store the growing amount of big data becoming available from machine data sources (logs and other digital exhaust) and all sorts of sensors (IoT).
Imitate the successes of the web-scale Internet companies by using new computing paradigms (MapReduce) to extract valuable signals.
Use new methods of data management (such as storing flat files, JSON, Parquet, and Avro) and new modes of data processing (schema on read) to expand the reach of the data lake to new types of data.

Later in the development of Hadoop and the data lake concept, another goal was added:

Using data lakes as a cheaper, more flexible alternative for certain ETL and batch workloads running on expensive data warehouses

The problem is that these general concepts about data lakes never worked out to create mission-critical computing infrastructure for the majority of the enterprise. And data lakes have not radiated value the same way that data warehouses, despite their flaws, do.

As pointed out above, this does not mean that valuable data pipelines weren’t created. In AdTech and for many web-scale company use cases, for example, MapReduce programs were created to deliver specific data sets that were used for reporting and analytics and to power apps.

The process of creating data pipelines was made simpler through technology such as Netflix Genie, Cascading, and many other systems. Almost every ETL vendor, such as Syncsort and Talend, and related vendors like Attunity figured out how to run their technology on a Hadoop cluster.

But the whole picture has not created a product in the sense that CRM, ERP, or data warehouses have become products.

Future installments of this research mission will focus more on what specific technologies succeeded in the first data lake era and what we learned about how the open source ecosystem works in a multi-vendor world with lots of innovation in related projects. But for now, we want to look at what went wrong and how to fix it.

Common Failure Modes of Data Lakes

Some common problems occurred during the first wave of data lake implementations. We need a name for this, and it seems natural, if a bit tired, to call this wave Data Lake 1.0.

By understanding what went wrong in Data Lake 1.0, we can better put forth a complete vision for Data Lake 2.0.

According to our research, including conversations with vendors and practitioners, there are a variety of failure modes that have been encountered so far.

Our distillation of the most common failure modes is presented below. (If you know of new ones or have comments on these, please reach out to us.)

Failure Mode: The Data Swamp

In Data Lake 1.0, the data swamp occurred when more and more data was added to the data lake and the methods for tracking it broke down, so only tribal knowledge guided people to the data they needed.

The data swamp failure mode occurs when a large amount of data is stored in the data lake but the naming conventions, folders, and other informal ways of keeping track of the files containing the data and where they came from break down.

The data lake becomes a messy file system in which the wrong data is used, data cannot be found, or the same data is stored many times.

The remedy is a data catalog that:

Provides a clear process for onboarding data to the data lake.
Automatically extracts as much metadata as possible.
Profiles the data to allow it to be quickly assessed.
Tracks data lineage over time.
Allows access to be controlled.
Allows data to be managed according to quality and maturity: Gold, Silver, Bronze.
Supports search.
Supports collaboration.
Makes recommendations about related data sets.
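A few of the catalog functions listed above (onboarding, lineage, access groups, quality tiers, and search) can be sketched in a handful of lines. This is only an illustration under assumptions: the `CatalogEntry` and `Catalog` classes, their field names, and the sample paths are hypothetical and do not describe any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set registered in a hypothetical data catalog."""
    name: str
    path: str
    source: str                                        # where the data came from
    tier: str = "bronze"                               # bronze, silver, or gold
    columns: list = field(default_factory=list)
    derived_from: list = field(default_factory=list)   # lineage: parent names
    allowed_groups: set = field(default_factory=set)   # access control


class Catalog:
    def __init__(self):
        self._entries = {}

    def onboard(self, entry):
        """Register a data set; refuse duplicates so data is cataloged once."""
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already cataloged")
        self._entries[entry.name] = entry

    def search(self, term):
        """Return names of entries whose name or columns mention the term."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or any(term in c.lower() for c in e.columns)
        )

    def lineage(self, name):
        """Walk derived_from links back toward the original sources."""
        chain = []
        for parent in self._entries[name].derived_from:
            chain.append(parent)
            if parent in self._entries:
                chain.extend(self.lineage(parent))
        return chain


catalog = Catalog()
catalog.onboard(CatalogEntry("web_logs_raw", "/lake/raw/web_logs", "cdn",
                             columns=["ts", "url", "user_id"]))
catalog.onboard(CatalogEntry("sessions", "/lake/curated/sessions", "derived",
                             tier="silver", columns=["user_id", "duration"],
                             derived_from=["web_logs_raw"]))
```

Here `catalog.search("user_id")` finds both data sets, and `catalog.lineage("sessions")` traces the curated set back to the raw logs, which is the kind of question tribal knowledge otherwise has to answer.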

Failure Mode: Dirty and Unorganized Data

In Data Lake 1.0, all sorts of new data arrived in widely varying states of organization and quality. The lack of self-service tools for inspecting, cleaning, and reorganizing the data slowed down the process of creating value.

It is always possible to figure out what is in a data set using advanced tools like R or Python, and then to use the same tools to clean the data up and extract important signals. When this process is too hard, you get a severely bottlenecked data lake that may have a huge amount of information that is trapped because too few people possess the skill set to understand it.

To make a data lake work, you must scale the team that can understand the data, clean it up, find important signals, and then package the data so that the data itself becomes a product.

The remedy is to use one of the latest generation of data prep tools, which address this need in the following ways:

Provide a guided process for understanding what is in a dataset.
Provide a guided process for data transformation enabled by a large collection of transformation functions, ways to create lookup and transformation tables, and other ways to transform data.
Allow use both by super users who need the help and by data scientists and developers who want to get under the hood and create complex transformations in the same environment.
Enable these complex transformations created by data scientists and developers to become user defined functions that can become part of the guided process.
Capture the definition of a transformation to create a job that can process a data pipeline that can run automatically as new data arrives.
Allow data pipelines consisting of multiple jobs to be defined.
Integrate with other parts of the data lake such as the data catalog.

With such data prep tools, as more data lands, more people can work on making it useful.
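To make the idea of captured transformations concrete, here is a minimal sketch of a transformation registry, a user defined function, and jobs chained into a pipeline. All names (`TRANSFORMS`, `register`, `make_job`, `run_pipeline`) and the lookup table are assumptions made for illustration, not the design of any specific data prep tool.

```python
# Registry of named transformation functions. A data prep tool would ship many
# of these and let developers register their own (user defined functions).
TRANSFORMS = {
    "strip": lambda v: v.strip(),
    "lower": lambda v: v.lower(),
}


def register(name):
    """Decorator that adds a user defined function to the registry."""
    def wrap(fn):
        TRANSFORMS[name] = fn
        return fn
    return wrap


@register("country_code")
def country_code(value):
    # Lookup table standing in for a transformation table built by a user.
    lookup = {"united states": "US", "germany": "DE"}
    return lookup.get(value, "UNKNOWN")


def make_job(column, steps):
    """Capture a recipe (a column plus ordered step names) as a reusable job."""
    def job(rows):
        out = []
        for row in rows:
            row = dict(row)  # copy so the input rows are not mutated
            for step in steps:
                row[column] = TRANSFORMS[step](row[column])
            out.append(row)
        return out
    return job


def run_pipeline(jobs, rows):
    """Run several jobs in order, as would happen when new data arrives."""
    for job in jobs:
        rows = job(rows)
    return rows


pipeline = [
    make_job("country", ["strip", "lower", "country_code"]),
    make_job("email", ["strip", "lower"]),
]
raw = [{"country": "  Germany ", "email": " Ann@Example.COM "}]
clean = run_pipeline(pipeline, raw)
```

The point of the design is that the recipe, not the person who wrote it, becomes the reusable asset: once a job is captured it can be rerun on every new batch without the original author.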

Failure Mode: Lack of Self-Service Analytics

In Data Lake 1.0, data was often exploited only after data engineers, data scientists, or developers created a useful data set and delivered it to a broader set of users to work with in familiar tools. The lack of self-service analytics tools to allow a wide range of people to explore big data added to the bottleneck.

Solving the data prep problem is part of taking away a bottleneck, but a second related challenge involves allowing the prepared data to be put to use in ways that can provide useful signals to business processes.

This is a tricky issue because there are a lot of moving parts. Here’s how it often gets solved.

Stage 1: Delivering data packages to existing BI and data discovery tools

It takes a long time for an organization to adopt and learn how to use a BI or data discovery tool like Tableau, Qlik, TIBCO Spotfire, or any other.

So it certainly makes sense that in the first stage of a data lake, packages of data with valuable signals are delivered to BI tools, which are already in front of users.

This model allows the data lake to become operationalized and can provide new value and new signals that help a business increase awareness and make better decisions.

This model can work even better when users can use data prep tools to create their own data packages. When this happens, the real value of self-service starts to be unlocked.

Stage 2: Enabling Data Discovery inside the Data Lake

But a problem exists with the model in which data packages are delivered. It is a two-step dance that interrupts the ideal user-driven innovation process. As we will examine later, it also obscures some of the value of big data.

The most productive way to use the power of the data in a data lake is to use an analytics tool that allows direct access to that data. In this way, the detail and richness of the big data is still available during the exploration and discovery process. In the model in which data packages are sent to existing BI tools, the high resolution detail of the big data is not available.

As we will cover later, direct access allows analysts to make more use of the big data, that is, to ask and answer big data questions.

In addition, the types of systems that allow analysts to have direct access to big data usually have a much wider set of analytics, AI and machine learning algorithms to make sense of big data so that new types of signals can be harvested.

Alternative Approaches to Preserving Direct Access to High Resolution Big Data

It is also possible to preserve the ability to use the high resolution data from a data lake in other ways.

Provide big data detail to a data warehouse. In this model you move the data sets to a data warehouse that can handle the volume of data. Many of the cloud-based or upgraded data warehouses can chew through huge amounts of data.
Use a special purpose data warehouse that is built for big data. There are a variety of data warehouses that have been built to access big data using various types of optimizations.
Use a data search or analysis tool that allows big data to be searched and distilled directly. There are a variety of tools that can search and distill data in a data lake in various ways.

Failure Mode: Custom-made Analytics with No Path to Operations

In Data Lake 1.0, great stuff was created to process data and find signals using advanced analytics. Often, however, these jobs needed to be curated and run by the data science teams and could never go into production. When they did, there was not a clear model for maintenance and retraining of the algorithms.

Once you have created an environment for self-service, ideally, predictive models and other ways of finding important signals should be able to be packaged and made available for use.

Without such packaging:

The use of AI, ML, and advanced analytics can be bottlenecked inside data science teams.
It becomes virtually impossible to achieve mass adoption of AI, ML, and advanced analytics.

The best situation is one in which:

The data science teams have the ability to promote jobs into production and participate in an AnalyticsOps sort of process where they can train the operations team to run models and detect signals that indicate retraining may be needed.
Once models are stable and well understood and have a solid operations process, they should be able to be packaged and applied more widely to many different uses as needed by business processes.
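One simple form that a retraining signal could take is a comparison of a model's recent prediction errors against the errors measured when it was deployed. The sketch below assumes that approach; the function name `needs_retraining`, the tolerance factor, and the sample error values are all hypothetical, and real monitoring would likely use richer statistics.

```python
from statistics import mean


def needs_retraining(baseline_errors, recent_errors, tolerance=1.5):
    """Flag a model when its recent mean absolute error exceeds the
    baseline mean absolute error by more than the tolerance factor."""
    baseline = mean(abs(e) for e in baseline_errors)
    recent = mean(abs(e) for e in recent_errors)
    return recent > tolerance * baseline


baseline = [0.9, -1.1, 1.0, -1.0]   # errors measured at deployment time
stable = [1.0, -1.2, 0.8, -1.1]     # later window, similar quality
drifted = [2.5, -3.0, 2.8, -2.6]    # later window, clearly worse
```

A check this small is something an operations team can run and interpret without help from the data scientists who built the model, which is the handoff the process above is meant to enable.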

Failure Mode: Lack of Tools for Packaging and Modeling to Create Data Products

In Data Lake 1.0, it often seemed that every effort started from scratch because previous projects had few ways to package data for reuse. The ability to create data models and publish data sets that reflected them was also lacking. Both capabilities support the productization of data, which accelerates progress.

Data warehouses are often derided because they are slow to evolve and hard to change to meet new needs. But one of the reasons that data warehouses are this way is that often a huge amount of time is spent modeling the data.

When a data model is successful, it creates a commonly understood language for talking about the business.

Data lakes need data models for the same reason, to create a common language. When a data lake has no way to express a model, or to explain how a data set was created so that it can become commonly understood, it is as if each analyst is starting from scratch each time.

When a model can be created and gradually expanded, the result is a set of reusable data that can be the foundation for a common understanding and productivity.

Data lakes need ways to allow data to be packaged into reusable components, to have the model described, and eventually to have the data become a way of understanding the business or meeting an important need. I call this level of quality, understanding, and usefulness a data product.

Failure Mode: Lack of Data Governance and Security

In Data Lake 1.0, often the idea was that everyone would have access to all the data. This imitated wide data access policies at companies like Google. But it didn’t take long for this model to break down and for governance and security to be needed. Often, Data Lake 1.0 capabilities were not sufficient.

Data lakes were often conceived as catch-all repositories. Lots of data started being stored there and not a lot of thought was given to data governance, security, and appropriate access. This sort of problem often happens in the early days of a data warehouse as well.

To really become part of an enterprise data infrastructure, a data lake must have a capability for controlling access and auditing usage of data.

In its most advanced form, such support includes distributed control of data sets so that the owners can define the terms of use through data sharing agreements that can be executed between parts of the organization when one group wants to use data belonging to another group.

Without this capability, unpleasant surprises often occur, where sensitive information is widely exposed. Alternatively, certain important classes of data are not able to become part of the data lake.
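The combination of owner-controlled access, data sharing agreements, and auditing can be sketched as follows. The data set names, groups, and the structure of `AGREEMENTS` are invented for illustration; a production system would back this with real identity and policy services.

```python
from datetime import datetime, timezone

# Data sharing agreements: (data set, consuming group) -> permitted purpose.
AGREEMENTS = {
    ("customer_pii", "fraud_team"): "fraud investigation",
}
OWNERS = {"customer_pii": "crm_team"}
AUDIT_LOG = []


def request_access(dataset, group, purpose):
    """Allow access for the owning group or under a matching agreement,
    and record every attempt, allowed or not, in the audit log."""
    key = (dataset, group)
    allowed = (
        OWNERS.get(dataset) == group
        or (key in AGREEMENTS and AGREEMENTS[key] == purpose)
    )
    AUDIT_LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "group": group,
        "purpose": purpose,
        "allowed": allowed,
    })
    return allowed


granted = request_access("customer_pii", "fraud_team", "fraud investigation")
denied = request_access("customer_pii", "marketing", "campaign targeting")
```

Logging denied requests as well as granted ones matters: the denials show which groups want data they cannot yet use, which is where a new sharing agreement may be worth negotiating.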

Failure Mode: Sticking with Small Data Thinking and Practices

In Data Lake 1.0, the data lake was sometimes used as if it were a data warehouse because people neither understood the power of big data nor had tools to explore the data directly.

When everyone using a data lake is constrained by small data thinking, a data lake becomes far less useful than it should be.

In essence, small data thinking means using a data lake as if it is a data warehouse. In such a scenario, many of the following things may occur:

Use of data remains highly intermediated, with users not having access to or not using self-service tools.
Only small-data tools such as hand-crafted queries are used to examine data. Advanced algorithms and other techniques are not used.
Small data questions are asked, not big data questions that use the high resolution signals coming from big data. (An example of a small data question is: what were our sales last month? An example of a big data question is: what is our expected churn for customers next year, and which high-value customers are at risk?)

Failure Mode: Perpetual Proof of Concept

In Data Lake 1.0, for many of the reasons cited, often valuable data pipelines never found their way to reliable, supported production maintained by an operations staff.

Too often data pipelines created in a data lake are not easily moved into production and supported:

Meeting the challenges mentioned so far often means that the data pipelines created are custom jobs. This makes it difficult to move them to production and to let an operations team run them. As a result, the jobs stay with the teams that created them, who run them when needed, which cuts into their time for further analysis.
When AI, machine learning, and advanced analytics are part of data pipelines, it is often not clear how to send them to production in a way that allows the sort of continuous monitoring that is needed to indicate when retraining or other adjustments are needed.
When a data pipeline creates an extract that is then used in a data discovery or BI tool, there should be a clear way for the user to understand the lineage of that extract and ask for changes.

Failure Mode: The “One Data Lake to Rule Them All” Vision

In Data Lake 1.0, the legitimate excitement about having a huge, affordable repository led to the idea that all needed data could be stored in one place. Even if that could be done, there will always be new repositories outside the data lake that need to be included. The vision sometimes reduced focus on providing federated access to many repositories, a needed capability in most companies.

The ability of the data lake to store all sorts of data at volume led to a renewed interest in one of the holy grail goals of the data warehouse: to have one repository to rule them all.

While the idea of having all data available in one repository is alluring, as a practical matter it is almost impossible.

So, the goal of having a data lake with all available data becomes a failure mode when it prevents or distracts focus from the implementation of federated access in some form to a wide variety of repositories that haven’t yet found their way to the data lake.

Data Lake 2.0: A Vision for Building the Best Possible Data Lake

Since the data lake was first conceived, the thinking about what it could become has gotten more precise and detailed. I believe that enough new ideas have emerged since the initial excitement about putting big data to work using Hadoop that we can legitimately talk about a new more complete vision for a data lake. We might as well call it Data Lake 2.0 because everyone will know what we mean.

The vision we present for Data Lake 2.0 is a catchall of new capabilities that are being implemented to extend the power of the data lake. It would never make sense to implement all the capabilities we describe; indeed some of them are contradictory or only serve special needs. But reviewing the list should give even the most experienced student of the data lake concept some new ideas.

Data Lake 2.0 Capabilities That Break New Ground

The capabilities we are going to describe all break new ground in that they were generally not considered in a mature form in the initial vision and implementations of the data lake, Data Lake 1.0.

Data Lake 2.0 Capability: Tracking Data Usage to Support an Agile Data Productization Process

Data Lake 2.0 should include a way to monitor usage so that “market” signals can be used to solve problems and guide investment.

When data lakes become an important center of activity, it is vital to track usage so that it is possible to understand what data is becoming important. With usage data, many things are possible such as:

Determining how to allocate investment in improving, packaging, and creating new models based on what people are using.
Creating a lifecycle for data such as bronze, silver, and gold strata that is based on what data is most popular.
Accelerating the widespread adoption of new datasets that come to the attention of key groups of users.
Creating a defined agile process for managing the data lake.
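As an illustration of usage-driven tiers, the sketch below counts how many distinct users touch each data set and assigns bronze, silver, or gold accordingly. The event format, the thresholds, and the `assign_tiers` function are assumptions; a real data lake would harvest these events from query and access logs.

```python
from collections import Counter

# Hypothetical access events: (user, data set) pairs harvested from query logs.
events = [
    ("ann", "orders"), ("bob", "orders"), ("cy", "orders"), ("ann", "orders"),
    ("ann", "clickstream"), ("bob", "clickstream"),
    ("cy", "legacy_export"),
]


def assign_tiers(events, gold_users=3, silver_users=2):
    """Tier each data set by how many distinct users touched it."""
    users = {}
    for user, dataset in events:
        users.setdefault(dataset, set()).add(user)
    tiers = {}
    for dataset, who in users.items():
        if len(who) >= gold_users:
            tiers[dataset] = "gold"
        elif len(who) >= silver_users:
            tiers[dataset] = "silver"
        else:
            tiers[dataset] = "bronze"
    return tiers


access_counts = Counter(dataset for _, dataset in events)
tiers = assign_tiers(events)
```

Counting distinct users rather than raw accesses is a deliberate choice in this sketch: one analyst rerunning a query many times should not promote a data set the way broad adoption does.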

Data Lake 2.0 Capability: Federated Tracking and Access to Repositories Outside the Data Lake

Data Lake 2.0 should include the ability to track and access data in repositories outside the data lake.

Even if you want “one repository to rule them all,” it will take a while to get there. And there will always be data in repositories that you probably don’t want to move into a data lake. So, it is vital to have the ability to:

Track and harvest metadata about data that is outside of the data lake.
Help create queries that join that data with data inside the data lake.
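The shape of a federated query can be shown with a toy stand-in. Here two separate in-memory SQLite databases play the roles of the data lake and an outside repository, and a single query joins them. SQLite and the table names are assumptions chosen only to keep the example self-contained; a real deployment would rely on a query federation engine working against actual repositories.

```python
import sqlite3

# "main" stands in for the data lake; "crm" stands in for an outside repository.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.events (customer_id INTEGER, clicks INTEGER)")
conn.executemany("INSERT INTO main.events VALUES (?, ?)",
                 [(1, 40), (2, 5), (1, 10)])

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])

# One query joins lake data with data that never moved into the lake.
rows = conn.execute(
    """
    SELECT c.name, SUM(e.clicks) AS total_clicks
    FROM main.events AS e
    JOIN crm.customers AS c ON c.id = e.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The analyst writes one query and does not need to know, or care, that the customer names live somewhere other than the lake.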

Data Lake 2.0 Capability: Automatic Metadata Capture (Including Usage Data) for Data Catalog

Data Lake 2.0 should be able to capture as much metadata as possible about data sets loaded into it or data sets that are being tracked.

Most catalogs try to get as much as possible from the schema and column names.
Being able to add usage data can transform a data lake into a data marketplace where signals about usage can be used in a variety of ways.
The better and deeper the information in the catalog, the easier it is to guide people to the right data.
A strong, accurate catalog helps with automatic construction of queries.
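Automatic metadata capture from schema and column names might look like the following sketch, which reads CSV text and records column names, inferred types, and a row count. The `infer_type` heuristic and the output dictionary layout are assumptions for illustration; real catalogs use far more thorough inference.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "unknown"
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            for v in present:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"


def capture_metadata(name, csv_text):
    """Extract column names, inferred types, and row count from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "name": name,
        "row_count": len(rows),
        "columns": {c: infer_type([r[c] for r in rows]) for c in columns},
    }


sample = "id,price,city\n1,9.99,Boston\n2,12.50,\n3,7,Austin\n"
meta = capture_metadata("orders", sample)
```

Running this at load time, rather than on request, is what turns metadata capture from a chore into a property of the lake.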

Data Lake 2.0 Capability: Automatic Data Profiling for Data Catalog

Data Lake 2.0 should have data profiling information automatically added to the data catalog.

A quick set of standard profiling information can really help an analyst or data scientist determine if a data set is right for him or her.
Providing data profiling information in advance of the inspection of the data, rather than making someone do it when they need the data, dramatically increases the amount of data that is used because of the improved convenience.
Data profiling can be adapted and have different standard forms based on different types of data.
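A standard profile of the kind described above can be quite small. The sketch below computes counts, missing values, distinct values, and the range for each column of a data set held as a list of dictionaries. The function names and the choice of statistics are assumptions; different data types would call for different standard profiles.

```python
def profile_column(values):
    """Summarize one column: counts, missing values, distinct values, range."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present:
        profile["min"] = min(present)
        profile["max"] = max(present)
    return profile


def profile_dataset(rows):
    """Profile every column of a list of dict rows."""
    columns = rows[0].keys() if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}


rows = [
    {"age": 34, "state": "MA"},
    {"age": None, "state": "TX"},
    {"age": 51, "state": "MA"},
]
profile = profile_dataset(rows)
```

Even this much tells an analyst at a glance that a third of the `age` values are missing, which may be enough to decide whether the data set is worth a closer look.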

Data Lake 2.0 Capability: Automated Semantic Linking Based on AI and ML Analysis

Data Lake 2.0 should have the ability to analyze the content and semantics of data sets in order to find relationships between them.

The metadata approach to creating a catalog has significant weaknesses because the metadata may be limited or incorrect, so having the ability to look at the semantics of the data using machine learning often allows much more to be revealed about what content and signals are stored in the data.
By adding this information to the existing data catalog, it’s possible to make a much more powerful and accurate data catalog.
In addition, when searching for data that may not have been the focus of earlier cataloging efforts or that is required for regulatory compliance, being able to crawl the entire data set and find relationships can make your entire data lake more useful.
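Looking at content rather than metadata can start with something as simple as measuring how much the values in two columns overlap. The sketch below uses Jaccard similarity as a plain stand-in for the machine learning analysis described above; the `suggest_links` function, the threshold, and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_links(datasets, threshold=0.5):
    """Compare every pair of columns across data sets and return the
    pairs whose value overlap meets the threshold."""
    links = []
    names = sorted(datasets)
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            for lcol, lvals in datasets[left].items():
                for rcol, rvals in datasets[right].items():
                    score = jaccard(lvals, rvals)
                    if score >= threshold:
                        links.append(
                            (f"{left}.{lcol}", f"{right}.{rcol}", score)
                        )
    return links


datasets = {
    "orders": {"cust": ["c1", "c2", "c3"], "amount": [10, 20, 30]},
    "support": {"customer_ref": ["c2", "c3", "c4"], "minutes": [5, 7, 9]},
}
links = suggest_links(datasets)
```

Note that the two linked columns have different names (`cust` and `customer_ref`), so a catalog relying only on column-name metadata would have missed the relationship.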

Data Lake 2.0 Capability: Collaborative, Curated Data Catalog

Data Lake 2.0 should have a data catalog that allows human curation and supports collaboration.

In any community that’s using data, people will find golden nuggets or low quality data. As with online restaurant reviews, people should be able to comment on the positive or negative aspects of a particular data set.
In addition, the ability to make suggestions or comments or include other people in discussions of data, share queries or extracts, or other ways of using data should be a part of the catalog.
Allowing human input to be captured makes the data lake a repository of tribal knowledge.

Data Lake 2.0 Capability: Support for Streams, NoSQL, Graph, and Other Repositories

Data Lake 2.0 should support a wide variety of repositories to deliver and operationalize the data.

Data lakes are all about refining and distilling and creating high quality subsets of data that are rich in signals and can be of value to the business. Depending upon the context, these may be delivered as a data set or in various formats and repositories, such as streams, NoSQL, Graph, and so on.
In the most advanced form of a data lake, a single data set can be delivered in a multi-modal fashion so it could be accessed as a stream, NoSQL, graph, or other form depending on the need.

Data Lake 2.0 Capability: Support for Data Fabric Access

Data Lake 2.0 should allow a portion of the data in the data lake to be used through a data fabric as an operational data set to support applications and use of data, and not just exploration and data science.

The data fabric is a construct that was developed in response to the fact that applications need a wide variety of services when accessing data, not just the ability to retrieve and store data.
The data fabric, providing operational access, must have the performance characteristics needed to support high performance applications, as well as high performance techniques like caching, indexing, etc.
Data fabrics often are the way that multi-modal access is provided for applications and other transactional uses.
In its most advanced form, a data fabric also supports global multi-repository architectures through replication and transactionality that spans all repositories.

Data Lake 2.0 Capability: Support for Advanced, Big Data SQL

Data Lake 2.0 should provide access to the most advanced big data SQL engines and the extended capabilities they offer.

A variety of SQL engines have been developed to provide access to tabular data that is stored in Hadoop and other object repositories. In many cases, a data lake is used mainly through simple SQL queries against big data.
The engines that deliver these highly scalable SQL capabilities also often act as query federators that extend the power of the data lake to collect and integrate data from many different repositories outside the data lake.

Data Lake 2.0 Capability: Support for Interactive Big Data Analytics

Data Lake 2.0 should provide analysts with direct access to the big data with a set of high power self-service tools.

One of the failure modes of data lakes has been losing signals by always distilling data before providing it to an analyst. The big data in a data lake is more powerful when analysts can get the data at a highly granular level. This, along with the wide range of data in the data lake, allows them to ask big data questions.
By providing direct access to the data in the data lake, you vastly increase the number of people who can discover important signals and package data for use by others.
Advanced algorithms and statistics work far better when given access to the most detailed level of data.

Data Lake 2.0 Capability: Support for Packaged, Interactive ML and AI

Data Lake 2.0 should allow self-service users, as well as data scientists and analysts, to employ packaged ML and AI capabilities.

Both in batch and interactive jobs, once a model has been figured out and proven useful, it should be easy to use, and an advanced data lake must support that capability.

Data Lake 2.0 Capability: Support for OLAP-style Data Cubes

Data Lake 2.0 should allow companies with a large investment in data cubes to migrate existing cubes into the data lake and to develop new ones there.

Data cubes provide a hugely valuable service in creating both a model of data that reflects a common understanding of a business and also a way to execute queries with high performance.
While OLAP technology has been around for decades, it is also quite widely used and popular, and the best data lake will allow this method to be supported so that existing patterns of use, investment, training, and modeling can be exploited.

Data Lake 2.0 Capability: Support for Replication, CDC, and Other Capabilities to Support a Data Supply Chain

Data Lake 2.0 should recognize that it is part of a data supply chain and support capabilities for data logistics and data engineering so that data can be easily, reliably, and securely moved around a data supply chain at high speed.

A data lake should either support or have access to enterprise-class capabilities for replication, change-data capture, and consumption and publication of streaming data.
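Change-data capture boils down to replaying an ordered stream of insert, update, and delete events against a copy of the source table. The sketch below shows that core loop; the event format (`op`, `key`, `row`) and the `apply_changes` function are assumptions for illustration and do not follow any particular CDC product.

```python
def apply_changes(replica, changes):
    """Apply an ordered list of change events to a keyed replica table."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge new column values into the existing row, if any.
            replica[key] = {**replica.get(key, {}), **change["row"]}
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


replica = {1: {"name": "Acme", "tier": "basic"}}
changes = [
    {"op": "update", "key": 1, "row": {"tier": "premium"}},
    {"op": "insert", "key": 2, "row": {"name": "Globex", "tier": "basic"}},
    {"op": "delete", "key": 1},
]
apply_changes(replica, changes)
```

Because only the changes travel, the replica in the data lake can stay current without repeatedly copying the whole source table, which is what makes this approach attractive for a data supply chain.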

Data Warehouse Capabilities for Data Lakes to Inherit

These capabilities represent a partial list of those that have been developed to a high degree of maturity in the data warehouse. It would be useful to replicate them in Data Lake 2.0. Please suggest others.

Governed data lab/data sandbox
Data quality

Capabilities Best Implemented in a Data Warehouse

These capabilities are best implemented in a data warehouse:

Production reporting
Use of SQL for many small data tables
Use of SQL for complex queries
Use of SQL in production for small data workloads

New Capabilities to Address Data Lake 1.0 Failure Modes

It is not clear what would fall into this category, but we suspect that some distillation of Data Lake 1.0 capabilities would represent a new vision.