Data Lake Failure Modes – Insights from Podium Data
To support the research mission Saving Your Data Lake, we’re speaking with various companies to capture their perspectives on why and how data lakes fail. During our conversations with Podium Data, we discussed three main failure modes of data lakes:
Polluted data lakes
A polluted data lake occurs when many pilot projects, each with its own tools, are unleashed at once. This leads to lots of experimentation, but nothing that can scale to production. The many tools and projects focused on the data lake produce mounds of data without an organizing structure, metadata, or quality control processes. As a result, it becomes difficult to find data or to understand its utility or purpose.
Bottlenecked data lakes
Other companies struggled because they treated the data lake as next-generation data warehouse technology rather than an entirely new approach to data. The data lake was carefully curated, with engineers involved in every decision, which created a bottleneck. Such a team can produce results, but not at scale, and the process remains as slow as ever. Because many common and open source data management tools do not meet enterprise demands with regard to load management and data quality checks at ingest, these companies created a worse data warehouse experience as they moved to the data lake.
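To make "data quality checks at ingest" concrete, here is a minimal Python sketch of an automated check that runs as records arrive, so that no engineer has to review each load by hand. It is an illustration only, not Podium Data's method or any product's API: the required field names, the `IngestReport` class, and the `ingest` function are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical rule: every record must carry these non-empty fields.
REQUIRED_FIELDS = ("customer_id", "event_time")


@dataclass
class IngestReport:
    """Collects the outcome of one ingest run."""
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (record, problems) pairs


def check_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or value == "":
            problems.append(f"missing {name}")
    return problems


def ingest(records):
    """Split incoming records into accepted and rejected, with reasons."""
    report = IngestReport()
    for record in records:
        problems = check_record(record)
        if problems:
            report.rejected.append((record, problems))
        else:
            report.accepted.append(record)
    return report
```

Rejected records are kept together with the reasons they failed, so they can be reviewed later without blocking the rest of the load.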
Risky data lakes
In an effort to give data scientists quick access to the data lake, some companies have not ensured that policies are applied to sensitive data such as personally identifiable information (PII). Enterprises need standards and policies to enforce security around masking and managing sensitive data so that it is not exposed to the wrong internal users or, even worse, to external ones. But the lack of enterprise-class data management tools for data lakes has led some companies to fall short in monitoring, auditing, and controlling access to their data. As a result, they can’t use the data lake at scale.
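As a rough illustration of what a masking policy for sensitive data might look like, the Python sketch below replaces sensitive fields with a keyed hash unless the caller's role is explicitly allowed to see them. Everything in it is an assumption made for the example: the field names, the role names, the `mask_record` function, and the placeholder key. A real deployment would take its policy from a governance tool and its key from a secrets manager.

```python
import hashlib
import hmac

# Hypothetical policy: which fields are sensitive, and which roles may see them.
SENSITIVE_FIELDS = {"email", "ssn"}
ROLES_WITH_CLEAR_ACCESS = {"privacy_officer"}

# Placeholder only; a real key would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_value(value):
    """Replace a value with a keyed hash so equal inputs still match in joins."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return "masked:" + digest.hexdigest()[:16]


def mask_record(record, role):
    """Return a copy of the record with sensitive fields masked for this role."""
    if role in ROLES_WITH_CLEAR_ACCESS:
        return dict(record)
    return {
        key: mask_value(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
```

Using a keyed hash instead of simply blanking the field is one possible design choice: the same input always yields the same token, so data scientists can still join and count on the masked column without seeing the underlying value.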