Data Lake Failure Modes – Insights from Arcadia Data
To support the research mission Saving Your Data Lake, we’re speaking with various companies to capture their perspectives on why and how data lakes fail. Arcadia Data has experience using data inside the data lake in a new way: analysts interact directly with the data, in what the company calls a data native mode, to create and share analytics and to ask and answer big data questions. During our conversations with Arcadia Data, we discussed five main reasons data lakes fail:
Staying in a low resolution world
Part of Arcadia Data’s data native philosophy is that you have to have all your data, in all its granularity, in the data lake and available to users. The data needs to be brought together in a fully detailed view so that the source data itself can be used, allowing users to ask big data questions. These have far more value than the previous generation of small data questions. (An example of a small data question is: what were our sales last month? An example of a big data question is: what is our expected churn for customers next year?) But many companies don’t recognize that this increased granularity is possible with the data lake, or how it can support more powerful analytics, so they remain in a low resolution world. It’s like buying an HDTV and then watching grainy YouTube videos on it: you need to fully utilize the power of the data lake technology to get the most out of it.
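To make the difference between the two kinds of question concrete, here is a minimal sketch in Python. The order records, the function names, and the 60-day threshold are all hypothetical, and "days since last order" is only a crude stand-in for a real churn model. The point is that the monthly summary can be derived from the granular records, but the per-customer view cannot be recovered from the summary.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order-level records: (customer_id, order_date, amount).
ORDERS = [
    ("c1", date(2019, 1, 5), 40.0),
    ("c1", date(2019, 2, 9), 35.0),
    ("c1", date(2019, 3, 12), 50.0),
    ("c2", date(2019, 1, 20), 80.0),
    ("c3", date(2019, 3, 3), 25.0),
]

def monthly_sales(orders):
    """Small data question: total sales per (year, month)."""
    totals = defaultdict(float)
    for _, day, amount in orders:
        totals[(day.year, day.month)] += amount
    return dict(totals)

def days_since_last_order(orders, as_of):
    """Per-customer recency, which needs the individual order records."""
    last_order = {}
    for customer, day, _ in orders:
        if customer not in last_order or day > last_order[customer]:
            last_order[customer] = day
    return {customer: (as_of - day).days for customer, day in last_order.items()}

summary = monthly_sales(ORDERS)
recency = days_since_last_order(ORDERS, as_of=date(2019, 3, 31))

# Illustrative assumption: a customer with no order in 60 days is "at risk".
at_risk = sorted(c for c, days in recency.items() if days > 60)
```

If only `summary` had been kept, the question behind `at_risk` could no longer be asked, because the customer identifiers and order dates would be gone.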
Keeping elements of small data architecture
Many companies fail to take advantage of the data lake because they do not introduce new architecture to support it. They keep small data chokepoints and preserve small data patterns of use. As a result, they miss out on the flexibility of the data lake, where schemas can be applied later, on read, rather than up front, on write. Data lakes thus fail because companies continue to live in a federated query world.
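The schema-on-read idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the raw events, the field names, and the `read_with_schema` helper are all hypothetical. The raw records are stored exactly as they arrived, and each consumer applies its own schema only when it reads them.

```python
import json

# Hypothetical raw events, stored as-is with no schema enforced on write.
RAW_EVENTS = [
    '{"user": "a1", "event": "purchase", "amount": "19.99", "ts": "2019-03-01T10:00:00"}',
    '{"user": "b2", "event": "login", "ts": "2019-03-01T10:05:00", "device": "mobile"}',
    '{"user": "a1", "event": "purchase", "amount": "5.00", "ts": "2019-03-02T09:30:00"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    `schema` maps field name -> casting function. Only the listed fields
    are kept, and records missing any of them are skipped.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows

# Two different schemas over the same raw data, chosen by the reader.
SALES_SCHEMA = {"user": str, "amount": float}
ACTIVITY_SCHEMA = {"user": str, "event": str, "ts": str}

sales = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
activity = read_with_schema(RAW_EVENTS, ACTIVITY_SCHEMA)
total_sales = sum(row["amount"] for row in sales)
```

With schema on write, the `device` field and the login event would likely have been dropped or rejected at load time unless someone had planned for them. Here they stay in the raw data, and a new schema can pick them up later without reloading anything.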
Using data lakes in the background
Another failure mode occurs when companies move distilled, curated data out of the data lake and back into the data warehouse. Once they do so, they limit the reach of that data, as users of the data warehouse are locked back into a small data world. Within the data warehouse, users cannot explore the data or have a data native view of it, which greatly limits the way they can interact with it. It’s like taking a photo of a 3D model: a lot is lost in translation.
Failure to use data lakes for business impact
Often, companies will generate valuable new insights from the data in the data lake and the flexibility it offers. But just as often, the people on the front lines are not able to operationalize those insights, and the behavior of the business stays the same. The last mile of the data lake process is thus disconnected from the rest of it. For data lakes to matter, they must change the way businesses operate.
Imposing a curiosity tax on users
The final failure mode that Arcadia Data pointed to is the curiosity tax some companies impose on data lake users. This occurs when someone else owns the data in the data lake and users cannot work on it directly themselves. It can also happen when the tools analysts are most comfortable using are not available in the data lake. Users thus face a tax that circumscribes their curiosity: over time, they end up not wanting to ask big data questions because of the hassles involved.