Why Analysts Need Hands-On Access to Big Data: The Case for the Data-Native Approach
We’ve now gone far enough down the road with big data, data lakes, and the rise of AI- and machine-learning-powered techniques that it’s appropriate to take a step back and ask a few basic questions about what we’re trying to do and how we’re trying to do it.
What’s interesting to me, based on some recent work I’ve done with Arcadia Data to create this white paper, is that a new generation of tools is allowing analysts to have more intimacy with larger and larger data sets. The question is: does it really matter if an analyst can directly interact with and hammer away at a truly massive data set, or is there little to be gained from having such direct access?
Arcadia Data and other companies I’ve worked with in the past, such as 1010Data, argue that it’s vital for analysts to have the whole set of data at their fingertips, because without such access, important information is lost. Arcadia calls this the data-native approach. With a data-native approach, business users have direct, hands-on access to big data, which is different from what they’ve had in the past.
In this article, I’m going to take a close look at how to define a data-native approach and what benefits may accrue from allowing analysts such access.
What Is a Data-Native Approach?
A data-native approach means that analysts have direct access to methods that allow them to ask and answer questions from big data. To do this, the data itself, and the means to process and analyze it, must be in one place.
The opposite of a data-native approach is the typical small data scenario: companies attempt to use small data tools (traditional BI or spreadsheets) on extracts from their data lakes. The problem is that the data is shrunk down and loses resolution. The process is also highly intermediated, so it is neither responsive nor usable in real time.
A data-native approach represents a new standard for taking advantage of all of the data in the data lake: it preserves the detail and signal that allow big data questions to be answered.
In a data-native approach:
- You analyze the data in place: Taking big data from the data lake and using small data tools on it is like taking a photo of a 3D model: you lose resolution. Instead, you need to be able to analyze the data wherever it resides.
- You empower business users to add meaning: Non-technical users can give a business-oriented meaning and context to the underlying data; that’s called semantic modeling.
- You view and integrate the data across all endpoints and users: This allows you to see all the angles of your data, including who is using it, across all networks, and endpoints, down to individual log files.
- You operationalize the data without lots of extra work: You get faster, more agile business impact by combining deep data from the data lake informed by business context, enabling timely action. Better dashboards aren’t just used at the end of the process, but throughout it, and are designed with the end user in mind.
What Arcadia Data argues is that a data-native approach essentially allows you to ask better questions — what they call big data questions — rather than using tools that occlude the important signals and details and allow analysts to ask and answer only small data questions.
Let’s take a look at what this looks like in practice.
Analyze the Data in Place
With small data, analysts have been accustomed to hammering away at data in a relational database or other repository, crafting queries to explore the data. With this standard setup, you get a view back from each query; you come to know the data a little better and determine whether it needs to be cleaned up to be of further use.
That simply won’t work in the era of big data. SQL queries can summarize data, but big data is often not available in a structured form that SQL can use. Additionally, such a process is highly intermediated. To make big data effective and use it in real time, you can’t rely on data scientists to create batch jobs every time you want to do something, because then you’re in a highly unresponsive, bogged-down state.
Big data demands urgency. It needs to be used while it is fresh. That means analysts and business users need to be able to work on the data directly with new tools. SQL alone won’t allow them to do this, and in reality, there’s no single big data technology that allows a SQL approach to be taken to big data.
Instead, companies need signal-extraction technology that relies on algorithms, machine learning, and AI. Using these new capabilities means that companies get the faster cycle of signal discovery that’s necessary to allow business users to work on data in real time.
If you adopt a process that allows you to work on data in place, you can avoid small data mistakes, such as summarizing a signal out of the data. Think of it this way: if you were analyzing abandoned online shopping carts, you might see that the average number of items in an abandoned cart was around 5. But it’s more interesting to know that there might be a cluster of carts with 3 items and another with 7. That provides far more insight. You want to be able to leverage the level of detail big data can provide.
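To make the abandoned-cart point concrete, here is a minimal sketch with invented numbers: a single average reports roughly 5 items per cart, while the full distribution reveals two distinct clusters that the summary hides.

```python
from collections import Counter
from statistics import mean

# Hypothetical abandoned-cart sizes: two clusters, one around 3 items
# and one around 7. The data is invented for illustration.
cart_sizes = [3, 2, 3, 4, 3, 7, 8, 7, 6, 7, 3, 4, 7, 8, 2, 6]

# The small-data summary: one average that hides the structure.
print(f"average items per abandoned cart: {mean(cart_sizes):.1f}")  # prints 5.0

# The big-data view: keep the full distribution, and the two clusters appear.
for size, count in sorted(Counter(cart_sizes).items()):
    print(f"{size} items: {'#' * count}")
```

The average suggests a typical cart of 5 items, a cart size that barely occurs in the data; only the unsummarized distribution shows the two real behaviors.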
Empower Business Users to Make Sense of Data and Create Data Packages and Products
Using these new AI-powered technologies and algorithmic techniques, you can often quickly find results that tell you something unusual is happening or that detect patterns in your big data. But while the algorithms can identify these anomalies, they generally can’t determine whether the signals are useful and meaningful. Only a human business user can do this well.
That’s the level where the business user can provide the most value and map where future queries or analytics should be modeled. Once the business user offers this type of context, these signals and integrations of data become new layers that are the foundation for new progress. For example, you can assemble all the information for a certain product or customer, and then create high-resolution data models and objects, which can, in turn, be distilled for a variety of analytics.
To get the most value from big data, it is really important that business users are able to package and create data layers that are reusable and connected to a semantic model of the business. Over time, you can develop a standard way of looking at the business with big data and not just small data.
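As a sketch of what a reusable, business-annotated data layer might look like, here is a hypothetical example (the class and field names are my own invention, not any vendor’s API): raw events for a customer are kept at full resolution, a business user tags the object with semantic context, and the object is then distilled into views for different analytics.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerProfile:
    """Hypothetical high-resolution data object for one customer."""
    customer_id: str
    events: list = field(default_factory=list)     # raw, full-detail records
    semantics: dict = field(default_factory=dict)  # business-defined meaning

    def add_event(self, event: dict) -> None:
        # Keep every raw event; no summarization at this layer.
        self.events.append(event)

    def tag(self, name: str, value) -> None:
        # A business user attaches context, e.g. a segment label.
        self.semantics[name] = value

    def distill(self) -> dict:
        # Produce a reusable, lower-resolution view for a given analysis.
        return {
            "customer_id": self.customer_id,
            "event_count": len(self.events),
            **self.semantics,
        }

profile = CustomerProfile("C-1001")
profile.add_event({"type": "cart_abandoned", "items": 7})
profile.tag("segment", "bulk-buyer")
print(profile.distill())
```

The design choice to illustrate is the separation of layers: the raw events stay intact, the semantic tags are added by business users rather than engineers, and only the distilled view is handed to individual analytics.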
View Data Across All Endpoints and Users and Timeframes
One of the most powerful aspects of big data is that you get lots of high-resolution views of the business, and this speaks to the hands-on aspect of the data-native approach. Once you have big data repositories, you usually have more data than ever before, arriving in many different forms: structured and unstructured, real-time and streaming.
Business users need to make sense of the high-resolution views this data makes possible. For that to happen, you have to be able to integrate data across users, endpoints, and timeframes without losing fidelity. You need a technology that can weave all of this data together and create models based on these connections.
If you have a process for handling this that is intermediated by data engineers or scientists, you lose the power of big data. With a data-native approach, analysts can do this integration and analysis on their own as much as possible.
Now, there’s a layer of data engineering that will always be required, and I’m not downplaying the expertise that is needed in certain situations. But you want to get to a state where analysts are doing the majority of the work. Then you have a big data model that lets you answer big data questions, and what you can ask and answer becomes richer.
Operationalize the Data
Once you have integrated and analyzed the data, you must be able to push the insights into production. If you have to redo something to get it into production, you lose power. You want to be able to use the same tools you relied on to create and integrate the big data models to then operationalize them. Hand-offs make things more intermediated, which, again, is the enemy of making big data effective.
Thus a key characteristic of the data-native approach is that all your tools can be operationalized. Once something is operational, it might still need to be optimized. But again, this should all be possible within your system, and it is in a data-native approach.
The Benefits of the Data-Native Approach
The main benefit of the data-native approach is that you get intimacy with the data so you don’t lose signal. This is crucial: you want to extract these signals so you get value, rather than treating big data like it’s small data.
An additional benefit is that you get self-service to accelerate innovation both for analytics and integration.
You can also build wealth by creating reusable layers that become big data models for the entire business.
Ultimately, you build institutional wealth by having more and more of this data put into production and used in the context of the business. Creating these models allows all of this to evolve at all levels. This is what a data-native approach gives you.
When Doesn’t Data-Native Matter?
Why does this all matter? If you aren’t getting new signals, don’t need high-resolution models, or simply don’t have the data, you can stay with a small data approach. Your insights will be limited, but that may be all you need. As a result, though, many of your current small data problems will persist.
And that’s what’s key: your older data didn’t become useless just because of big data. The question is how you can get sources of data that give you a high-resolution view of the business. If you don’t have these sources, then maybe a data-native approach isn’t right for you. But the more data you have, the more important it becomes to adopt a data-native approach.