Explaining Spark to Your CEO

There comes a time in the life of every important technology where the CEO gets interested and wants to know what the fuss is all about. That time has come for Spark, an open source platform for creating applications using big data, streaming data, and machine learning.

From a CEO perspective, Spark’s breakout moment perhaps came when IBM announced in June that Spark was “Potentially the Most Significant Open Source Project of the Decade.” This announcement was more than just a press release. IBM said it was committing 3,500 people to work on Spark-related projects, would seek to help train more than 1 million people, and would embed Spark in a wide variety of its products and platforms. In addition, IBM is donating some machine learning technology to the Spark project. Not since IBM embraced Linux has the company given such a serious digital bear hug to a technology.

Why is IBM doing this, and why are so many other players, large and small, seeking to connect to Spark and make it work in their environments? This is what the CEOs of the world want to know: not so much how Spark works, but why it seems to matter so much.

IBM’s statement in its press release fell short of a workable explanation:

Spark is agile, fast and easy to use. And because it is open source, it is improved continuously by a worldwide community. Over the course of the next few months, IBM scientists and engineers will work with the Apache Spark open community to rapidly accelerate access to advanced machine learning capabilities and help drive speed-to-innovation in the development of smart business apps. By contributing SystemML, IBM will help data scientists iterate faster to address the changing needs of business and to enable a growing ecosystem of app developers to apply deep intelligence to every thing.

The problem with that paragraph is that you could substitute many open source solutions for Spark and it would still make sense.

Our job in the rest of this article is to offer a nourishing explanation of what CEOs need to know about Spark.

What is Apache Spark?

The first thing to know about Spark is that it is an open source project that emerged from Berkeley’s AMPLab. Started in 2009, the project fills an important role in the gradual emergence of a full application development stack for big data and streaming applications.

In my article, “Explaining Hadoop to Your CEO,” I pointed out that Hadoop was essentially an extension of the operating system to handle the need to store and process big data sets. The magic trick that Hadoop performs is to allow huge amounts of data to be stored on a file system distributed over a cluster of commodity hardware. Hadoop created the Hadoop Distributed File System (HDFS) to store these massive data sets, and implemented a programming paradigm invented by Google, called MapReduce, to allow that data to be processed. Hadoop changed the economics of storing and processing big data so profoundly that it took a long time before the early adopters got annoyed with having to program in the initial set of APIs.
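For readers who want to see the shape of the MapReduce idea, here is a minimal sketch in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for illustration, and everything runs on one machine rather than a cluster.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(chunks):
    # On a real cluster, each chunk would be mapped on the machine that
    # stores it; here the chunks are simply processed one after another.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))

# Hypothetical input: two chunks standing in for blocks of a large file.
chunks = ["big data big clusters", "data on commodity hardware"]
counts = word_count(chunks)
print(counts["big"], counts["data"])  # prints: 2 2
```

The point of the pattern is that the map step needs no coordination between chunks, which is what lets the work spread across many cheap machines.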

Spark is important because it was built from the ground up to be an environment for developing applications that run on distributed computing clusters. Spark organizes the data it uses into Resilient Distributed Datasets (RDDs), which are the core containers for data in Spark in the same way that files in HDFS are the core containers for data in Hadoop. A Spark application works like this: you connect to a data source, pull the data you are interested in into RDDs, and then write programs against them. Remember, Spark is not a replacement for Hadoop. Spark is happy to suck data into RDDs from Hadoop clusters all day long. Of course, Spark also consumes data from dozens of other sources.
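To make the RDD idea a little more concrete, here is a toy model in plain Python. It is an assumption-laden sketch, not Spark's API or implementation: the ToyRDD class, its methods, and the order amounts are all hypothetical. It only illustrates three ideas: data split into partitions, transformations that are recorded rather than run immediately, and a recorded history that can be replayed on any partition.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; not Spark's actual API."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions   # list of lists, one per imagined machine
        self.lineage = tuple(lineage)  # transformations recorded but not yet run

    @classmethod
    def from_source(cls, records, num_partitions):
        # Split the source records round-robin across the partitions.
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def map(self, fn):
        # Record the transformation; no data is touched yet.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.lineage + (("filter", fn),))

    def _compute(self, partition):
        # Replay the recorded steps on one partition. The same replay could
        # rebuild a partition whose results were lost.
        for kind, fn in self.lineage:
            if kind == "map":
                partition = [fn(x) for x in partition]
            else:
                partition = [x for x in partition if fn(x)]
        return partition

    def collect(self):
        # An action: only now is any work actually done.
        return [x for part in self.partitions for x in self._compute(part)]

# Hypothetical order amounts pulled from some data source.
orders = ToyRDD.from_source([120, 45, 300, 80, 15, 950], num_partitions=3)
large_orders = orders.filter(lambda amount: amount >= 100).map(lambda amount: amount + 10)
print(large_orders.collect())  # prints: [130, 310, 960]
```

Note that the original orders object is never modified; each step produces a new dataset that remembers how it was derived.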

In a way, Spark is similar to Hadoop in that both are systems for distributed processing. The Spark core manages a cluster of computers that are used to run applications over data stored in RDDs. On top of the Spark core, there are components for writing different types of applications.

Spark is different from Hadoop in that it is focused on supporting a powerful and expanding set of capabilities for big data and analytics applications. Spark has five different programming environments: one for SQL applications, one for machine learning, one for streaming, one for graph processing, and one for R. You can also write code in popular languages such as Python and Scala. There will likely be more to come. All of these run on top of the Spark core. In the latest release, the MLlib and GraphX libraries have been enhanced.

So, what’s cool about Spark from an application developer’s view is that you can use SQL to do one kind of work on a set of RDDs, then pass those RDDs over to a machine learning application, and then perhaps send them to a graph application, and so on. The data doesn’t need to be moved around to be used in different ways.
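The idea of one dataset feeding several kinds of work can be sketched in plain Python. This is only an analogy under stated assumptions, not Spark SQL or MLlib: the sales records and the helper names select_where and fit_line are invented for illustration. A SQL-style filter runs first, and its result is handed directly to a simple machine-learning-style step without being exported or copied.

```python
from statistics import mean

# One shared, in-memory dataset (hypothetical sales records).
sales = [
    {"region": "east", "ad_spend": 1.0, "revenue": 3.0},
    {"region": "east", "ad_spend": 2.0, "revenue": 5.0},
    {"region": "west", "ad_spend": 3.0, "revenue": 7.0},
    {"region": "east", "ad_spend": 4.0, "revenue": 9.0},
]

def select_where(rows, predicate):
    """SQL-style step: roughly SELECT * FROM rows WHERE predicate."""
    return [row for row in rows if predicate(row)]

def fit_line(rows, x_key, y_key):
    """ML-style step: least-squares fit of y = slope * x + intercept."""
    xs = [row[x_key] for row in rows]
    ys = [row[y_key] for row in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

# Query first, then model the query result directly.
east = select_where(sales, lambda row: row["region"] == "east")
slope, intercept = fit_line(east, "ad_spend", "revenue")
print(round(slope, 6), round(intercept, 6))
```

The filtered rows are the very same objects as the originals, which is the small-scale version of not having to move data between tools.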

The project is governed by the Apache Software Foundation’s governance process, but it is controlled to a large extent by its founders, who run a company called Databricks that is commercializing Spark.

Spark, like Hadoop, has become the center of a large amount of development energy. Most large big data players, both new (Cloudera, MapR) and established (IBM), are eager to show how they can make use of Spark in their environments. Even Palantir, a company that has been incredibly secretive about pretty much everything, took the stage at Spark Summit East to show how it was using Spark to create a data prep environment.

The trick for the commercializers is to find a way to channel this excitement into a purchase of their offerings.

Why Does Spark Matter to a CEO?

So, why and when does Spark matter to a CEO? For most CEOs, Spark doesn’t matter. It is just another technology that allows for the creation of powerful big data applications. If that is your business, as it is for IBM, then Spark is potentially a game changer. Or, if insights from big data are going to be crucial to the success of your business, then being on the leading edge of innovation is crucial.

But if we take a look at the situation facing most CEOs, the sad truth is that a huge number of projects need to be done before using Spark to find and exploit signals from big data becomes the critical path.

Before Spark becomes important, you must be able to do the following things:

  • Land big data in an environment that allows it to be discovered by your staff.
  • Clean and prepare the data for analysis.
  • Examine and analyze the data to find the signals it contains.
  • Blend the data with other sources.
  • Transform the data into shareable objects to support analysis.
  • Keep track of the lineage of the data and the transformations applied to it.
  • Apply analytics to the data.
  • Create reports, visualizations, and dashboards.
  • Integrate the insights found in the operations of your business.

To be sure, a programmer could use Spark to accomplish some of these tasks. But most CEOs should be looking for environments like Platfora and other products that deliver this end-to-end experience.

As a CEO, your goal is not using Spark, but getting the needed data into the hands of as many people as possible in your company so that they can make better decisions. When Spark becomes the most crucial next step, you will know your company is doing pretty well in terms of making use of data.