Creating a Vascular System for External Big Data

Start planning for a world where signals are harvested from external repositories through APIs with built-in analytics.

It is time for CIOs and CTOs to start planning for a world in which signals are harvested from external repositories through distributed APIs that have intelligence and analytics built in. So, today I’m starting a mission to explore how CIOs and CTOs can lead their companies forward to make the most of this emerging opportunity.

Recommendations

I don’t expect most people to rush to implement an architecture around this vision until some disruptive victories are won through the use of external data, big or otherwise. When these initial victories are won, it won’t be in the interests of those who achieve them to let anyone know. My recommendation, one that I am following in my own business, is that it is time to think of acquiring access to external data as a business development activity. Here are a few reasonable first steps:

  • Understand what would be valuable to know. In my Data Not Invented Here article, I suggested that companies play “The Question Game.” This exercise arms them with an understanding of what signals would be valuable.
  • Instrument your internal systems. It is quite likely that internal systems can generate valuable signals. Where could you be capturing new data or recognizing important events that are going unnoticed? Semantic logging inside third-party or custom applications can create a new type of raw material.
  • Seek signals in easy-to-access external datasets. There are lots of data purveyors like Infochimps and Factual who can provide access to new types of datasets. While these companies know what the data can provide in a general sense, they don’t know what signals it can provide that might be meaningful to your business. Start experimenting.
  • Hunt down external data. Once you start to understand how to create signals out of noisy, dirty, incomplete, external data, you will then know where nuggets of valuable information are found inside other companies. The biggest victories will come when you figure out who has the data you need and partner with them to get proprietary access to it.
  • Offer up your data. A greater understanding of how to gain value from external data will lead to a recognition of the value of the data you possess. Don’t be shy. Create an API and offer it up in exchange for money or some other sort of fair trade.
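
To make the semantic logging step above concrete, here is a minimal sketch of what an instrumented application might emit. It assumes nothing beyond the Python standard library; the function name `semantic_event`, the event name, and the fields are all hypothetical and chosen only for illustration.

```python
import json
import time


def semantic_event(event_type, **fields):
    """Return one log line as JSON: a named business event plus key-value fields."""
    record = {"ts": time.time(), "event": event_type}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


# Hypothetical event: a customer abandons a cart worth 84.50.
line = semantic_event("cart_abandoned", customer_id="c-1017", cart_value=84.50)
print(line)
```

The point is that the log line names a business event and carries machine-readable fields, so a downstream tool can treat it as a signal rather than as free text to be parsed later.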

Once this process starts, technology challenges rapidly emerge. As I pointed out in External Data Opens a Disruptive Frontier, most of the time, moving these datasets won’t make sense. The key questions become:

  • How do you get access to the data?
  • How will you distribute intelligence and analytics capabilities so that you can harvest signals from the data without moving it?
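
One way to picture the second question is a sketch in which the analytic travels to the data and only a small result travels back. This is an illustration under stated assumptions, not a description of any vendor's product: `RemoteRepository`, `harvest`, and `high_value_share` are hypothetical names, and a real partner API would add authentication, sandboxing, and limits on what may be computed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RemoteRepository:
    """Stand-in for a partner's data store; the rows never leave this object."""
    _rows: List[dict]

    def harvest(self, analytic: Callable[[Iterable[dict]], dict]) -> dict:
        # The analytic runs where the data lives; only its small result is returned.
        return analytic(self._rows)


def high_value_share(rows):
    """Hypothetical signal: the fraction of orders with an amount above 100."""
    rows = list(rows)
    big = sum(1 for r in rows if r["amount"] > 100)
    return {"orders": len(rows), "high_value_share": big / len(rows)}


repo = RemoteRepository([{"amount": 40}, {"amount": 250}, {"amount": 120}, {"amount": 15}])
signal = repo.harvest(high_value_share)
print(signal)
```

The caller receives two numbers instead of the full dataset, which is the trade the question above is asking about.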

Initial Thoughts

Mastery of APIs will be crucial.

Apigee’s strategy mentioned in the External Data Opens a Disruptive Frontier story shows the way. API technology must become smarter and have a memory. Apigee is creating just such an infrastructure, but there will be plenty of other ways to get this done as well.

Streaming infrastructure will be required.

Signals and data will arrive in small chunks. Technology like Splunk is perfectly suited to harvest and make sense of streams of data. Splunk can also deliver the data to repositories for use by applications.
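
As a rough illustration of harvesting a signal from data that arrives in small chunks, the sketch below flags a value that jumps well above a rolling average. It is plain Python, not Splunk's query language or API; the function name `spikes`, the window size, and the threshold factor are assumptions made for the example.

```python
from collections import deque


def spikes(stream, window=5, factor=2.0):
    """Yield (index, value) whenever a value exceeds factor times the rolling mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        # Only test once the window is full, so the mean is meaningful.
        if len(recent) == window and value > factor * (sum(recent) / window):
            yield i, value
        recent.append(value)


# Made-up readings arriving one at a time.
readings = [10, 11, 9, 10, 12, 30, 11, 10]
found = list(spikes(readings))
print(found)
```

Because the function is a generator that holds only the last few values, it can process an unbounded stream without storing it.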

New databases for new types of applications.

It is likely that many of the signals recognized will have a short time to live. In-memory databases like VoltDB, SAP HANA, and Metamarkets’ Druid (which was recently open sourced) each have different strengths to offer application developers.
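
To show what a short time to live means for an application, here is a toy in-memory store in which each signal expires after a set number of seconds. It is a sketch only and does not reflect how VoltDB, SAP HANA, or Druid work internally; `SignalStore`, its methods, and the key name are hypothetical.

```python
import time


class SignalStore:
    """In-memory store where each signal expires ttl seconds after it is written."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def put(self, key, value, ttl):
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The signal has outlived its usefulness; drop it on read.
            del self._items[key]
            return None
        return value


# A fake clock makes the expiry visible without sleeping.
now = [0.0]
store = SignalStore(clock=lambda: now[0])
store.put("store-42:foot-traffic-surge", True, ttl=30)
fresh = store.get("store-42:foot-traffic-surge")
now[0] = 31.0
stale = store.get("store-42:foot-traffic-surge")
print(fresh, stale)
```

An application built this way never acts on a signal that is older than its stated lifetime, which is the property that matters when signals go stale quickly.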

This data also changes shape frequently. NoSQL databases like Couchbase and Basho’s Riak provide flexible storage systems for data that is partially structured, which is the form much modern data takes. But in many cases, signals will come from data examined over a long time in aggregate. Hadoop and MPP SQL databases like Vertica, EMC Greenplum, and Teradata Aster will form the memory.

The world of databases is exploding with potential. The challenge is finding the right fit for your application.

Machine learning will be required.

Companies like Opera Solutions, AgilOne, Numenta, and ZestCash are pioneering the application of machine learning techniques for business applications. Much external big data will be noisy and dirty and have lots of weak signals. Machine learning techniques will be the only way to find patterns in the vast quantities of such data and monitor it speedily.

High-resolution modeling.

Recognizing a few important events from noisy data is a great start. But the biggest victories will come from creating a much more detailed view of your business, a high-resolution model. Most business systems are based on a low-resolution view.

Now imagine that instead of 10 customer segments, you have 1,000 or 10,000. Machine learning and in-memory technology will be crucial to creating and using such models. In the mobile space, AirPatrol is breaking new ground in creating contextual models that bring location awareness into high-resolution models.

Expand direct access.

One of the less noticed aspects of the Obama campaign’s success is the direct access to big data that was provided by a massive database of voter data stored in Vertica. To really get the most out of external big data, the capacity for discovery and experimentation must be expanded.

The Obama campaign had Hadoop running in the background, doing the noble work of aggregating huge amounts of data. But the biggest win came from good old SQL, providing access to dozens of staffers who could follow their own curiosity and distill and analyze data as they needed.
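
The kind of question a curious staffer might ask can be sketched with an ordinary SQL query. The example below uses SQLite from the Python standard library as a stand-in, not Vertica, and the `contacts` table, its columns, and its rows are invented for illustration; they are not the campaign's schema or data.

```python
import sqlite3

# An in-memory database with a small, made-up outreach table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (precinct TEXT, contacted INTEGER, pledged INTEGER)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("north", 1, 1), ("north", 1, 0), ("south", 1, 1), ("south", 0, 0), ("south", 1, 1)],
)

# An ad hoc question: how many people were contacted and pledged, per precinct?
rows = conn.execute(
    """
    SELECT precinct,
           SUM(contacted) AS contacted,
           SUM(pledged) AS pledged
    FROM contacts
    GROUP BY precinct
    ORDER BY precinct
    """
).fetchall()
print(rows)
```

Nothing here requires a specialist: anyone who can write a GROUP BY can follow a hunch, which is the argument for expanding direct access.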

In my view, QlikView, Tableau, and other technologies like them that allow for do-it-yourself business discovery by a large number of people with varying skill sets will close the loop between the availability of more signals and their use in a business.

My Mission

These thoughts represent my starting point. My mission is to identify specific patterns that draw on the many capabilities mentioned here to create value. Then I want to tackle the leadership challenge and show what sort of organizational and other issues must be addressed in bringing these patterns to life in the real world.

Please comment below if you have suggestions or thoughts that will help me complete this mission.