Is AI Dependent on AI: A Q&A with H2O.ai’s Sri Ambati
It seems like every product on the market today advertises its use of AI. But separating AI myth from reality is crucial for businesses wanting to make the right decisions about what products to adopt.
Recently, Early Adopter’s Dan Woods spoke with Sri Ambati, CEO of H2O.ai. Woods wrote about the company in 2015, describing its platform and how it had created a production-quality version of many of the algorithms used for machine learning in R. In the three years since, a lot has happened. In this edited version of their conversation, Woods and Ambati discuss the evolution of H2O.ai’s platform and the state of AI in general.
Woods: How would you define what H2O.ai is right now?
Ambati: The vision we started with is the same. We remain deeply committed to getting the world’s most powerful toolchain in AI in the hands of the people who need it the most. Often those people are not the ones with the most ability to pay, so democratizing AI has been the theme. That was why we chose open source and started building out some high-grade venture-funded machine learning algorithms and releasing them as open source. It was a way to reach a wider audience. We were trying to get feedback, not code, through our open source. And we were filling a gap that had existed for decades.
We built algorithms that were faster than those on the market, like processing a billion-row regression with our first generation of products in 5 seconds. From then, we’ve seen rampant adoption of the technology. Now there are thousands of companies in the whole AI revolution, both from investments from companies to picking that up as a problem to solve in different verticals. H2O got embedded in a lot of these moments. Now, we’ve seen every major platform that offers machine learning start first being open source, and then trying to build an ecosystem around it. Over 15,000 companies end up using H2O today, and about 200,000 data scientists use it on a daily basis.
If I were in the enterprise and I was trying to understand what H2O.ai means to me, it sounds like what you’re saying is that we’re not a platform like TensorFlow where we’re focused on a specific type of algorithm. You are more of an umbrella platform, so you could use TensorFlow from H2O.ai?
We actually do use it.
And you could use any important algorithm that comes up and any important library that is created. You’re creating an enterprise or a production-quality platform for developing algorithms, tuning them, and deploying into production. In doing that, as an open source company, you have to have a value-capture mechanism. How does H2O.ai make its money?
The vast bulk of our revenue, up until the tail end of 2016 and a bit into 2017, was support and maintenance contracts for our open source H2O platform. In open source, fundamentally you get paid for fear, uncertainty, and doubt around your platform. If you build a really good platform, the fear vanishes over time. One of the unique things we’ve done is making automatic generation of code from the model a core theme of our product. Science, you see, doesn’t scale, but software does. We have 30 million developers, and if even half of them start doing AI, you can create a massive, bigger moment because software is eating the world and you can change almost every function in the world to AI.
And so we took the data science models, algorithms, and automatically generated code that would then go into every cellphone, so that today every model uses H2O’s algorithms to prevent fraud in the payment systems. Many applications online are using H2O in their websites to do A/B testing or predict which product you will buy next.
The reason that this is important is that many of the tools that were used were not created by software engineers. So R, for example, was created by people in the statistics space. It has never actually been retooled with professional software engineering, and so your libraries provide that missing element. Now those libraries are open source libraries, so you just can use them; you don’t have to license them, correct?
They’re open source under the Apache license, so you can embed them and use them.
AI has a two-phase aspect. First is the training and creation of the model, and then there’s the execution of the model, so I’m guessing code generation is all about creating that production scoring algorithm that can run at high speed.
It’s also a bridge between two or three different communities: the data engineers, the data scientists, and the software engineers. And there’s a bit of uncertainty between these three roles and a bit of distrust, as they come together to build as a team. And that’s where the uncertainty remained, so we have close to 100% renewal rates. Because it’s not about the software product being buggy or difficult to use or implement; our open source created a natural tension because it connects three different communities. And as a result, any one of them deciding to buy a support license got us a support contract. Unlike traditional open source where they are trying to build a layer on and the core is reasonably difficult to use, our open source product is actually very easy to use; it’s dead simple. In fact, you need very little support and maintenance costs on our side.
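As a rough illustration of the split between training a model and executing it, the sketch below turns a set of fitted parameters into standalone scoring source code that can run without the training stack. It is a minimal sketch under stated assumptions, not H2O’s code generator: the feature names, weights, intercept, and the `generate_scorer` helper are all hypothetical, and the model is a plain logistic regression chosen for brevity.

```python
# Hypothetical fitted logistic-regression parameters. In practice these
# would be the output of the training phase; the values here are made up.
FEATURES = ["amount", "hour_of_day", "is_foreign"]
WEIGHTS = [0.8, -0.1, 1.5]
INTERCEPT = -2.0


def generate_scorer(features, weights, intercept, name="score"):
    """Emit Python source for a scoring function with no model dependencies."""
    terms = " + ".join(f"{w!r} * {f}" for w, f in zip(weights, features))
    args = ", ".join(features)
    return (
        "import math\n\n"
        f"def {name}({args}):\n"
        f"    z = {intercept!r} + {terms}\n"
        "    return 1.0 / (1.0 + math.exp(-z))\n"
    )


# Training side: produce the source text that would be shipped.
source = generate_scorer(FEATURES, WEIGHTS, INTERCEPT)

# Deployment side: load only the generated source, with no access to
# the training code or the training data.
namespace = {}
exec(source, namespace)
score = namespace["score"]

print(source)
print(score(amount=2.5, hour_of_day=0, is_foreign=0))
```

The point of the design is that the data scientist owns the parameters while the software engineer only ever handles the generated function, which is one way to read the "bridge" between the communities described above.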
Is your business model evolving from the support-only business model?
Wins from the customers paying support maintenance ended up getting us into the Gartner top position among our peers. Because we serve customers well and are devoted to making them successful, they have given us top scores in customer support and sensitivity to their feedback.
But fast forward to the turn of 2016. We saw that all of software is going to be replaced with, rewritten by, or made better with AI. AI also started influencing hardware. The workloads of software transformed hardware, and GPUs and newer technologies, like TPUs, FPGAs, and ASICs, became much more prevalent and accepted by the turn of 2016. And so we did an experiment to see how our machine learning platform ported to GPUs would do. And it was another 20 to 30 times faster. So from five seconds we now went to hundreds of micros for a billion-row regression. That means that now, we can do a lot more within the second. Going from 8 hours to 8 minutes to 8 seconds meant that you could use human clickstream mindsets to do things. But now, at millis and micros, you’re really at a point where you have to automate it. So that got us to thinking about how to expand the core value proposition of H2O. Instead of focusing on what I call the assembly language of AI, which is machine learning algorithms, can we now build a compiler that automatically generates a model for a given problem at hand with a given dataset, say a time series problem predicting weekly sales? We wanted to automate things like that.
So this is similar to the automated machine learning approach that something like DataRobot uses for creating predictive models?
It’s a design pattern for all sorts of automatic machine learning. DataRobot was one of the early players who mentioned this as their core focus, but they were unable to go down the layers and look at, say, the rise of GPUs. Because we built machine learning algorithms, we could go one layer below and say, “Okay, let’s port this and see what happens. Let’s port this to other platforms and take advantage of the end-to-end stack that can then transform the entire space.”
We then said can we come up with templates? We call them recipes. And those families of recipes, some of which we built, some of which our customers or our communities can build, can we use them to take advantage of our open source mindset and make a collaborative framework to which others can add their own vertical recipes? And while we are focused on building the horizontal recipes, we are then integrating it with other products, like Google AutoML or Amazon SageMaker. We bring a desire to see the overall AI space succeed.
This is what you call “Driverless AI”?
Yes. We put it under an umbrella of Driverless AI. Now it uses other tools as well. PyTorch, for example, which passed that barrier. We expanded our core team to reflect that. We have systems engineers who essentially are building closely with mathematicians and physicists. Physics folks became the core of the company. As data scientists were using our product, instead of trying to give them recipes to build, we took their recipes and productized them because of the systems thinking we have in the company. With H2O, our mindset is scalability across multiple dimensions. We do the reverse problem, which is we solve it the other way around and leave all the features, using a genetic algorithm.
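To make the idea of a genetic algorithm over features more concrete, here is a minimal sketch of feature-subset selection on synthetic data. It is not Driverless AI’s method: the dataset, the correlation-based fitness proxy, the per-feature penalty, and every name in the code are assumptions made for illustration. A real system would score each candidate subset by training and validating a model.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data (an assumption): six features, only 0 and 3 drive the target.
N_ROWS, N_FEATURES = 500, 6
X = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(N_ROWS)]
y = [2 * row[0] - 3 * row[3] + rng.gauss(0, 1) for row in X]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


# Cheap stand-in for model accuracy: absolute correlation with the target.
RELEVANCE = [abs(pearson([row[j] for row in X], y)) for j in range(N_FEATURES)]
PENALTY = 0.3  # cost of including a feature, discourages noise features


def fitness(mask):
    """Score a 0/1 feature mask: total relevance minus a per-feature cost."""
    return sum(r - PENALTY for r, bit in zip(RELEVANCE, mask) if bit)


def evolve(pop_size=20, generations=40, mutation_rate=0.1):
    """Evolve feature masks with truncation selection, crossover, mutation."""
    population = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, N_FEATURES)   # single-point crossover
            child = mum[:cut] + dad[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]           # random bit flips
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print("selected feature indices:", [j for j, bit in enumerate(best) if bit])
```

Because the fitter half of each generation is carried over unchanged, the best mask found so far is never lost, which keeps the search stable even with a fairly aggressive mutation rate.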
You determine which features are important and increase the accuracy. So H2O’s gone from improving some core machine learning algorithms to creating a machine learning environment where you can do driverless AI. And then you’ve also expanded with a couple of adaptations like the Sparkling Water, which is taking machine learning algorithms to Spark. And are there any other parts of the product?
Those are the three core offerings. And Driverless AI, interestingly enough, we chose it to be a proprietary license. Then you have the open source flavors of pure machine learning, and its connectors into Spark and Hadoop. And then we built a completely new product, which is solving the concept of feature engineering, experimental efficiency, and preventing overfitting.
How do you capture the value on that? Is it by core? Do you charge support by cores or by named users?
In open source, we wanted as many users as possible, and so there we priced it by memory. And on the closed source, we have a per-seat license and a consumption model license. The per seat is when you have data scientists using Driverless AI. And as compared to other players in the space, our vision is to augment and enhance data scientists to do more, as opposed to replacing data scientists.
There’s still a chef in the kitchen.
Yes. But they have a power tool.
You can do the automated machine learning to understand what the data can tell you, and then you can do the automated machine learning to take those signals and make them into a predictive model.
Yeah. But this still solves only two of the major crises in AI. One is the talent crisis and the other is that there’s not enough time. Time being if your competition has already adopted AI, you need to race to finish it. The third crisis in AI right now is trust. How do you trust that this model is doing well? And almost every time we run into other toolchains customers have considered, they want to know what the safety conditions are. What are the boundary conditions? How do you get the explainability or interpretability of models in production and get them to consent, collaborate, or conflict?
You see a whole slew of products that emerged in that space, but where we are thinking of both the AutoML space and the explainability space, is we have expensive DBAs that have emerged after the rise of databases and ML is the new SQL. And so we have data scientists, really good ones, becoming quickly the most expensive resources in building AI-first applications. What we ended up asking is, Can we create an AI that can build most of AutoML, which we see as an AI to do AI? Can I use AI to reduce the complexity of building a model? Can I use AI to prevent overfitting? We’re codifying our expertise into software.
My view is that the vast majority of companies are not going to use H2O, but are going to use products developed by people who use H2O. Then they will be able to buy those products, and to use those products successfully, and to use them better than anybody else, they need to be sophisticated consumers. At some level, people need to be able to detect things that don’t make sense, and understand, as a sophisticated consumer does, the way AI is being used in products. Second, it’s important that you can successfully use products to master your data landscape. And what I mean by that is understanding all the data that you have. How can I clean that data? How can I manipulate and reshape that data? How can I deliver the data relevant to every workload? And if I am able to do that better than other people, the AI products that I have will work better for me than they would work for other people. Now, you disagree with that statement, right?
I think any systems revolution or technology innovation that doesn’t produce consumable applications will die and will not see fruition. Big data is a notable example, where it did not produce a huge ecosystem of big data applications because the average application doesn’t necessarily have that problem. Plus the APIs were not necessarily there to build an application. So we’re clearly not saying that. We are saying AI-first applications will rule the future. In fact, almost all applications and all verticals are being touched already, whether in the enterprise space or even the consumer space. And the whole geopolitical landscape is being affected by AI today. And so AI is touching every phase of life. AI is going to 3D print AI. ASICs will permeate every form of life. In fact, H2O.ai is named such that H2O is needed for life on this planet, and AI for life on other planets. And so in our mind, AI should be like water. But at the same time, we don’t want to overhype the space to the point where there’s dramatic fear in the population that AI is actually going to replace humans everywhere and create chaos in the economy, as well as in the overall world as we see it today.
So do you believe that people who have a better mastery of their data supply chain will be able to more successfully deploy AI?
The fastest time-to-market for people comes for those who have well-defined data supply chains. But one of the interesting nuances, a topic where I diverge from all that, is data is very interesting. Data has gravity, so you are pulled into the supply chain on prem, as opposed to going on the cloud or the newer technology base. Data has an expiry date, so if someone has not used their historical data over the last six months to two years, most of the data is no longer that interesting for future events. So data’s expiry date means that people who think that they have this vast amount of data as their advantage are going to be suddenly in a shock. The second big development that’s happening is generated data. People can create very real looking fake datasets. We are seeing some of that as fake news and fake emails that go out.
To take an example, money laundering takes suspicious reports that are created and those reports are then human-curated to say, “Is this good or bad?” But as a result, banks end up creating very large numbers of positives, of which, again, a large chunk are false positives. The true positives are the ones that you need to feed back to improve the algorithm. With the advent of newer techniques like GANs, you can now generate data and create an adversarial network that essentially fights this generated data, detecting what is fake and what is real, to the point where together they are able to identify the problem.
What you’re doing is you’re allowing the AI to play itself the same way AlphaGo played itself and got better by playing itself? So you’re certainly not arguing against having a good supply chain. But you’re saying that if you think that that’s going to be the biggest advantage, it may not be?
Yeah, I think the easiest of the value creation with AI is taking a very important business process that the customer has, and then reducing its time from six months to say six days or six hours. A simple example: if you’re leasing a property, today it’s a multi-month lease negotiation process. And then you look at WeWork, where you can rent or get an office space in a matter of a few minutes.
One of the areas I’ve been studying is cybersecurity, and there’s a huge number of applications of AI in cybersecurity, for everything from anomaly detection to various types of active defense. How is a space like cybersecurity being affected by the kind of things that H2O offers?
Cybersecurity is a very mature space for algorithms. The best algorithms have always been in use, by both sides of the equation, the folks trying to protect and the folks trying to attack. And they’ve always used distributed means of attacks. And corporations are particularly susceptible to algorithmic attacks coming from everywhere in the world. And with the proliferation of devices, we now have more vectors of attack than before. And every bit and piece in that whole equation is a time series. And so streaming data and time series problems have occupied a bulk of our automatic machine learning thinking, and a large part of AI research has changed how we look at time series. Now we’re seeing the combining of text data with time series and transactional data to build a more robust AI solution. They’re figuring out how data is flowing through different parts of my infrastructure, how processes are going through infrastructure and labeling them at the first sign of an anomaly, and then tracking them. We’re building tools that can take the protectors of these systems at the point of attack, not just after an intrusion, or after an intrusion doing the usual normal things. So overall, traditional housekeeping can be mostly automated with AI. AI just enables that overall understanding, how our employees work, how the insiders are moving and using our systems.
You predict it’s going to make almost every type of cybersecurity better?
Yes.
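The idea of labeling activity "at the first sign of an anomaly" in a time series can be sketched with a trailing-window z-score. This is only one simple baseline, assumed here for illustration, and not a description of H2O’s tooling; the `flag_anomalies` function, the window size, the threshold, and the traffic numbers are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def flag_anomalies(series, window=20, threshold=4.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(series):
        if len(history) == window:  # wait until the window is full
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                flagged.append(i)
        history.append(value)
    return flagged


# Hypothetical requests-per-minute series with one injected spike.
traffic = [100 + (i % 5) for i in range(60)]
traffic[45] = 400

anomalies = flag_anomalies(traffic)
print(anomalies)
```

One limitation worth noting: once a spike enters the trailing window it inflates the standard deviation, so follow-on anomalies can be masked until the spike ages out. Production systems typically use more robust statistics or learned models for that reason.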
One of the things that is tempting to do with all of these tools that are out there is to roll your own AI solution. Obviously, you’re an advocate of using AI. But—and as I said earlier, my view is that AI is going to be adopted most widely through products that other people create and then sell, and then most people will use those products. Now I believe that there’s going to be, just as there are in every era, a lot of people who want to punch above their weight class, and they say that instead of waiting for the products or using products, they’ll actually try to develop AI, when maybe they shouldn’t. And so what do you think the necessary conditions are where it makes sense for you to actually do the work to develop your own AI/ML solution? And what are the signs that you should just wait and use somebody else’s product?
It’s a matter of getting the rest of the industry to code on the vision that we have wanted. It’s not just us coding on machine learning algorithms. Everybody is coding machine learning algorithms with large software stacks. Now we disrupted ourselves by going one layer above and starting to use others’ algorithms instead of recoding PyTorch or recoding a deep learning GAN algorithm implementation in Keras, which runs on top of TensorFlow. So we have chosen at times to collaborate, at times to build, at times to fill the gaps where needed, where others have not yet focused on that. Because overall, we want to make sure we have all the best algorithms easily accessible. But now, with AI becoming a must-have for everyone from the largest countries in the world to the largest corporations in the world, there is a temptation to build algorithms that have been built already. And some companies have done that, and have even successfully delivered a non-generic, customized engine for their own use, to a point where it’s much better than the generic platform. It becomes their IP if you will, or their advantage over the next competitors coming into that space. Algorithms are code, and code is a commodity to open source. And that deflationary aspect of open source, and technology in general, is playing against anybody who is trying to go build generic core algorithms.
Inventing new math has some value, and that’s happening in spades across academic circles because of AI, and we work closely with university professors and sometimes fund their work to even get them to advise us, and learn their latest insights. But I think what is critical for a good AI-first application is actually the ability to make new data, which is forcing feedback from the user base. So every large company today faces an existential threat from AI. In other words, if they don’t transform themselves and their business processes with AI, someone else will. But the massive advantage every large company has is their community of users. It is in their interest to get that community of users to give feedback. And feedback in a way that no one else can get. And so AI-first economies are going to be dominated by the companies that integrate the feedback loops back to improving their data ecosystem and using that to go build a better application.
What you’re saying is it’s not just about having a data supply chain that is able to move and transform data, it’s about having a data supply chain that can create or find data that nobody else has?
Yes. The unique value prop. Now the other aspect of this is that giant conglomerates, say Alibaba, have a distinct advantage in using AI. And this boils down to thinking about who is your competitor? An insurance company’s competitor is no longer an insurance company, or a payment systems company’s competitor is no longer a better credit card. It’s really other software companies that are coming into their space with AI. So data is a horizontal play, and the best discoveries in a particular business unit oftentimes are not useful to that business unit. There’s a lot of accidental discovery from doing data science everywhere, including in industries. And so a payroll company could find out that they can predict the salaries of any company’s employees for the next five years. But that doesn’t help the payroll company, being a customer. They are judged by whether they can support a new geography or a new city. But next five years’ salaries is useful for a Carnival Cruise company or a car company that’s trying to sell cars, or someone else who is trying to give them access to credit. So I think that kind of ability to transcend the traditional boundaries of a business unit and create a marketplace for intelligence is key.
Do you have any guidance for when to and when not to, try to do your own AI?
Yes. So the biggest AI risk that we see is actually around not having the right kind of talent to apply on the problems at hand. A very large credit card company down the street here in New York City was always building their own algorithm platforms, even including boosting machines and other pieces. And for years they were right in their decision when there was no viable open source or viable alternative out there that was really good. And it’s been now proven that not just us, but several people after that have built good machine learning platforms. Now if someone comes up with a new mathematical technique that is not yet available, and TensorFlow has a few of them, then they could choose to build it themselves. And keep it proprietary and have that advantage over the rest of the world. And I think even that large credit card company has now adopted open source after six, seven years of delay of getting it into their stacks. Now they had the wherewithal to build it, but couldn’t retain the talent. Over time, AI became hot, and these talented systems builders left the building and joined Facebook and Google and others. So it’s very important to pick the battle at the right level of infrastructure. Some companies that try to not use outside platforms can succeed. Like Stitch Fix. Stitch Fix built their own platform and chose to ignore trying to learn the open source frameworks that are still changing and molding around this time, and when they started they basically chose to build their own, attracted talent from Netflix and then customized everything and ended up winning. But the real cause célèbre for their win was the ability to force the feedback. They would ship the clothes to your doorstep. You had to keep the violet sweater or ship the yellow one back. And that gave them additional information that no one else had, of which colors you don’t like. That was probably the bigger aha moment of their overall cycle.
Essentially, it’s like being an actor or a comedian—are you going to sit here and say you will not be the next Leonardo DiCaprio or Claire Danes? You’re going to try to be the best even if it’s not easy. And that doesn’t mean that you’ll be successful. If you have the right talent, if you have the right data, if you develop a way to create data that nobody else has, and get feedback nobody else can get, you have a high chance of success.
I think the bigger challenge for being in the AI-first economy would be to bring that transformative aspect of can I bring the best-of-breed from my company, and take them away from what they’re doing and ask them to focus on this problem? The transformational nature of AI culture means that it will come back with a counterintuitive answer. It’s the Moneyball for business. It’s going to come back with a completely different viewpoint on a problem, on a solution, than what was preconceived. And the courage to make those choices that AI’s recommending forces one to think out of the box. The other aspect of AI, which is again a much less talked about topic, is that AI is available both to good and to evil. And so we are seeing the rise of AI that is trying to kind of take over even the democratic process. So how do we build a strong ethical backing for AI? And of course this is one of the least talked about topics in industry. How do we take the best people of our generation and have them solve the most purposeful problems? For the last 500 years, science was the most important thing to do. So the onus now pushes to how do we get imagination and purpose and empower them with data, so that AI can solve that problem. I think the onus shifts away from the traditional worshiping of the latest speeds and feeds and technology that comes from the Valley. I think the new world will be more around am I applying my best talent to solve the most important problem for my company? And how do I use either open source or open data or data privacy movements to create more trust for my brand community? It’s now proven that people who build good communities live longer. And so being social animals, we will end up having to build large communities, and using AI and open source and open data to create that community of trust and value is going to be interesting for the economy.
How do you interpret the importance of the purchase of Red Hat by IBM?
Very pertinent question. I actually think it was a fantastic move. Red Hat is the only proven large open source company with a tremendous track record and deployments across every industry. And so it was a one-of-a-kind. Red Hat is the Taj Mahal of open source. From that standpoint I think it’s very much a sweetheart deal in many ways for IBM, despite what the markets might think right now. It will infuse both an amazing bounty of leadership and talented open source professionals across every aspect. Open source is no longer a good-to-have or something you have alongside a closed source offering. You need open source first, and then try to build monetization strategies around it.
Do you think that it will accelerate the replacement of Watson’s proprietary technology with open source technology that is branded as Watson?
I think Red Hat hasn’t done a lot of investments in AI. So I think this probably has an immediate impact on the cloud wars and the on prem versus cloud wars. And strengthens IBM’s on prem offerings as well. Historically, the Watson ecosystem is closed, and I think this should take the same path, where eventually they will have to seek open ecosystems in AI as well.