The Future of Search: A Q&A with Grant Ingersoll

Donald Trump raised the issue of the fairness of Google search. Citing a variety of specious sources, he asked, “Is the Google Search ranking algorithm biased in some way?” While I don’t think he really has much of a point, it is interesting to think about why search is the way it is, and whether it has to be that way.

I spoke with Grant Ingersoll, a contributor to Solr, a cofounder of the Mahout machine learning open source project, and cofounder and CTO of Lucidworks, to help sort this out. This Q&A is drawn from an edited transcript of our podcast.

Earlier in my career, I worked on a Money Magazine ‘Best Cities’ calculator. Money Magazine had for years done a ‘Best City in the United States’ ranking in which they took many different dimensions of a city, measured them, and then put them in a model that combined all the dimensions with various weights.

On Pathfinder we changed the ‘Best Cities’ ranking system so that you could put in your own weights. You could decide whether art or food or culture was most important to you. This allowed people to get their own ‘Best Cities’ ranking. I think that’s essentially what’s up for grabs when we talk about what could happen with search.
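A system like that can be sketched in a few lines. The city names, dimensions, and weights below are made up for illustration; the point is only that each reader supplies their own weights:

```python
# A minimal sketch of a user-weighted 'Best Cities' ranking.
# All city names, dimensions, and scores here are hypothetical.

def rank_cities(cities, weights):
    """Score each city as a weighted sum of its dimension scores,
    then sort best-first."""
    def score(city):
        return sum(weights.get(dim, 0.0) * value
                   for dim, value in city["scores"].items())
    return sorted(cities, key=score, reverse=True)

cities = [
    {"name": "Springfield", "scores": {"art": 7, "food": 5, "culture": 6}},
    {"name": "Shelbyville", "scores": {"art": 4, "food": 9, "culture": 5}},
]

# One reader cares most about food; another cares most about art.
# Each gets a different 'Best Cities' ranking from the same data.
foodie_ranking = rank_cities(cities, {"art": 0.1, "food": 0.7, "culture": 0.2})
artsy_ranking = rank_cities(cities, {"art": 0.7, "food": 0.1, "culture": 0.2})
```

The same underlying data produces different rankings depending on whose weights are plugged in, which is exactly the choice most search engines make on our behalf.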

Dan Woods: First let’s talk search in general. What is really going on? What are the basic things that happen?

Grant Ingersoll: In most consumer-based search engines, there are usually at least three levels at work.

First, there are core algorithms under the hood that take care of parsing text, acquiring content, getting it into the engine, and doing all of the pre-processing and building of data structures that make it easy to look up keywords at runtime.

When you search with Google, there’s a very basic level of matching that finds content with those exact words or words like those or at least the concepts.
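The indexing and matching steps Grant describes can be sketched with a toy inverted index. This is a deliberately minimal illustration with made-up documents; real engines add stemming, synonyms, ranking, and much more:

```python
# A toy inverted index: map each keyword to the set of documents
# containing it, so keyword lookup at query time is a cheap set operation.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    # Return documents containing every query term (simple AND matching).
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "open source search engines",
    2: "search ranking algorithms",
}
index = build_index(docs)
results = search(index, "search engines")  # only doc 1 has both terms
```

Everything after this basic matching level, popularity, personalization, machine-learned weights, is layered on top of structures like this one.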

On top of that is some notion of popularity. Some might be familiar with PageRank or clickstream analysis.

Google has a lot of really smart people, but the reality is that we as consumers do the majority of the work. We vote with our fingers. We tell Google what’s important. And not individually; all of us together say, “Hey, I like this site, not that site.”

How do we tell Google that? How do they learn from us?

In the search business, we talk about implicit and explicit signals. An explicit signal is clicking on something. But there are also implicit signals like how much time you spend on a page, or how long you took to decide which document to click on. Did you scroll? Did you walk away from the machine, go to a different page, open a new tab? Did you rewrite your query because you weren’t happy with the results?
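The implicit signals above can be combined into a rough interest score. Here is a minimal sketch, assuming hypothetical signal names and weights; `dwell_seconds`, `scrolled`, `rewrote_query`, and the specific coefficients are all illustrative, not anything a real engine uses:

```python
# A hedged sketch of aggregating implicit signals into an interest score.
# The signal names and weights are illustrative assumptions.

def interest_score(session):
    score = 0.0
    # Long dwell time is a positive signal; cap it so one long read
    # doesn't dominate everything else.
    score += min(session["dwell_seconds"], 120) / 120.0
    # Scrolling suggests engagement with the page.
    score += 0.3 if session["scrolled"] else 0.0
    # Rewriting the query suggests dissatisfaction with the results.
    score -= 0.5 if session["rewrote_query"] else 0.0
    return score

happy = {"dwell_seconds": 90, "scrolled": True, "rewrote_query": False}
unhappy = {"dwell_seconds": 5, "scrolled": False, "rewrote_query": True}
```

A real system would learn these weights from data rather than hand-set them, but the shape of the computation is the same.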

There are a lot of things that people have worked out over the years that are good indicators of whether somebody likes something or not. And then there’s this whole concept of negative space. I may say I really like Beethoven and I hate Tchaikovsky, but that still tells you I like classical music, because I didn’t even bother to rate Metallica. So there’s this whole negative space of things you didn’t even bother with that comes into play.

But once you have this evidence, an algorithm determines which dimensions are most important. What are the choices you make in determining how to rank results?

It’s true that ultimately people working on search choose the algorithms to apply. And most search engines do have some notion of editorialization: nudging results, or at least providing feedback to the engine that says, “Hey, this is right,” or “No, that is right.”

Especially in the earlier days of search engines, a lot of the weights, like in your ‘Best Cities’ example, were picked by hand. It was up to me as a search engineer to say, “The title of an article is more important than the last paragraph” or “I care more about user reviews than I do about purchase price,” or something along those lines.
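Hand-picked field weights like these can be sketched as a simple boost table. The boost values and documents below are illustrative assumptions, not anything a production engine ships with:

```python
# A sketch of hand-tuned field weighting: a term match in the title
# counts more than a match in the body. Boost values are illustrative.

FIELD_BOOSTS = {"title": 3.0, "body": 1.0}

def field_score(doc, query):
    terms = query.lower().split()
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        tokens = doc.get(field, "").lower().split()
        # Each term occurrence in a field contributes that field's boost.
        score += boost * sum(tokens.count(term) for term in terms)
    return score

doc_a = {"title": "Solr relevance tuning", "body": "notes on ranking"}
doc_b = {"title": "Release notes", "body": "relevance tuning tips for Solr"}

# The same words in the title outrank the same words in the body.
```

This is the kind of judgment call a search engineer used to make by hand; today, as Grant notes below, many of these weights are learned instead.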

You also mentioned dimensions like freshness or relevance. What other dimensions are involved? Let’s imagine that someday we might be able to choose. What would be on that menu?

There are a lot. You hit on a number of key things, especially when you’re talking news. Obviously, freshness is a huge factor there. There’s also the notion of the authority of that site, the popularity of that site. Is it a newspaper of record, if you will, from the old days? Some personal choices go into those things as Google learns your behavior over time. For me, as a techie, when I go into Google, it knows that I tend to want things that are more technically oriented than perhaps equivalent terms that aren’t technically oriented. So there are a lot of biases and levers that we can build into a system to help essentially narrow down all of that data.

Another dimension is length of the content. Like most people, I think, all else being equal, I would rather read something that’s shorter than longer if I get the same information out of it. And then of course there are the editorial goals of the company or the newspaper itself. There’s as much to what you don’t show as there is to what you do.

When we go on Google or any public search engine, we get a much better experience than we get from our own company’s search engine. We usually find it much harder to find what we need internally. What makes public web search better? Several years ago I wrote an article about how people were disappointed in the Google Search Appliance because when they plugged it into their corporate data, they found that they didn’t get nearly as good choices and rankings as they did in public. Why is that?

The biggest factor is that you have billions of people on Google and a lot of the content is public. And so you have this massive, every day, every hour, voting scheme going on, whereby people are clicking and voicing their opinion on what matters and what doesn’t. In the enterprise, you typically see two things. I often say that one of the main reasons why not just enterprise search, but enterprise applications in general, fail is that they fail to account for what users do. And what users do at work is in many ways different from what they do at home, because work systems force them into certain paradigms. Clicks aren’t necessarily at the same volume, but we often express our opinions or interests in different forms, forms that are harder to capture, things like email or Slack or Microsoft Teams or any of these chat systems that we use at work. We share content there, and older search systems, like the Google Search Appliance, don’t necessarily have access to them, so they can’t register the votes in the same way that one does on the internet.

The dimension of personalization is much more prevalent in public web search than it is in internal searches. What happens, and what is the effect of personalization on search?

Interestingly enough, we’re seeing more and more interest in the enterprise, namely because consumers are also employees. When they show up at work, they say, “Well why can’t I have what I have in my home life, Google and Yelp and Facebook and Amazon?” And the real answer is they ought to be able to. It’s just that the way you personalize at work is different. It’s often based on what role you have, your function within the company. It takes a bit longer to learn what matters to you because there’s not as much evidence, and there are probably not as many people who have similar roles. Only in really large corporations do you see the effect of the masses at that level. But there are other ways, and there’s a lot of movement around machine learning and AI to bring more personal capabilities into the enterprise. It’s something I work on quite a lot, in terms of how we capture those signals in the enterprise, and bring that into an application.

Earlier you mentioned the PageRank algorithm as one of the foundations of Google, but now it’s really a proxy. PageRank isn’t the dominant force inside Google’s search engine. There are dozens of AI and advanced machine learning techniques examining all the content and trying to discern the concepts. Would you agree that, compared to internal search, public search has a hundred or a thousand times more density in the kind of advanced AI, analytics, and machine learning that’s applied?

There are different approaches to how you do things on the web or for consumers than you do in the enterprise. In the enterprise, you often have to do things in more unsupervised ways because you don’t have as many examples to learn from. There are still often hundreds if not thousands of features that one could at least attempt to leverage and learn from. Oftentimes you don’t need all of that, but it’s at least something you can experiment with. One of the cool things with the state of AI and machine learning these days is that systems are getting so much better at taking in lots of features and distilling them into an algorithm that can adapt very quickly to new and changing capabilities.

Could you explain what you mean by features? That’s a technical AI term.

A feature is an attribute of the data, of the content. It’s for instance, the price of something or whether it’s in stock, its color, keywords in the description, the title. It might also be how many people bought that thing in the past, or read that article in the past. All of those are features that can then be learned by a machine.

It’s what we often do as humans, but a computer can do it so much faster and so much more reliably, repeatably, and of course, at scale. We as humans can usually track five, six, maybe ten different features at a time, and synthesize them and say, “Ah, I see a pattern there.” The machine, given enough examples of those features, can do that at massive scale. That’s really what’s transforming the search industry these days. And back to the original question around Trump, that’s why it’s ripe for manipulation by third-party sources. We’ve all seen the questions about Russia. We are aware of the battle going on around bots and spam. There’s essentially a constant battle over what features are chosen and not chosen, and if you think about it, as soon as you make that choice of feature, it then becomes a target for somebody to manipulate in an adversarial way.
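As a toy illustration of a machine learning weights from feature examples, here is a perceptron-style sketch on made-up click data. The item fields, scaling factors, and learning rate are all assumptions for illustration, not a real ranking model:

```python
# A toy learned ranker: turn each item into a feature vector and
# learn weights from click examples with perceptron-style updates.
# All data, features, and hyperparameters here are illustrative.

def features(item):
    # Price, availability, and past purchases, roughly normalized.
    return [item["price"] / 100.0,
            float(item["in_stock"]),
            item["purchases"] / 10.0]

def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def train(examples, epochs=20, lr=0.1):
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for item, clicked in examples:
            x = features(item)
            predicted = 1.0 if predict(weights, x) > 0 else 0.0
            error = clicked - predicted
            # Nudge weights toward features of clicked items,
            # away from features of skipped items.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

in_stock = {"price": 50, "in_stock": True, "purchases": 8}
sold_out = {"price": 50, "in_stock": False, "purchases": 8}
examples = [(in_stock, 1.0), (sold_out, 0.0)]
weights = train(examples)
```

Real systems use far richer models, but the core loop, features in, weights learned from user behavior, is the same, and it is also why a chosen feature becomes a target for adversarial manipulation.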

Let’s get to the bigger picture. Does it really have to be this way? Does search really need to be a black box controlled by a company like Google? One of Google’s responses to Trump’s accusations was that they really work hard on making a great black box that works really well to provide relevant content. What they didn’t say is, maybe we should provide you explicit personalization, allowing you to make some choices about the kind of search results you want. How would things be different if that were possible? And is it possible?

Well yes; it’s software. Most people like to say, “Anything’s possible in software, given enough time and money.” Would consumers actually even want it or care about it? People who are in the know and understand the things that can go wrong with a system that doesn’t have those capabilities would cheer it on. Privacy and security experts would be interested. But would the average everyday consumer?

Would the everyday consumer who is swimming in a sea of information want it? Probably not. If you’re looking up a recipe for barbecued chicken, do you really need to know why the algorithm chose the recipe? You either like the recipes or not, and if you don’t like them, it’s just as easy for you to retarget your queries.

And we should mention that there are other search engines besides Google. Nobody’s forcing anybody to use Google. Microsoft has a perfectly viable search engine, Bing. For those who like privacy, DuckDuckGo has made its whole living these days around being the search engine that doesn’t put you in the bubble, that doesn’t personalize toward you, and that tries to be more clear about what’s going on. In fact, some chunk of DuckDuckGo’s code base is open source. I’m not an advocate for them. I do use them a lot. One of the things I do as a search user is I switch between search engines, just to see how and why they’re biasing results in different ways toward me.

We mentioned two things that could be different. You could have a white box that showed you how the algorithm is working. We get that on Netflix where it says, “Because you watched this, we’re now recommending these movies.” Sometimes in email you get explanations of why a recommendation was made. And the other thing that you mentioned is that we could set our own parameters and change the weights, as in the Money Magazine ‘Best Cities’ example. Are there other things that could be different? What else could be different in terms of the way that search could work?

We spend a lot of time trying to understand users’ intent, and unfortunately human language has a lot of ambiguity. What you’re starting to see with tools like Siri and Google Home and Alexa is a movement toward natural language dialogue. I think we all see some part of the future where we can actually have a conversation with the bot, and the bot is then able to ask clarifying questions. By asking those clarifying questions, it’s able to suss out your meaning. If I say, “Hey Dan, where’s the bank?” I might be looking for cash or a place to land my canoe. So context and all of those kinds of things matter, and search engines and bots are always looking for ways to help users better express their information need, what they’re looking for.

And I think the sliders and features that you’re getting at will be context sensitive. When I’m doing research for an in-depth exposé that I’m writing on some hard-hitting issue, I want to be able to go very deep on a topic, and I want my search engine to unearth all the stuff. Whereas if I want to find a good restaurant for tonight, it’s probably not going to matter. And so I constantly think about how we slide that dial. And to your point, at each position along that dial, you want to be able to say, “I care about what is going into this algorithm so that I can better judge whether I’m being manipulated.”

Another dimension is specificity of search…

And time windows.

Or a range of answers—a narrow range or a wide range.

Right. We do this all the time. If you go into a hotel with a good concierge and ask, “When does the shuttle leave for the airport?” the concierge responds “5:00 pm. Be here 20 minutes beforehand.” It’s a quick answer. When you say, “What are some good restaurants?” that person asks qualifying questions. Do you want steak? Vegan? Mexican?

A good concierge goes the extra mile and does some of the work for you. I think this is really where the future of search is going, making it more like a good concierge. For instance, it might say, “I see that you check in here every Friday. You should get your points. Do you want me to fill out the loyalty program for you?” We want things to take away the drudgery, but we also don’t want to be fooled by them.

Why wouldn’t Google give us more controls? What are the implications of opening up the black box? I don’t think that Google is planning on doing this anytime soon. Why do you think that it’s in their interest not to open it up?

What does opening up mean? Do they give away all the code? Do they give away all of the data? Do they give away both? Do they give people access to all of the inputs that they have? Do they allow people to submit their own content directly and say it’s news? There are a lot of factors there, of course.

But even if they said, “Let’s open up the levers into the algorithm,” it’s still on Google to combine those into an engine that transforms those levers into results. I can see a lot of performance issues with that. Doing that across a billion users would be really hard to scale and customize because of limitations in the hardware that you have.

Putting that aside, Google is in large part an advertising company. Their job is to make money. They need to match ads to results. Their distinct differentiator as a business is the way that they are able to do those kinds of things. So by opening up all of that, they would obviously be inviting competition.

And manipulation, as well.

One of the simplest ways that you can “de-bias” Google is simply by opening your search in a private or incognito browser window, using a VPN or a Tor client, or turning off or deleting your cookies. You’ll notice distinctly different results from Google.

That’s an experiment that everybody can do: do the same search, one in an incognito tab and one in a normal tab, and see the differences.

I wrote an article about how the rise of the Amazon Alexa, Siri, and other bot-oriented search engines is going to kill SEO.

Now all of these things have a knowledge graph behind them. Siri and Alexa are querying a graph that allows them to pick one answer to give you. You don’t see that list of answers anymore, and this is a big change in the way people interact with search.

What are the other big issues at play that people in the search business are struggling with? Both in the public realm and in the internal and corporate realm?

Obviously knowledge graphs and virtual assistants. We just talked a little bit about short, quick fact-based answers, which Siri and Alexa are highly optimized to deliver. Knowing which sources are authoritative becomes even more challenging when you have lots of different versions of the truth, or at least versions of evidence, available to you. We’re constantly on the lookout for ways to distinguish the best way of answering.

In the enterprise, we often struggle with 27 different versions of a document, one’s in your SharePoint, one’s in your filesystem, and 10 of them are in your email. Which is the right one? They might all match, but which one matters at this point in time? Those are really challenging questions and I don’t know that there’s ever a single answer.

The notions of relevance and importance are a never-ending battle. I don’t think SEO or similar techniques ever go away. The landscape just changes. These days what’s really interesting is that a lot of new content types are being unlocked.

The simplest one is images. Image search wasn’t even really doable in any practical way 5 or 10 years ago. These days you’re seeing real production systems where you can say, “Here’s an image. Go find me other images like this.” The same with audio and speech-to-text. I think we’re on the cusp of a lot of really interesting things coming together. Users are really going to like what they see coming out of the research field. And that’s going to help unlock a lot of productivity. And that’s frankly what gets me excited in the search space these days, along with machine learning.

Let’s talk a little bit about that, because your company, Lucidworks, has a platform, the Fusion Platform, that essentially reaches out to large corpuses of data and helps you organize them into smaller collections that might be relevant to a business process. And then, when you’re actually executing that process, you can use search or a knowledge graph or some other way to reach into them and get what’s relevant.

It seems to me that there’s an analogy between that sort of organizing of information into smaller collections that are relevant to a business process, and the kind of work we do in the business intelligence world, where you have ETL first to create an organized corpus of data, but then you have last mile ETL to create a purpose-built set of data for a dashboard. It seems to me that search is changing so that you’re creating these smaller collections that can be then delivered to a business process. Is that one of the things that Lucidworks is really all about?

That’s a pretty good description. There are a lot of worlds colliding right now in the data space. And obviously, I’m biased because I work for a search company, but I think the core capabilities of a search engine help you reframe how you think about data in really interesting ways.

If you’ve ever done SQL or worked with Excel, there’s traditionally been this model of rows and columns of data, and then you filter some amount of that data, and then you say, “I want to sort by time” or “by revenue,” or some column. It’s essentially a linear monolithic approach of “Sort by this column. No, sort by that column.”

A search engine turns that on its head and says, “Let me mash together all these features of the data, and let me rank them according to all of these different weights that I’ve learned over time, that I’ve factored in from the data itself, as well as user behavior and whatever editorial rules and business processes I have.”
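The contrast can be sketched with made-up rows: the SQL/Excel habit sorts on one column at a time, while a search-style ranking blends several weighted signals. The row data, signal names, and weights below are all illustrative:

```python
# Contrasting column sorting with feature-blended ranking.
# All rows, signals, and weights here are hypothetical.

rows = [
    {"title": "Q3 report", "freshness": 0.9, "popularity": 0.2, "revenue": 10},
    {"title": "Q2 report", "freshness": 0.4, "popularity": 0.9, "revenue": 90},
]

# The monolithic approach: sort by one column at a time.
by_revenue = sorted(rows, key=lambda r: r["revenue"], reverse=True)

# The search-style approach: rank by a weighted blend of signals,
# where the weights could come from hand-tuning or from learning.
weights = {"freshness": 0.6, "popularity": 0.4}
ranked = sorted(rows,
                key=lambda r: sum(w * r[k] for k, w in weights.items()),
                reverse=True)
```

The two orderings can disagree, and that disagreement is the point: the blended ranking surfaces what matters by several measures at once rather than by whichever single column you happened to sort on.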

Once you start to reframe how you think about data that way, from a monolithic view to a view that seeks to know what’s important in the data, you can transform the way you think about applications. It’s no longer about showing a user all the data that matches. It’s showing them the data at the top that matters to them, and empowering them to take action on it. From there you attach this into workflows and build out applications.

At Lucidworks, what we really try to do is bring together those two worlds of helping you better understand what’s in your data, but then enable you to take action with it, through your business process, your workflows, and your configuration management. The hygiene of running a business.

What’s a good before-and-after example of an application working according to the row-and-table sort paradigm, and an application working according to the ranking of features and what’s important paradigm?

What we often see is multiple applications working in concert together.

Let’s say we power Acme Company’s retail search such that when you go in and search for widgets, the application says “Buy Widget A, buy Widget B, buy Widget C.” We’ve done all that fancy ranking and machine learning that learns from what users have clicked on and purchased, so we have all of this click data and purchase behavior data. Well guess what? The data that makes for better search is also what makes for better recommendations, personalization, query intent, and query understanding.

Most importantly, it’s also your analytics data, so then the second application that very naturally falls out of a smart ecommerce retail solution is Customer 360, which shows how your users are behaving and how they engage on your site.

Traditionally, in most companies, these are two separate things. Marketing keeps all the data in Omniture or Google Analytics, and then search is over here, and recommendations are over there. For us, that’s all the same, because it’s all the same data.

And so in the Customer 360 app, I can put a dashboard on top of it. But one of the things that we do, and that search really powers, is we can drill into the why behind the dashboards. It’s one thing to look at a dashboard and say, “Sales are up by 3% this quarter.” It’s another thing to then click through that dashboard, run a search and say, “Here are the dominating features and factors as to why sales are up.” And then recommend the logical follow-up, perhaps congratulating the people responsible.

I see. With a traditional dashboard you could see all the inputs and then decide which was most important. You’re saying that with this feature-based approach, you can see and sort which one is most important.

And not only that, but take the next logical step. Most traditional BI tools stop at the dashboard. And it’s up to the human to decide how to interpret it. “There’s a jagged line on that line chart. What should I do about that?” or “I think that’s Dan’s responsibility,” or “That’s Grant’s responsibility. Let me call him.”

If you start to think about how to attach workflows and business processes to that, and show people what’s important, how to take that next best action and then follow through on it, you can build a much richer experience. And then that can feed back into the beginning of the cycle again and feed back into the search engine or into the buying patterns or the supply chain.

So it sounds a lot more like you’re creating a knowledge graph that can be influenced by behavior, influenced by people’s reactions to it, and then start optimizing, both in the first answer it might give and in the transparency of the explanation and the causality involved in all of that.

Exactly. And that’s the beauty of a search engine. And by putting all that traditional BI data that usually lives in a BI warehouse into a search engine, you also give all your users easy access to it. All your users already know how to search, so if you wanted to ask about sales for 2018, just go pop that into the search engine, as opposed to talking to the IT team to run a BI report or create a dashboard.