
Old-Search ~ training wheels of Neural-Search
Why it might be time to be semantic-first and not lexical-first in building search systems
With recent advances in machine learning, the state of the art in search has finally moved from rules-based engines to self-learning neural networks. In this note, we show how the previous state of the art in search can serve as the training wheels of the new neural search. We also encourage people to adopt a neural-network-first approach to search going forward!
What is Search?
Search systems accept a query from the user and return a list of documents. Here we are using “documents” in a general sense; in practice, what the system returns is domain-dependent. It could be videos, podcasts, images, etc. The first generation of such systems was built to return documents that literally match as much of the query as possible. This is also called Lexical Search.
Lexical Search
At a high level, lexical search is more likely to match a query to documents that contain the query’s words, and especially to documents that contain those words/terms multiple times. A popular approach here is based on term frequency, namely TF-IDF or its close variant BM25.
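To make this concrete, here is a toy sketch of term-frequency scoring. It is a simplified TF-IDF, not a production BM25 implementation; the corpus, whitespace tokenizer, and weighting are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus; a real system would tokenize and normalize far more carefully.
corpus = {
    "doc1": "the matrix is a 1999 science fiction film",
    "doc2": "a matrix is a rectangular array of numbers in mathematics",
    "doc3": "the matrix resurrections is the fourth film in the matrix series",
}
tokenized = {doc_id: text.split() for doc_id, text in corpus.items()}
N = len(tokenized)

def idf(term):
    # Inverse document frequency: rarer terms carry more weight.
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log((N + 1) / (df + 1)) + 1

def tfidf_score(query, tokens):
    # Sum of tf * idf over query terms: documents that repeat the query's
    # terms score higher, which is the essence of lexical matching.
    tf = Counter(tokens)
    return sum(tf[term] * idf(term) for term in query.split())

query = "matrix film"
ranked = sorted(corpus, key=lambda d: tfidf_score(query, tokenized[d]), reverse=True)
print(ranked)  # ['doc3', 'doc1', 'doc2'] -- doc3 repeats "matrix", so it wins
```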
Some of the limitations of lexical search are:
vocabulary mismatch: the gap between the terms the search system expects and the terms users actually type to express their query.
semantic mismatch: the query might match some words in the document, but those words mean something different in the document’s context. For example, matching Matrix (mathematics) when searching for “The Matrix”!
lack of personalization: the same term-matching rules apply to everyone, so the system can’t recognize what you mean without tons of hacking (e.g., a user searching ‘Ronaldo’ in Brazil probably means the Brazilian soccer player and not the Portuguese soccer player).
lack of awareness of context: a query like “world cup scores” probably means different things depending on which world cup is going on right now (temporal context) and where the query was sent from (local context).
Agreed, every system has inadequacies, and lexical search (what I refer to as “old search” in the title) has overcome some of these weaknesses using heuristics and additional modules like tokenization, spelling correction, a learned ranking module, and probably billions of human-hours. But can we do better? And maybe you are thinking, “what’s the point? why do we need to do better?” To those readers, I point you to the podcast below, where Pieter Abbeel recounts what Geoffrey Hinton said about Ilya Sutskever (co-founder and Chief Scientist at OpenAI): that Ilya would be deeply unsatisfied with the status quo whenever he felt there was a better way.
In this article we will talk about why a neural search system is (a) at least as capable as a lexical search system, (b) unlike lexical search, able to learn automatically from user interactions with the system, and (c) much better than lexical search when augmented with a knowledge module.
What is semantic search?
To quote the SPAR paper, semantic search employs deep neural networks to learn how to represent queries and documents from what users actually click.

Let’s dig into the details. The image above tries to show the differences between Lexical Search and Semantic Search. Both employ a Learning to Rank module on top, so the image focuses on the differences in the retrieval stage. (If you would like to understand the difference between retrieval and ranking, this article delves into it.)
At a high level what is different about semantic search is:
semantic search learns how to convert the query, the user’s history and information, and the context into an embedding
retrieval happens only with these computed embeddings, not with keywords (lexical search only uses keyword-based indexes); see the sketch below
semantic search gets smarter without human intervention: by looking at what users click, it learns how to update its encoder networks and their parameters so it avoids repeating earlier mistakes.
This is explained in more detail in an excellent KDD 2020 tutorial by the LinkedIn Search & Recommendations team. I would highly encourage readers to watch the series and try the Colab notebooks.
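To make the retrieval step above concrete, here is a minimal sketch of embedding-based retrieval. The encoders and document embeddings are placeholders (random vectors), since the point is only the shape of the pipeline: encode, then nearest-neighbour search.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Pretend these came from the document tower of a trained two-tower model;
# here they are just random unit vectors standing in for real embeddings.
doc_ids = ["doc_a", "doc_b", "doc_c"]
doc_embeddings = rng.normal(size=(len(doc_ids), dim))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def encode_query(query_text, user_history, context):
    # Placeholder for the query tower: a real encoder would fuse the query
    # text, the user's history, and context (time, location) into one vector.
    return rng.normal(size=dim)

def retrieve(query_text, user_history=None, context=None, k=2):
    q = encode_query(query_text, user_history, context)
    q /= np.linalg.norm(q)
    scores = doc_embeddings @ q       # cosine similarity via dot product
    top = np.argsort(-scores)[:k]     # exact search; at scale use ANN (e.g. FAISS)
    return [(doc_ids[i], float(scores[i])) for i in top]

print(retrieve("world cup scores"))
```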
Weaknesses of semantic search
Semantic search is not new. Before PageRank, the heart of Google Search, came out, the same lab at Stanford had already worked on semantic search & retrieval. More recently, DSSM and, more generally, the work on word embeddings have breathed new life into self-learning semantic search systems. But there are perceived weaknesses. The main one is that lexical search can build indexes for a very large number of phrases. For instance, “The Matrix Resurrections” could be a phrase that lexical search has in its index, and it will take a lot of learning for semantic search (a.k.a. dense passage retrieval, DPR) to achieve parity with lexical search on such phrases.
How do we solve this?
Some phrases are more important than others. For instance, I chose the name of an upcoming movie, “The Matrix Resurrections”, in the example above, and not a random combination of three words. Language models are known to encode human knowledge, so we could factor semantic search into a step that learns how to extract a knowledge-aware embedding using language models. For instance, Sentence-BERT starts with BERT / RoBERTa and fine-tunes them to produce sentence embeddings. Thus, semantic search is able to learn what something like “The Matrix Resurrections” means!
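As a rough illustration, here is how one might obtain such knowledge-aware embeddings with the sentence-transformers library; the specific checkpoint is my own choice for the example, not one named in this article or in the Sentence-BERT paper.

```python
from sentence_transformers import SentenceTransformer, util

# 'all-MiniLM-L6-v2' is an illustrative public checkpoint, assumed here.
model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = [
    "The Matrix Resurrections",                       # the movie title
    "the fourth film in the Matrix science fiction series",
    "a rectangular array of numbers in mathematics",  # the other sense of "matrix"
]
embeddings = model.encode(phrases, normalize_embeddings=True)

# The title should land closer to the film description than to the maths sense.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```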
More generally knowledge can be domain dependent. For instance, if you are building search for a social network, the users, groups, events in your social network would need to be entities in your knowledge module with learned embeddings. (Ref: Embedding-based Retrieval in Facebook Search)
In Salient Phrase Aware Dense Retrieval (SPAR), the authors take a very different and perhaps more first-principles approach.
We know that semantic search is great at learning from data.
We know that lexical search is great at retrieving results for long, salient phrases.
What if we had access to a good source of search queries? Then we could use lexical search as a teacher and produce a set of training data for semantic search to learn from!
The best part of this system (SPAR) is that it reuses the billions of human-hours we have collectively put into building great lexical search systems. It’s as if we have learned how to transfer the knowledge of lexical search into a semantic search system instead of having to compete with it!
A possible workflow for lexical search teaching semantic search
A drawback often cited for semantic search systems is that we need a lot of data to get started, and so we have to build a lexical search system first. With the method proposed in SPAR, we can use semantic search from the start (a minimal sketch of the workflow follows the steps below).
Given a corpus to search over, use TF-IDF and domain experience to generate a set of queries that are likely to be issued against the corpus.
Using lexical search, generate the top ten results from the corpus for each of these queries.
Provide the {query, result} pairs retrieved in step 2 as positive examples to train a baseline semantic search system.
Optionally, the queries used in step 3 can be sampled from the set of possible queries with probability proportional to their TF-IDF weight, or augmented with some manually chosen, really important queries.
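Here is a minimal sketch of this bootstrapping workflow, assuming the rank_bm25 package as the lexical teacher; the toy corpus, the hand-picked queries, and labeling only the top result (rather than the top ten) are simplifications for brevity.

```python
from rank_bm25 import BM25Okapi  # lexical "teacher"; any BM25 engine would do

corpus = [
    "the matrix is a 1999 science fiction film",
    "a matrix is a rectangular array of numbers in mathematics",
    "the matrix resurrections is the fourth film in the matrix series",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Step 1: queries generated from the corpus (hand-picked here for brevity;
# in practice sampled with probability proportional to TF-IDF weight).
queries = ["matrix resurrections", "rectangular array of numbers"]

# Steps 2-3: let the lexical teacher label its top result as a positive.
# (Top-1 here; the workflow above keeps the top ten per query.)
training_pairs = []
for query in queries:
    scores = bm25.get_scores(query.split())
    best = max(range(len(corpus)), key=lambda i: scores[i])
    training_pairs.append({"query": query, "positive": corpus[best]})

# These pairs would then be fed to the dense retriever's trainer, e.g. as
# (query, positive passage) examples with in-batch negatives.
print(training_pairs)
```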
Conclusion
I believe the age of self-learning semantic search is here. We have the systems and models to build a high-quality search engine with a lean team of engineers using semantic search. We can learn from our collective knowledge of “old search”, and we need not throw away those learnings.
Hope this helps, and come join us in learning Applied ML together.
If you want to discuss/teach/learn about applied ML, please join this Discord community!
References
Index Structures for Information Filtering Under the Vector Space Model - Tak W. Yan and Hector Garcia-Molina (1994)
StarSpace: Embed All Things! (2017)
The PageRank Citation Ranking: Bringing Order to the Web (1998)
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)
Disclaimer: These are the personal opinions of the author. Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s).