Two tower models for retrieval of recommendations

ML in recommender systems #4

The first tech stack you should build today for personalized recommendations is retrieval using two tower models[12] and ranking using gradient boosted trees. In this article we will learn about two-tower models and ranking will be covered in a future post. Using two tower models has helped leading tech companies improve the quality of their recommendations, online ads and search.

In previous posts in this series on personalized recommendations, we covered:

  1. foundational models (user-user, item-item) to build a baseline model for personalized recommendations

  2. the system architecture that one needs to build to serve personalized recommendations at a low latency

  3. how to improve the recommendation model from user feedback, and why separating recommendations into two stages, retrieval and ranking allows us to optimize the two types of losses separately. (Type 1: the user was looking for an item that was nowhere on the list. Type 2: the item the user was looking for was ranked lower in the recommended list of items.)

Recap on recommendations, personalization, retrieval and ranking

In case you have not read the previous posts, here's a quick recap of what we are talking about. You are building a content app, a platform to connect users with the content they are looking for. Towards that end you build a home page for a user, and you want this to list content items on it in a way that the items most likely to be wanted by the user are on top. You are trying to provide a personalized experience, as opposed to showing the same items and order to every user. Since there are millions (billions? nice!) of content items to choose from the user, one cannot do easily a thorough ranking of all of them at serving time. That would be too slow. Hence you break the system into two stages: (a) retrieval (a.k.a. candidate-generation) that returns a few hundred items (b) ranking (a.k.a. re-ranking) which is a thorough process of looking at all retrieved items and considering all their features and user features in deciding how to order them in the final list.

Two tower models to compute user and item embeddings

The aim here is to train two neural networks, a user encoder and an item encoder such that items with embeddings very similar to the user are a great recommendation for the user. We will first show the schematic of these two encoders and then briefly go into how these are trained.

Encoding user information

What does the recommender system know about the user?

  1. The history of the items the user liked and time stamps (relative to now)

  2. What the user may have searched in the past

  3. The location of the user if shared

  4. The preferred languages of the user if shared / implicitly detected

  5. Other metadata of the user if shared like genres preferred, or past employment history etc, whatever is relevant to the recommendation platform we are developing.

In a social context like Twitter, one could also potentially consider attributes of other users that this user is following / connected to.

Fig 1: The image above shows a trainable neural network (encoder) that learns to take all the information about the user and make a fixed size vector from it.

Encoding item information

What does the recommender system know about a recommendable item?

  1. The title, description of the item

  2. Other metadata like the language, publisher etc.

  3. Dynamic metadata like the number of views / likes by time.

Note that technically we could also add an item side feature of the users who have interacted with / liked the item. However, this would be superfluous.

Fig 2: The image above shows a trainable neural network in green that can be learned to produce a fixed size embedding of every time. This neural network basically learns to get the essence of all the information of the item that is relevant to recommendation.

Putting it all together

Fig 3: The image above just joins the two 'towers'/encoders of Fig 1 and Fig 2. Thus for a pair of user and item, we can pass them through these towers to get a fixed size user embedding and an item embedding of the same size. Then we compute a dot-product of these two vectors. If this dot-product is high then we say that this item is a good match for the user.

Training the neural networks

These encoders are trained to produce embeddings such that the dot product of a user, item pair that actually interacted is high and the dot product of a non interacting pair is close to 0.

To get "positive examples", i.e. examples of user and item pairs the network should learn to recommend, we can look at what a user has interacted with in the past. You must be thinking "But wait, if I have watched Avatar, we will train the network to recommend Avatar again?". Not really. What we do is that while training the model with the positive example {you, Avatar} we only use the features as they were just before you watched Avatar.

If we only had positive examples to train with the network will just learn to say +1 to every user-item pair and that would not be a good recommendation. We need the networks to learn to demote poor recommendations. There are two ways in which practitioners generate "negative examples":

  1. Randomly pick items not watched by the user (Please read this brilliant summary on negative "candidate sampling")

  2. Pick items that the user has been presented by the app but the user has chosen not to watch (a.k.a. "negative impressions").

Typically the first approach, negative sampling works better in the initial stages. Once the model has already reached a high level of recall, a bit of negative impression based training might further improve the model. Just using impressions might not work. This is why. Imagine that the model learns that you like Sci-fi movies and suggests a few, and you don't watch some of those. Now suppose there is a genre, I don't know let's say romcom that you simply can't stand. While training your embedding if you make the model learn that your second best Sci-fi option which you did not click since you chose the best is as bad as every romcom movie, then this embedding method will be super confused and will not know where to put you. So as a rule of thumb, if you have to do one thing, use option 1, random negative sampling to pick your negative examples, and perhaps rely on the reranking layer to learn from negative impressions.

How to build two tower models

Personally I would recommend building with an open source implementation like Tensorflow Recommenders first before trying to write it yourself. The challenge is not insurmountable but there are a few ways to trip when doing it from scratch.

What led to two-tower models

Matrix factorization lineage

As we saw in the first post, foundational models (user-user, item-item) of recommendations, given an interaction history of users with items, one can use item to item similarity and user to user similarity to retrieve personally relevant items. (Seminal paper on this).

Wouldn't it be great if we somehow had a shared set of categories, where each user painstakingly marks for each category whether they like it or not in a -1 to 1 scale, and for each movie the In seminal work done in connection with the Netflix Prize, the authors learned a way of inferring these magical categories. We will henceforth call them "embeddings". The researchers used (low-rank) matrix-factorization to decompose the user-item interaction matrix into these user and item embeddings.

Fig 4: The image above (sourced from shows how low-rank matrix factorization works. In this example, we are given a user-item matrix of 4 rows and 5 columns with 1s where the user has watched the item. However it learns a 2 dimensional user-embedding and and 2-dimensional item-embedding for each item such that the dot product of the user item embedding is very similar to the given matrix.

While Matrix Factorization worked well for smaller datasets, in real world applications practitioners observed some problems. (a) The distribution of information is very skewed. There are a lot of views for a few videos for instance and very little watch history of millions of items. (b) They also observed that computationally a weighted alternating least squares minimization process is much more efficient and allows a distributed implementation.

The biggest problem though was that Matrix Factorization or its improved variant Weighted Alternating Least Squares is their inability to factor side information, i.e. information other than the user-item interaction like the title, description, categories, language, user location, user searches, etc. This is what two tower models fix.

For those of you who want to get a feel of this lineage, this short (4 hours) course should be well worth your time.

Graph traversal

Another lineage of ideas goes through thinking about the graph of recommendable items. We will cover these in more detail in the post on using Graph Neural Networks for personalized recommendations. However for eager readers, connect the dots between these papers:

  1. The YouTube Video Recommendation System

  2. DeepWalk

  3. GraphSage

Ideas from natural language processing

Finding embeddings such that the dot product correlates to the task at hand are popular in NLP. StarSpace by Chopra et al. is a wonderful paper showing areas where this approach has worked.

This was originally posted on Linkedin.

Disclaimer: These are my personal opinions only. Any assumptions, opinions stated here are mine and not representative of my current or any prior employer(s).