How to reduce the cost of ranking with knowledge distillation
Using knowledge distillation can make your early ranking model more aligned with the final ranker + Sample Code + Video walkthrough
There is an eagerness in industry to apply AI, but the costs can quickly become prohibitive. In this post, we will show how to apply ML in a cost-effective manner in the domain of ranking.
We have been setting the stage for this in the previous two posts “What is an Early Ranker” and “System design of an Early Ranker”. To recap, an early ranker is used to filter down from, say, 2,000 candidates to about 200, which are then ranked by the compute-intensive final ranker. The early ranker typically uses about 5% of the compute cost per item compared to the final ranker.
In this post we will talk about how to train an early ranker to be a good approximation of the final ranker. This helps in two ways:
It will improve topline metrics, since fewer of the items selected by the Early Ranker get filtered out by the Final Ranker.
It can further reduce costs if you tune down the number of items selected by the Early Ranker.
Sample code
Please take a look at the original GitHub code here.
Key idea: Teach Early Ranker to learn from Final Ranker (not just user actions)
The goodness of an early ranker is how good a proxy it is for the final ranker. A common way to frame this in machine learning is Knowledge Distillation, where a student model is trained to learn from a teacher model.
By “teacher model” we are referring to the final ranker. It is a high-accuracy model, but it is too costly to run on a large number of items.
By “student model” we are referring to the early ranker.
The key idea here: when you are evaluating (and also training) an early ranker, don’t just evaluate how accurately it predicts user actions on the recommendations; also evaluate how closely its predictions match the teacher’s.
Current state: Early and Final Ranker are both trained from user actions
Usually we train the early ranker with the same inputs and training data as the final ranker, i.e. to predict users’ actions.
For instance, the labels in a video recommendation feed might be:
Did the user watch the video
Did the user like the video
Did the user share the video, etc
These are binary labels, so you can use binary cross-entropy loss to train the models.
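To make this concrete, here is a minimal sketch of that setup, assuming a PyTorch multi-task model that outputs one logit per task; the task names and dictionary layout are illustrative and not the code from the linked repo.

```python
import torch
import torch.nn.functional as F

# Hypothetical task names matching the user-action labels above.
TASKS = ["watch", "like", "share"]

def user_action_loss(logits: dict, labels: dict) -> torch.Tensor:
    """Sum of per-task binary cross-entropy losses against user actions."""
    loss = 0.0
    for task in TASKS:
        loss = loss + F.binary_cross_entropy_with_logits(
            logits[task], labels[task].float()
        )
    return loss
```

Today, both the early ranker and the final ranker are typically trained with a loss of this shape.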
Problem: Alignment
As illustrated in the image above, even in a well-trained system you might observe an alignment problem: the difference between cross-entropy(user labels, final ranker) and cross-entropy(user labels, early ranker) can be less than a quarter of cross-entropy(final ranker, early ranker).
In other words, the early ranker learns to predict user actions almost as well as the final ranker, but it does not predict the final ranker well. Hence many of the candidates it picks are rejected by the final ranker.
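A quick way to check for this symptom in your own system is to compare the three cross-entropies directly on logged data. Below is a rough diagnostic sketch for a single task such as watch; the tensor names and the helper itself are hypothetical.

```python
import torch
import torch.nn.functional as F

def alignment_report(user_labels: torch.Tensor,
                     final_probs: torch.Tensor,
                     early_probs: torch.Tensor) -> dict:
    """Compare fit to user labels vs. agreement between the two rankers."""
    ce_user_final = F.binary_cross_entropy(final_probs, user_labels.float())
    ce_user_early = F.binary_cross_entropy(early_probs, user_labels.float())
    # Cross-entropy of the early ranker against the final ranker's soft predictions.
    ce_final_early = F.binary_cross_entropy(early_probs, final_probs)
    gap = ce_user_early - ce_user_final
    return {
        "ce(user, final)": ce_user_final.item(),
        "ce(user, early)": ce_user_early.item(),
        "ce(final, early)": ce_final_early.item(),
        # The symptom described above: nearly as accurate on user labels,
        # yet far from the final ranker.
        "alignment_problem": bool(gap < 0.25 * ce_final_early),
    }
```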
Solution: “Knowledge distillation” to train Early Ranker from Final Ranker
The student model should be trained to stay close to the teacher model’s predictions.
For this you will need to:
log the final ranker’s scores alongside the features for each item recommended to the user. See this article to understand why and how to set up feature logging alongside every recommendation.
add training labels that capture the teacher’s predictions, so the student’s closeness to the teacher can be measured, such as:
teacher_probability_watch
teacher_probability_like
teacher_probability_share
add per-example training weights to ensure that we only consider training examples where the teacher probabilities were logged properly (an illustrative logged example follows this list).
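For illustration, a single logged training example might look roughly like the dictionary below. The teacher_probability_* names come from the list above; everything else (feature contents, the teacher_label_weight field) is an assumed schema, not the actual logging format.

```python
# One hypothetical logged training example for the student (early ranker).
logged_example = {
    # Features as they were at serving time (see the feature-logging article above).
    "features": {"user_embedding_id": 123, "video_embedding_id": 456},
    # User-action labels.
    "watch": 1,
    "like": 0,
    "share": 0,
    # Final ranker (teacher) predictions logged at serving time.
    "teacher_probability_watch": 0.83,
    "teacher_probability_like": 0.12,
    "teacher_probability_share": 0.04,
    # 1.0 if the teacher probabilities were logged properly, else 0.0.
    "teacher_label_weight": 1.0,
}
```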
Hence, offline metrics for the student model should include (see the loss sketch after this list):
binary cross-entropy against user labels (e.g. watch, like, share)
binary cross-entropy against teacher labels (e.g. teacher_probability_watch, teacher_probability_like, teacher_probability_share)
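Putting these pieces together, the student’s per-task training objective can blend the two cross-entropies. The sketch below is one reasonable way to do it, assuming the logged teacher probabilities and weights described above; the alpha mixing weight and the function signature are assumptions rather than the repo’s implementation.

```python
import torch
import torch.nn.functional as F

def student_task_loss(student_logits: torch.Tensor,   # student's logits for one task
                      user_label: torch.Tensor,       # binary user action, e.g. watch
                      teacher_prob: torch.Tensor,     # e.g. teacher_probability_watch
                      teacher_weight: torch.Tensor,   # 1.0 where teacher was logged, else 0.0
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend BCE against user actions with BCE against teacher soft labels."""
    user_loss = F.binary_cross_entropy_with_logits(student_logits, user_label.float())
    # Per-example distillation loss, masked so only properly logged examples count.
    per_example = F.binary_cross_entropy_with_logits(
        student_logits, teacher_prob, reduction="none"
    )
    teacher_loss = (per_example * teacher_weight).sum() / teacher_weight.sum().clamp(min=1.0)
    return (1.0 - alpha) * user_loss + alpha * teacher_loss
```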
Shared tasks vs Auxiliary tasks
You should try two approaches to knowledge distillation from the final ranker to the early ranker; both are sketched in code at the end of this section.
[Aux] Using teacher labels as auxiliary tasks
Pros of Aux approach:
Since the user-action and teacher predictions come from separate task heads, this approach is a little safer in case something goes wrong with the teacher logits.
[Shared] Using the same task prediction for both the user label and the teacher label
Pros of Shared approach:
This will lead to higher alignment with the final ranker.
There is no increase in compute cost since the number of task heads has not increased.
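As a rough sketch of the two layouts (module and task names are illustrative, not the repo’s code): in the Aux approach the student adds separate heads trained against the teacher labels, while in the Shared approach each existing task head is trained against both the user label and the teacher probability.

```python
import torch
import torch.nn as nn

TASKS = ("watch", "like", "share")

class AuxHeadsStudent(nn.Module):
    """[Aux] Separate heads: one set for user labels, one set for teacher labels."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.user_heads = nn.ModuleDict({t: nn.Linear(hidden_dim, 1) for t in TASKS})
        self.teacher_heads = nn.ModuleDict({t: nn.Linear(hidden_dim, 1) for t in TASKS})

    def forward(self, x: torch.Tensor) -> dict:
        out = {t: self.user_heads[t](x) for t in TASKS}
        out.update({f"teacher_{t}": self.teacher_heads[t](x) for t in TASKS})
        return out

class SharedHeadsStudent(nn.Module):
    """[Shared] One head per task; its output is trained against both the user
    label and the teacher probability, so the number of heads does not grow."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Linear(hidden_dim, 1) for t in TASKS})

    def forward(self, x: torch.Tensor) -> dict:
        return {t: self.heads[t](x) for t in TASKS}
```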
References
For a more in-depth treatment of knowledge distillation, please read “Knowledge Distillation in Recommendation Systems” by Rahul.
Disclaimer: These are the personal opinions of the author(s). Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s). Apart from publicly available information, any other information here is not claimed to refer to any company including ones the author(s) may have worked in or been associated with.