Personalized short-video recommender systems

To recap the previous two articles in this short-video recommendation series: part 1, "Using online learning in short-video recommendations", presented a baseline solution that delivered fresh short-video recommendations to potentially billions of users in under 50 milliseconds (P99), and part 2, "ML Design Interview | Design a short-video platform Part 2", approached the problem holistically, describing what success means for all three stakeholders: video watchers, video creators, and the video platform.

In this third and final part of the series, we propose an ML model that caters to user value, creator value, and platform value.

Platform Value - Multi-category bucket-flows

Fig1: Bucket videos based on the uncertainty of the goodness of the videos. Don’t just select from the best bucket but from all buckets with some probability.

To simulate online learning at scale, in part 1 we separated videos into buckets based on the uncertainty we have in estimating their goodness (watch-rate). See Fig 1 above. To address some of the shortcomings mentioned in "Further Improvements", such as the lack of personalization and the fact that different video categories may have different watch-rate distributions, we could use multiple bucket-sets. See Fig 2 below.

Fig 2: For each category of videos, maintain a bucketed stack like Fig 1. When a user comes to the app, look up the user's affinity to each category of videos. Note that the affinity to each category is not a fixed number: it is an average with a confidence interval around it. Pick (say) 3 categories for the user using Upper Confidence Bounds or Thompson Sampling (as described in part 1) and repeat the video-selection-from-buckets process (also described in part 1).

We have essentially replicated Fig 1 for each category in our taxonomy. The only added step is that while serving recommendations to a user, we need to select the top K (say 3) categories for that user, and there we can use online learning (UCB / Thompson Sampling) to learn which categories the user is interested in. (Ref: DeepMind video on online learning.)
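The per-user category selection step can be sketched with Thompson Sampling. This is a minimal illustration, not the production design: the per-user watch/skip counts, the Beta(watches + 1, skips + 1) posterior, and the function names are all assumptions made for the example.

```python
import random

def pick_top_categories(category_stats, k=3, seed=None):
    """Thompson Sampling over categories (illustrative sketch).

    category_stats: dict mapping category -> (watches, skips) observed
    for this user. Each category's watch-rate is modeled with a
    Beta(watches + 1, skips + 1) posterior; we draw one sample per
    category and keep the k highest draws, so uncertain categories
    still get explored occasionally.
    """
    rng = random.Random(seed)
    draws = {
        cat: rng.betavariate(watches + 1, skips + 1)
        for cat, (watches, skips) in category_stats.items()
    }
    return sorted(draws, key=draws.get, reverse=True)[:k]

# Hypothetical per-user counts for three categories.
stats = {"cooking": (40, 60), "ml": (5, 1), "sports": (2, 50)}
picked = pick_top_categories(stats, k=2, seed=7)
```

Because we sample from the posterior rather than taking the mean, a category with few observations (like "ml" above) can still win a slot, which is exactly the explore/exploit behavior we want.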

Can we do better?

  • What if videos don’t neatly fall into a category?

  • What if the initial categorization is not accurate?

  • What if the concepts / categories users are interested in change over time?

These questions are not new; they are what led recsys researchers to consider embeddings (Netflix Prize) and their variational counterparts (to encode uncertainty). In the following section we recap a model that maximizes user relevance.

User value - Maximizing relevance for the user

Fig 3: System design of recommender systems where recommendable items are embedded according to a two-tower model, then items are filtered to remove inappropriate content and videos the user has already watched. Next they are scored to estimate metrics like the probability of watching, liking, commenting, and sharing. Finally they are ranked (i.e. 'ordered' above) to maximize the value of the ordered list of videos to the user and to minimize inappropriate sequences. Image above is from a wonderful talk by Even Oldridge.

The current state of the art in recommender systems is the four-stage design as shown above. It optimizes for user-relevance and is constantly learning from interactions. Two key modeling components are:

  1. a two-tower (dual-encoder) model trained to retrieve a set of about 100 candidate videos for the given user

    Fig 4: A Two tower / dual-encoder model could be trained to learn neural networks to encode user info into a user “embedding”, and similarly to convert video info into a video embedding.
  2. a scoring model (historically known as “ranking model”) to estimate the probability of the user liking/watching/sharing the video.

During retrieval, items are ordered by Prob-Watch(user, video), which is estimated by the dot-product of the user interest embedding and the video embedding.
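The dot-product retrieval step can be sketched in a few lines of numpy. This is a brute-force toy version, assuming hypothetical embedding shapes; a real system would use an approximate-nearest-neighbour index (e.g. FAISS or ScaNN) instead of scanning every video.

```python
import numpy as np

def retrieve_top_k(user_embed, video_embeds, k=100):
    """Rank candidate videos by dot-product with the user embedding.

    user_embed: (d,) user interest vector from the user tower.
    video_embeds: (n, d) matrix of video vectors from the video tower.
    Returns the indices and scores of the k highest-scoring videos.
    """
    scores = video_embeds @ user_embed          # (n,) Prob-Watch proxies
    top = np.argsort(scores)[::-1][:k]          # highest dot-products first
    return top, scores[top]

# Toy 2-dimensional embeddings for illustration.
user = np.array([1.0, 0.0])
videos = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]])
idx, s = retrieve_top_k(user, videos, k=2)
```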

Fig 5: The basic principle in the model above is to use all the raw data of user and the video being scored in a neural network scoring model. Instead of making separate neural networks for each estimation, using a common neural network base helps to regularize the model and improve out of sample performance especially for models which have sparse training data. Reference: Multi-task ranking system (Youtube)

As shown above, during scoring (a.k.a. ranking) Prob-Watch and other metrics can be estimated as the output of neural networks.
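The shared-base, multi-head idea from Fig 5 can be sketched as a toy forward pass. This is a deliberately tiny numpy illustration with made-up shapes and randomly initialized weights, not the YouTube architecture: one common base layer feeds several task heads, so sparse tasks (e.g. shares) borrow statistical strength from dense ones (e.g. watches).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_task_score(features, shared_w, heads):
    """Shared-base multi-head scorer (toy sketch of Fig 5).

    features: (d,) concatenated raw user + video features.
    shared_w: (h, d) weights of the common base layer shared by all tasks.
    heads: dict task -> (h,) head weights; each head emits a probability
    estimate for one engagement metric.
    """
    hidden = np.maximum(shared_w @ features, 0.0)   # shared ReLU base
    return {task: float(sigmoid(w @ hidden)) for task, w in heads.items()}

rng = np.random.default_rng(0)
feats = rng.normal(size=8)                 # hypothetical raw features
shared = rng.normal(size=(4, 8))           # shared base weights
heads = {"watch": rng.normal(size=4), "like": rng.normal(size=4)}
scores = multi_task_score(feats, shared, heads)
```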

Note that we can choose whether to use the creator_id as a feature. If we do, the ranking model can reward creators with past successes with more views. If we intentionally leave it out, videos from all creators have a similar chance at popularity on the platform.

Fig 6: The above video shows the effects of personalization on a short-video platform (Youtube Shorts). The user is interested in machine learning and MLOps. Hence the first video they are recommended is on ML even though the number of likes of the video is just 5.

The above image shows an example of the successful application of personalization on short video platforms. The platform was able to show a niche video (just 5 likes) to a user since it strongly matched the user’s interests.

Personalized embedding + Bucket flows

Fig 7: A better way to do multi-category buckets (Fig 2), is to represent videos and users as embeddings (Fig 4 about training them). While training embeddings we learn not only the embeddings but also the variance of the embeddings (ref Auto-encoding Variational Bayes). The video is placed in uncertainty buckets based on this estimated uncertainty.

In Fig 7 above, we improve upon Fig 2 using embedding training (Fig 4). During training, we learn not only the embeddings but also the variance of the embeddings. On the video side, these uncertainty estimates let us place videos into the different buckets.

While serving, we could generate (say) 3 query vectors by sampling d[i] ~ U(0, 1) and setting vec[i] = user-embed + d[i] * stdev-user-embed, then query the videos that have the highest dot-product with these query vectors.
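The query-vector sampling just described can be sketched directly; the function name and shapes here are assumptions for illustration.

```python
import numpy as np

def sample_query_vectors(user_embed, user_embed_std, n=3, seed=None):
    """Generate n perturbed query vectors from a variational user embedding.

    As described above: draw d[i] ~ U(0, 1) and set
    vec[i] = user_embed + d[i] * user_embed_std, so a more uncertain
    user embedding yields more spread-out queries (and hence more
    exploratory retrieval).
    """
    rng = np.random.default_rng(seed)
    d = rng.uniform(0.0, 1.0, size=(n, 1))       # one scalar draw per query
    return user_embed + d * user_embed_std       # (n, dim) query matrix

# Toy example: zero-mean embedding with per-dimension stdev 0.5.
queries = sample_query_vectors(np.zeros(4), np.ones(4) * 0.5, n=3, seed=1)
```

Each row of `queries` would then be fed to the retrieval step as if it were the user embedding, and the results merged.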

If the variance of the user-embedding is high, we could also give a higher weight to diversity boosting.

Variational / Multi-modal user embeddings

There are two approaches commonly used to express in embeddings the fact that users have multiple interests. We talked about the variational approach in the previous section where we learn not just the user embedding but also the uncertainty along each dimension of the embedding.

Another approach is to learn multi-modal user embeddings, i.e. a small set of user embeddings that capture the user's different interests. For instance, if YouTube sees my account watch Peppa Pig videos, machine learning videos, and NBA highlights, it could learn to represent me with three embeddings. (Ref: PinnerSage for more on this.)
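One common way to score against such multi-modal embeddings is to take the best match over the user's modes, so a niche video only needs to match one interest. This is a hedged sketch of the idea (the max-over-modes rule and names are assumptions for the example, not necessarily what PinnerSage does in production):

```python
import numpy as np

def score_multimodal(user_modes, video_embed):
    """Score a video against a multi-modal user representation.

    user_modes: (m, d) matrix, one row per learned interest cluster
    (e.g. "Peppa Pig", "machine learning", "NBA"). The video is scored
    against each mode and the best match wins, so matching any single
    interest is enough to surface the video.
    """
    return float(np.max(user_modes @ video_embed))

# Two orthogonal toy interest modes; the video matches the second one.
modes = np.array([[1.0, 0.0], [0.0, 1.0]])
score = score_multimodal(modes, np.array([0.1, 0.9]))
```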

How to not rabbit-hole into a niche interest

This WSJ video describes the user experience of TikTok, which seems to learn quickly what the user wants to watch. However, it then seems to rabbit-hole, going further and further in that direction. How can we avoid that? Such narrow watching intent is clearly not sustainable, either for the user or for the platform.

As mentioned in the "Optimizing for the long term" section here, periodically nudging the user back towards broader interests, and discovering the user's multiple interests, will help. While ranking videos, we could also factor in the inventory size around each video (the platform won't have much inventory around very niche interests).

Now let’s come to the last and most crucial part of the short-video ecosystem, the creators.

Creator value - Maximizing # of active creators

Creators are incredibly valuable to a short-video platform, where a steady stream of high quality videos on a wide enough set of topics is what is needed to keep an engaged audience.

To maximize creator value, recommend videos to users in a way that leads to the creation of more videos. The platform can also sponsor new creators within a unified ranking model.

The levers the platform has to incentivize video creation are:

  1. Boost videos of new creators to users who are their target audience. Seeing their videos appreciated might incentivize more creation.

  2. Show videos to potential creators who might use these to mimic and create more.

P(creator mimics v) can be modeled by a trainable neural network with both user and video inputs, similar to the scoring models in Fig 5 above.

Estimating the utility of a recommended view to a creator

The utility of a view to a creator derives from:

  • the probability of the user watching/liking the video

  • if they do like it then

    • the incremental utility of the watch/like to the creator’s interest in creating another video (due to the positive feedback provided by the platform).


Creator-Utility-Uplift(user u, video v of creator c) =
P-create(c | u watched v) * P-watch(u, v) + P-create(c | u skipped v) * P-skip(u, v) - P-create(c)

The expression above measures how much the probability that the creator makes another video increases due to the recommendation.
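The uplift expression can be computed directly. A minimal sketch, with one simplifying assumption labeled in the code: we treat skip as the complement of watch, i.e. P-skip = 1 - P-watch.

```python
def creator_utility_uplift(p_watch, p_create_if_watched,
                           p_create_if_skipped, p_create_baseline):
    """Expected uplift in the creator's probability of posting again.

    Implements the expression above, assuming skip is the complement
    of watch (p_skip = 1 - p_watch): the watched/skipped outcomes are
    weighted by their probabilities and compared with the creator's
    baseline creation probability.
    """
    p_skip = 1.0 - p_watch
    expected = (p_create_if_watched * p_watch
                + p_create_if_skipped * p_skip)
    return expected - p_create_baseline

# Hypothetical numbers: a likely watch that meaningfully encourages
# the creator relative to their 0.3 baseline.
uplift = creator_utility_uplift(0.6, 0.5, 0.2, 0.3)
```

With these toy inputs the uplift is 0.5 * 0.6 + 0.2 * 0.4 - 0.3 = 0.08, i.e. this recommendation raises the creator's expected creation probability by 8 points.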

Similarity with ads

Note how the creator-value model is similar to online-ads modeling. For instance, the utility model of an online ad is:

Utility to ad-bidder(bidder b, user u, ad a) = P-click(u, a) * Incremental-utility(b, u)

This is also similar to “people you may know” modeling since that involves two sided value as well.

References and further reading

  1. Using online learning in short-video recommendations (Platform Value + MVP)

  2. ML Design Interview | Design a short-video recommender system (Start with the user problems)

  3. Upper confidence bounds

  4. Thompson Sampling

  5. [Video] Deepmind tutorial on online learning

  6. [Video] Four stage recommender system design (Even Oldridge)

  7. Variational auto-encoders (modeling uncertainty of embeddings)

  8. Two-tower models for recommendations (Efficient self-learning retrieval)

  9. StarSpace - Embedding bipartite relationships using a dual encoder model

  10. Deep Neural Networks for Youtube Recommendations

  11. Recommending what video to watch next (Youtube)

  12. PinnerSage: Multi-modal user embedding framework for recommendations at Pinterest

  13. Incorporating product mgmt into ML


Another option would be to have multiple stages of ranking. For instance, the retrieval stage could start with say 2k results, followed by 4 stages of ranking and pruning that finally result in about 30 videos or so. This approach was used earlier in massive recommender systems like Facebook's, since it is computationally efficient. However, the resulting set of results is not optimal as a set, since we have been pruning results individually based on quality. Hence companies are moving to a two-stage design: retrieval with say 2k results, then ranking (multi-dimensional scoring), then diversity boosting. Read more here.