ML Design Interview | Design a short-video platform Part 2

Applied ML #16 | Part 2 ML Design of a short-video recommendation platform

In this note, we will discuss potential design considerations for building recommendations for short videos, e.g. TikTok, Instagram Reels, Facebook Reels, and YouTube Shorts.
This is part 2 of the series.

In the first article of the short-video series, “Using online learning in short-video recommendation”, we showed an MVP implementation that:

  1. learns to give every video a fair shot at popularity

  2. maximizes the inventory of fresh good videos so that users can find high quality videos every time they open the app

  3. is a scalable system with serving latency under 50 milliseconds (P99), independent of the number of videos.

However, we identified various avenues of improvement, personalization and creator-side value to name a few. Before we dive into those, let’s take a step back and approach short-video recommender systems the way we should start every ML design problem, i.e. from the objectives.

Fig 1: On a short-video platform like TikTok, videos can be created by anyone and (quickly) rise to millions of watches and likes. The video above was at one point the most liked video ever on TikTok.

Objective

  • Entertain / Inform / Educate users

  • Help creators find audience for their videos, and optionally help creators in monetizing their work. However, monetization is out of scope of this ML design exercise.

Optimizing criteria

  • Users find content they were looking for:

    1. Engagement:

      • Number of video-watches (either completely or for more than 10 seconds).

      • Total watch time: Historically, video platforms have set watch time as their goal (YouTube since 2012). However, on short-video platforms watch time is not very different from the number of videos watched (>10 sec), since all videos on the platform are roughly 10 to 20 seconds long.

      • Session-success-rate: Percentage of watch sessions with more than 100 seconds of watches. This could be considered the equivalent of click-through-rate (CTR) in a user experience that immediately starts playing videos. If the UX is similar to YouTube, where the user is recommended videos and has to explicitly start watching one, then page-level long-click CTR can be a better measure of session success. Here a long-click is a click followed by a substantial watch (e.g. at least 10 seconds) rather than an immediate bounce.

    2. Satisfaction:

      • Number of videos liked last month

      • Number of users who have liked videos last month

      • Like-rate on the platform: Number of thumbs-up / Number of videos watched

      • Dislike-rate: Number of thumbs-down / Number of videos watched

      • At-risk-rate: Number of users who have dislike-rate > 10% in the last week. This could be a leading indicator for churn-rate.

    3. Responsibility:

      • Percentage of videos marked inappropriate / offensive should be less than 0.1%

  • Creators find audience for great content

    1. Creator engagement:

      • Number of creators who have created videos last month

      • Number of videos created last month

    2. Creator quality:

      • Like-rate weighted number of creators who have created videos last month

  • Platform

    1. Retention:

      • Monthly active users

      • Daily active users

      • L30 i.e. the number of days users are active in a month

      • Churn-rate: perhaps defining 20 video watches a month as an active user. If using revenue-based metrics of churn, one could use watch time as a proxy for revenue.

    2. Inventory:

      • Number of highly liked / watched videos. This is especially useful for new users. Perhaps the like rate or watch rate should be measured on the past month so that older videos with declining like-rates are excluded.

      • Number of topics/categories for which the platform has highly liked/watched videos. This can help point to gaps in the inventory, and to incentivize creation of new content.

The categorization between user, creator and platform values could be considered loose. Some of these criteria measure more than one type of value.
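To make the criteria above concrete, here is a minimal sketch of computing two of them, session-success-rate and like-rate, from a toy event log. The field names (`session_id`, `event`, `watch_secs`) and the toy events are illustrative assumptions, not a real logging schema.

```python
from collections import defaultdict

# Toy event log; field names are assumed for illustration.
events = [
    {"session_id": "s1", "event": "watch", "watch_secs": 60},
    {"session_id": "s1", "event": "watch", "watch_secs": 55},
    {"session_id": "s1", "event": "like"},
    {"session_id": "s2", "event": "watch", "watch_secs": 12},
]

def session_success_rate(events, threshold_secs=100):
    """Share of sessions whose total watch time exceeds threshold_secs."""
    watch_per_session = defaultdict(float)
    for e in events:
        if e["event"] == "watch":
            watch_per_session[e["session_id"]] += e["watch_secs"]
    sessions = {e["session_id"] for e in events}
    successes = sum(1 for s in sessions if watch_per_session[s] > threshold_secs)
    return successes / len(sessions)

def like_rate(events):
    """Number of likes divided by number of videos watched."""
    likes = sum(1 for e in events if e["event"] == "like")
    watches = sum(1 for e in events if e["event"] == "watch")
    return likes / watches

print(session_success_rate(events))  # s1 has 115s > 100s, s2 has 12s -> 0.5
print(like_rate(events))             # 1 like over 3 watches
```

In production these would of course be batch or streaming aggregations over logged events, but the definitions stay the same.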

Guardrails

  • Latency:

    1. The first video should start playing in under 500 milliseconds (P99). That implies a budget of roughly 250 ms (P99) for the video recommendation itself.

    2. Subsequent video plays should probably have slightly tighter latency requirements.

  • Availability: Available virtually all the time, falling back to degraded service if needed.

  • Recency: How soon should videos start getting recommended after they have been uploaded?

  • Consistency:

    1. If a video is deleted or marked inappropriate, how soon is it taken down from recommendations?

    2. If a video has been shown to a user, it should not be shown to them in future sessions.

  • Coverage: Healthy inventory across:

    1. languages

    2. countries

    3. topics / genre
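The second consistency guardrail, never re-showing a video, is usually enforced by filtering candidates against a per-user impression history before ranking. Here is a minimal sketch using a plain in-memory set; a production system would more likely use a per-user Bloom filter or a TTL'd key-value store to bound memory. All names are illustrative.

```python
class ImpressionFilter:
    """Tracks which videos each user has already been shown."""

    def __init__(self):
        self._seen = {}  # user_id -> set of video_ids already shown

    def record(self, user_id, video_ids):
        """Log that these videos were shown to the user."""
        self._seen.setdefault(user_id, set()).update(video_ids)

    def filter(self, user_id, candidates):
        """Drop candidates the user has already seen, preserving order."""
        seen = self._seen.get(user_id, set())
        return [v for v in candidates if v not in seen]

f = ImpressionFilter()
f.record("u1", ["v1", "v2"])
print(f.filter("u1", ["v1", "v3", "v2", "v4"]))  # ['v3', 'v4']
```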

Estimation

In this section we will outline some hypothetical assumptions that might help us make tradeoffs between different model types.

  • Number of monthly users on the platform: 100 million

  • Number of active creators: 5 million

  • Number of videos created per day : 10 million

  • Ideally there should be a discussion of the monthly budget available to run the platform. This should inform decisions like freshness and caching of recommendations.
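A quick back-of-envelope sketch can turn the numbers above into serving load. The DAU/MAU ratio, videos per user per day, page size, and peak multiplier below are additional assumptions for illustration, not figures from the article.

```python
MAU = 100_000_000             # monthly users, from the estimates above
DAU_RATIO = 0.3               # assumed DAU/MAU ratio
VIDEOS_PER_USER_PER_DAY = 30  # assumed
PAGE_SIZE = 10                # recommendations returned per request (assumed)
SECONDS_PER_DAY = 86_400

dau = MAU * DAU_RATIO
# Each page of recommendations serves PAGE_SIZE videos.
requests_per_day = dau * VIDEOS_PER_USER_PER_DAY / PAGE_SIZE
avg_qps = requests_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * 3        # assume peak traffic ~3x the daily average

print(f"avg QPS ~{avg_qps:,.0f}, peak QPS ~{peak_qps:,.0f}")
```

Even rough numbers like these (on the order of a few thousand QPS at peak) help size the serving fleet and decide how aggressively to cache recommendations.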

API

The service provides:

syntax = "proto3";

message RecsRequest {
  string userid = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  int64 request_id = 4; 
}

message RecsResponse {
  message Result {
    string video_url = 1;   // to show on app
    string video_title = 2; // to show on app
    string video_stats = 3; // to show on app
  }
  int64 request_id = 1;        // needed to join with FE logs
  repeated Result results = 2; // ranked recommendations
}

service RecsService {  
  rpc GetVideos(RecsRequest) returns (RecsResponse);
}

Feature engineering / User expectations

To motivate the features we might think of using in the scoring model, we should look at expectations of different stakeholders in the system.

  1. New users expect to see interesting videos without entering an interest profile.

  2. Returning users expect to see high quality new videos every time they open the app.

  3. Creators expect their videos to be shown to users who are interested in such content.

  4. New amazing creations should be shown to users even if the creator has not created popular videos in the past.

  5. Showing users videos they could mimic could encourage video creation.

Available data

Logged data of user events like

syntax = "proto3";

import "google/protobuf/timestamp.proto";

message UserVideoEvent {
  enum EventType {
    Unspecified = 0; // proto3 requires the first enum value to be zero
    Watch = 1;
    Like = 2;
    Share = 3;
    Comment = 4;
  }
  EventType evt_type = 1;
  string userid = 2;
  string video_id = 3; // the video the event refers to
  google.protobuf.Timestamp evt_time = 4;
}

Static video features

syntax = "proto3";
message Video {
  string video_id = 1;
  string creator_id = 2;
  string title = 3;
  string thumbnail_url = 4;
  string content_url = 5;
}

Dynamic video features

syntax = "proto3";
message VideoStats {     // named separately from the static Video message
  string video_id = 1;
  int32 watch_count = 2; // number of times watched >= 10s
  int32 rec_count = 3;   // number of times recommended
}
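These two counters already support a simple quality signal: the watch-through rate, smoothed with a prior so that brand-new videos with few recommendations are not judged on tiny samples (which also serves the "fair shot" goal from part 1). The prior strengths below are illustrative assumptions.

```python
PRIOR_WATCHES = 5   # pseudo-watches (assumed prior strength)
PRIOR_RECS = 50     # pseudo-recommendations (assumed prior strength)

def smoothed_watch_rate(watch_count: int, rec_count: int) -> float:
    """Beta-smoothed estimate of P(watch >= 10s | recommended)."""
    return (watch_count + PRIOR_WATCHES) / (rec_count + PRIOR_RECS)

# A brand-new video scores near the prior rate (0.1), not 0 or 1:
print(smoothed_watch_rate(0, 0))        # 0.1
# A well-tested video's score is dominated by its observed counts:
print(smoothed_watch_rate(900, 1000))   # ~0.86
```

The same smoothing applies to like-rate or any other count-ratio feature derived from the dynamic stats.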

Partner teams, infrastructure and collaborations

For this design, we might assume the availability of:

  • product engineering support

  • design

  • UX research

  • a taxonomy of categories of videos

  • a set of labelers who could categorize new videos based on this taxonomy. However the bandwidth of this team might only scale to 1% of the videos.

  • a video transcription service

  • a team that collaborates with the labeling team to build an ML driven video categorization service, such that virtually all videos that are eligible to be recommended come with categories auto-labeled.

  • ML infrastructure for training two-tower models, e.g. TensorFlow Recommenders.

  • Embedding-based retrieval, e.g. Vertex Matching Engine (based on ScaNN, an algorithm that handles many millions of queries per second at Google).

  • ML infrastructure to get predictions of custom-trained models (like Vertex AI or Sagemaker)

  • Logging of user interactions on app.
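The role embedding-based retrieval plays here can be sketched in a few lines: score every video embedding (produced by the video tower of a two-tower model) against the user embedding and return the top-k. A brute-force dot-product scan stands in for the approximate nearest-neighbor index that ScaNN / Matching Engine would provide at scale; the toy embeddings are made up.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# video_id -> embedding, as produced by the video tower (toy values)
video_embeddings = {
    "v1": [0.9, 0.1],
    "v2": [0.1, 0.9],
    "v3": [0.7, 0.7],
}

def retrieve(user_embedding, k=2):
    """Return the top-k video ids by inner product with the user embedding."""
    scored = sorted(video_embeddings.items(),
                    key=lambda kv: dot(user_embedding, kv[1]),
                    reverse=True)
    return [vid for vid, _ in scored[:k]]

print(retrieve([1.0, 0.0]))  # ['v1', 'v3']
```

The point of an ANN index is that this scan would be far too slow over hundreds of millions of videos; the interface, however, is exactly this: a query vector in, the top-k nearest item ids out.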

Coming next - Detailed ML design

Now that we have a framework for measuring quality, in subsequent articles we will build on improvements like:

  1. Separate videos into a bucket-flow per category, and learn the categories a user is interested in.

  2. What if categories overlap? What if videos were wrongly categorized? How can we use our manual categorizing team efficiently? … i.e. how to maximize user relevance via embeddings.

  3. Filtering, diversity, novelty, serendipity.

  4. Encoding creator side value in the ranking function.

References and further reading

  1. Using online learning in short-video recommendations (Platform Value)

  2. Giving new creators a fair shot through recommender systems (Creator Value)

  3. Two-tower models for recommendations (Efficient self-learning retrieval)

  4. Tensorflow Recommenders

  5. Vertex Matching Engine for fast retrieval of vectors given a query vector.

  6. Efficient Vector Similarity Search

For other ML Design interview discussions …

This article was written in an ML design interview format. This format can also be helpful when writing an ML design doc for an internal project.

If you want to discuss more ML design questions, please join our Discord community here!

Disclaimer: These are the personal opinions of the author. Any assumptions, opinions stated here are theirs and not representative of their current or any prior employer(s).

(photo credits [1] [2])