Recommendations using graph neural networks

personalized recommendations #8

In this post we will show how to build personalized recommendations using a graph based approach.

  • Most companies use a “two-tower model” today for recommendations.

  • A two-tower model can be seen as a simplification of a Graph Neural Network (GNN).

  • Once you start looking at your personalization engine as a graph, it opens up possibilities of improving the recommender system further.

  • It will allow you to understand the nature of the different entities in your recommendation domain and how these relate to each other.

Outline:

  • Section 1: we will describe a basic personalized recommendation system using graphs, as utilized in companies like Youtube, Pinterest and Uber.

  • Section 2: we will explain how the two-tower model, discussed in our previous post, compares to GNNs.

  • Section 3: will delve deeper into the theory of graph neural networks, showing how message passing between nodes in a graph and Graph convolutions has led to advances as impactful as AlphaFold (Explanatory blog, Video).

  • Section 4: we will walk-through an example of an event-based approach to information flow in a graph that was developed by researchers at Twitter.

Before we start a small refresher of what is a “graph”. A graph refers to a set of items, the “nodes” in the graph, some of which are connected to each other via “edges”. These edges are the structure that makes the graph special, makes it different from just having a set of items to recommend.

To see an example of where edges are relevant to recommendation, think of Twitter. Let’s say, you , a Twitter user, would follow people, and hence add edges to Twitter’s graph. Twitter is using these “edges” to show you relevant tweets.

1. Using graphs to build personalized recs
(item-item similarity / label propagation )

The earliest successful approaches of personalized recommender systems involved seeing which items are similar to the ones the user has already liked (Refer to post #1 to dig into item-item similarity based personalized recs). The main idea in this approach is to …

… make a graph of the items we can recommend. Add edges between them based on what users considered similar. Find items in this graph close to the items the user has liked already.

To take a concrete example, in the paper The YouTube Video Recommendation System (Davidson et al. 2010), the Youtube team looks at videos as nodes in a graph. Every time a user watches a video v1 and v2 in a 24 hour period, they add an edge from node(v1) to node(v2) in the graph. If the edge is already there, then they increase its weight.

Fig 1: You could look at the set of items you want to recommend as nodes in a graph. You could then figure out some automatic approach to inferring edges and weights from the data you have. If we set up edges based on user activity, the recommendations will be based on semantic similarity. If we set up edges based on similar metadata like similarity of title or categories, the recommendations would capture something else.

When it comes to recommending items to a user, they recommend the videos that have edges or short walks ending at them starting from the videos the user has watched recently.

Fig 2: Performing inference of personalized recommendations on the same graph as Fig 1. The user has watched videos 1 and 3 before. We start from those and sample some short random walks that end in videos 2, 4 and 5. These are the ones we recommend.

This is an instance of “label propagation” on graphs. Label propagation refers to a case when we have labels for a few nodes from the training data and for the remaining points we are inferring the labels by using the graph structure. In this case, we already have the label “watched by this user” for videos 1 and 3. We are sampling a few paths in the graph based on the edge weights and propagating the label “can be watched by the user” to other nodes our paths take us to.

The basic idea is remarkably simple and intuitive. For example, in the context of video recommendation:

  1. Just build a graph with nodes corresponding to videos and keep adding to the graph as new recommendable videos are being added to the platform.

  2. Keep adding counts to the edges (a, b) to the graph as users watch video b soon after watching video a.

How to recommend (Serving)

Now at serving time, start with the videos in the user’s history and sample short random walks on this graph and recommend the videos you end at.

Other possible improvements:

  1. Diversity boosting:
    As mentioned in PinnerSage (by Pinterest/Stanford) team, instead of following paths from all items in user history, you could choose some representative items in history, perhaps by clustering, perhaps by recency. For example if my account (my daughter!) has watched thousands of Peppa Pig videos, I can just use the most recent one to find recommendations.

  2. Freshness:
    Keep removing 10% of “old” edges perdiocally. Example: Every day among edges that are a month old, remove 10% of them. Remove 10% of the edges that are 2 months old and so on. This will ensure freshness in your recommendations.

  3. Improve new user experience:
    New users to your platform won’t have any seed items to start search paths from. To be able to recommend to them, add a node “No video watched” to the graph and add an edge from this node to the first video a user is watching on your platform. This will help you recommend good videos to new users. This segment is particularly important for the success of your platform.

If you are interested to learn more, please read the “Further Reading” section at the end of the post.

2. Two Tower models vs GNNs

At this point, you are probably thinking about pros and cons of using a GNN vs a two-tower model for personalized recs.

Fig 3: A “two-tower” model tries to encode the information we have for a user into an embedding for that user, and similarly learn an embedding for each item in a way that the user embedding is closer to the items the user is likely to select and far from the ones the user is not interested in.

To recap from post #4, as shown in Fig 3 above, a two-tower model is a way to build recommender systems that has achieved state of the art results in platforms like Youtube and online ads (Pinterest). Just like GNNs it tries to learn embeddings of user and item nodes that are outputs of neural networks.

Fig 4: This is a GNN showing how we might recommend items to users 1 and 2. User 1 has previously liked items 1 and 3. User 2 has previously liked items 3 and 4. While recommending items to User 1, we will try to see which of the item nodes not in history (i.e. 2 and 4) have embeddings with the highest dot-product with the embedding for user 1. These prospective edges in the graph have been shown with green dashed lines.

As I have tried to show in Fig 4, a GNN can represent what a two-tower model is able to model. A two-tower model is exactly what a GNN model would be if you had exactly two types of nodes in your graph and all edges were between user and item nodes, i.e. if your graph were bipartite between user and item nodes.

In fact, you could think of a two tower model being able to capture more than bipartite graphs. As long as the information flow is acyclic, you could capture it with a two tower model. For instance, if you were trying to make a recommendation system in a domain where users can search items and items have publishers / creators, we would have four types of entities:

  1. users

  2. items

  3. publishers

  4. nodes for text entities that could be searched or could appear in the title / description of items.

  5. nodes for global context like “Christmas” / “Superbowl”

Fig 5: Showing how a Graph Neural Network with multiple types of nodes can represent the entities in your recommendation domain more accurately. You can then learn embeddings for publishers and search text as well. You can add custom features to each type of node. Keyword nodes could have pretrained BERT embeddings as input features for instance.

This can be modeled by a two-tower model. However as you can see in Fig 5 above, this is modeled more naturally using a GNN with different node types and features.

Astute observers might be wondering if GNNs allow the encoder for user nodes to be different than the encoder for an item. The answer is yes. That is because while technically encoders for all nodes in the graph are the same, one of the inputs to the encoder is the type of the node as an input feature. Hence we can build a pathway dependent upon the node type.

3. Theoretical foundations of Graph Neural Networks

Please see the slides here for a deep dive presentation on GNNs.

Besides the Youtube paper above, a few good papers on this topic are:

  1. DeepWalk - Social Recommendations (2014) : This uses the SkipGram approach from NLP to find embeddings of nodes that make sampled random walks of the Graph consistent.

  2. Node2Vec (2016): Expanding on DeepWalk but noting that the learned representation will be different if we focus on depth or breadth in sampling. (Read this excellent post to understand the difference)

  3. Graph Convolutional Networks - Kipf et al. 2016: An efficient variant of convolutional neural networks operating on graph structured data.

  4. Neural Graph Machines (2017): Learns neural networks to map node features into embeddings in a way that are similar between nodes that are connected by an edge in the graph.

In this video (at time 25:41), Petar Veličković (DeepMind) talks about how to train node embeddings of the graph such that the dot-product of nodes with an edge are high and the dot product of nodes without an edge should be low. He also elaborates how DeepWalk and Node2Vec generalize this notion of structure.

4. An event-based graph neural network

Motivation

  1. Changes happen due to events:
    If you look at Fig 4, the only time the user embedding would change is if any of the inputs to the “user-info encoder” would change. Since the user features, like location and language are more or less static after onboarding, the main changes in inputs could be due to
    (a) adding items to user history or
    (b) due to changes of embeddings of items in user history or
    (c) due to changes in global context like Christmas / Super Bowl etc.
    All of these can be represented in a graph.

  2. Training speed:
    A usual complaint from practitioners for the first generation of Graoh Neural Networks (like GCNs) was that training them is very slow. In a recent work, Temporal Graph Neural Networks (blog post and research paper) by Rossi et al. 2020 build a way to train GNNs which is event based and trains fast. For instance when a user likes an item, only the embeddings of the user node and the item node are recomputed.

  3. Better metrics:
    Researchers at Twitter have observed over 50% improvement in metrics using TGN (Fig 6).

Fig 6: Researchers at Twitter observe huge improvements in performance metrics by using TGNs on real world data sets where the graph is constantly changing and being able to recommend new items better matters disproportionately more.

If you are building a recommender system for a service at scale, I recommend using Temporal Graph Networks (TGNs).

Conclusion

We showed how to look at users, items and context of a personalized recommendation system as a graph. We showed how graphs based recommendations are at least as powerful as the state of the art two-tower model. Leading recsys platforms have observed over 50% gain in performance metrics using it (Fig 6).

Disclaimer: These are my personal opinions only. Any assumptions, opinions stated here are mine and not representative of my current or any prior employer(s).