Declarative Value-Model Tuning
Code showing a couple of approaches to achieve the desired task importance in a value model
Normally, value models in recommender systems are hand-tuned. We provide a couple of utilities to derive VM weights from targeted task importance.
Context
Value model (VM) weights are used to combine multiple task predictions in a recommender system. For instance, the following could be a config to produce a list ranked by (0.1 * P(watch > 3s) + 0.3 * P(watch > 30s) + 20 * P(watch & share) + 2 * P(watch & like) + 5 * P(watch & follow)).
```json
{
  "weights": {
    "p_watch_3s": 0.1,
    "p_watch_30s": 0.3,
    "p_watch_and_share": 20,
    "p_watch_and_like": 2,
    "p_watch_and_follow": 5
  }
}
```
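As a quick illustration of how such a config is applied, the score of each candidate item is just the weighted sum of its task predictions. The snippet below is a hypothetical sketch for exposition only; the names are made up and not taken from any particular codebase.

```python
# Hypothetical sketch: combine one item's task predictions into a single
# ranking score using the VM weights from the config above.
weights = {
    "p_watch_3s": 0.1,
    "p_watch_30s": 0.3,
    "p_watch_and_share": 20,
    "p_watch_and_like": 2,
    "p_watch_and_follow": 5,
}

def value_model_score(task_preds: dict) -> float:
    """Weighted sum of one candidate item's task predictions."""
    return sum(w * task_preds[task] for task, w in weights.items())

# Candidates are then sorted by value_model_score, highest first.
```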
More info on VM here and here.
Normal workflow
Normally, VM weights are tuned either by a grid search or by multiplying each task's current weight by 2 or 1/2 and running experiments.
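As an illustration of the multiply-by-2-or-1/2 approach, a sweep over candidate configs might look roughly like the sketch below (hypothetical code, not from any specific tool).

```python
# Illustrative sketch of the usual manual loop: for each task, try
# doubling and halving its weight and evaluate each variant in an A/B test.
current_weights = {
    "p_watch_30s": 0.3,
    "p_watch_and_share": 20,
}

candidate_configs = []
for task in current_weights:
    for multiplier in (2.0, 0.5):
        candidate = dict(current_weights)
        candidate[task] *= multiplier
        candidate_configs.append(candidate)

# Each candidate config would then be run as an online experiment arm.
```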
It would be empowering if practitioners had a tool where they could specify the desired importance of each task and compute the VM weights from that. This would give a great baseline to jump to, after which one can search near these weights.
Two approaches to task importance
In github.com/gauravchak/value_model_tuning, we look at two approaches for declarative VM tuning.
NDCG Gap Targeting: This computes a leave-one-out ranking for each task (a ranking produced with that task removed from the value model) and then computes the NDCG gap between that ranking and the current ranking. This gap, or delta, is treated as the importance of the task. The tool then tries to adjust the weights to achieve the desired relative gaps/deltas.
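A minimal sketch of the NDCG-gap computation is below. It assumes the current combined score is treated as the relevance signal when comparing rankings; the repo may define the gap somewhat differently, so treat this as an illustration of the idea rather than the actual implementation.

```python
# Sketch of leave-one-out NDCG gaps. `preds` is a (num_items, num_tasks)
# array of task predictions; `weights` is a (num_tasks,) array of current
# VM weights. Illustration only, not the repo's code.
import numpy as np

def dcg(relevance_in_rank_order: np.ndarray) -> float:
    discounts = 1.0 / np.log2(np.arange(2, len(relevance_in_rank_order) + 2))
    return float(np.sum(relevance_in_rank_order * discounts))

def ndcg_of_ranking(relevance: np.ndarray, ranking: np.ndarray) -> float:
    ideal = dcg(np.sort(relevance)[::-1])
    return dcg(relevance[ranking]) / ideal if ideal > 0 else 0.0

def leave_one_out_gaps(preds: np.ndarray, weights: np.ndarray) -> np.ndarray:
    current_scores = preds @ weights
    gaps = []
    for i in range(len(weights)):
        w = weights.copy()
        w[i] = 0.0  # rank without task i
        loo_ranking = np.argsort(-(preds @ w))
        # NDCG lost (with the current score as relevance) when task i is
        # dropped; a larger gap means task i matters more to the ranking.
        gaps.append(1.0 - ndcg_of_ranking(current_scores, loo_ranking))
    return np.array(gaps)
```

The weight-adjustment step, which matches these relative gaps to the user-specified targets, can be set up the same way as the fit shown in the next sketch.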
Per-task regret targeting: In the spirit of this Google DeepMind paper, this measures the regret of the current weights per task compared to ranking purely by that task, i.e., how much worse the current ranking is doing on that task. It then adjusts the weights so that these per-task regrets have the relative importance specified by the user.
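A rough sketch of the regret computation and of fitting weights to target relative regrets is below. The comments further down mention SLSQP, so the fit here uses scipy's SLSQP; the non-negativity bound, the squared-error objective, and the mapping from "importance" to target regret shares are illustrative assumptions rather than the repo's actual choices, and the `dcg`/`ndcg_of_ranking` helpers are repeated from the previous sketch so this block runs on its own.

```python
# Sketch of per-task regret targeting with an SLSQP fit.
# Illustration of the idea only, not the repo's actual API.
import numpy as np
from scipy.optimize import minimize

def dcg(relevance_in_rank_order: np.ndarray) -> float:
    discounts = 1.0 / np.log2(np.arange(2, len(relevance_in_rank_order) + 2))
    return float(np.sum(relevance_in_rank_order * discounts))

def ndcg_of_ranking(relevance: np.ndarray, ranking: np.ndarray) -> float:
    ideal = dcg(np.sort(relevance)[::-1])
    return dcg(relevance[ranking]) / ideal if ideal > 0 else 0.0

def per_task_regrets(preds: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """For each task, how much NDCG the combined ranking gives up versus
    ranking purely by that task's own predictions."""
    current_ranking = np.argsort(-(preds @ weights))
    return np.array([
        1.0 - ndcg_of_ranking(preds[:, i], current_ranking)
        for i in range(preds.shape[1])
    ])

def fit_weights(preds: np.ndarray,
                target_relative_regrets: np.ndarray,
                w0: np.ndarray) -> np.ndarray:
    """Find weights whose normalized per-task regrets are as close as
    possible to the user-specified relative targets."""
    target = np.asarray(target_relative_regrets, dtype=float)
    target = target / target.sum()

    def objective(w: np.ndarray) -> float:
        regrets = per_task_regrets(preds, w)
        relative = regrets / (regrets.sum() + 1e-12)
        return float(np.sum((relative - target) ** 2))

    # Non-negative weights are an illustrative constraint; see the comments
    # below for the question about allowing negative weights.
    result = minimize(objective, w0, method="SLSQP",
                      bounds=[(0.0, None)] * len(w0))
    return result.x
```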
Disclaimer: These are the personal opinions of the author(s). Any assumptions or opinions stated here are theirs and not representative of or attributable to their current or any prior employer(s). Apart from publicly available information, any other information here is not claimed to refer to any company, including ones the author(s) may have worked at or been associated with.
Comments
Could you also talk a bit about the convergence of the approach? What if the desired NDCG is not achievable, i.e., reducing the NDCG regret for one component increases the NDCG regret for another? How do we ensure the solution will converge?
The components are highly correlated as well; does this method assume independence among the different components going into the value model?
A few questions:
1) Curious: when using SLSQP, does the objective function need to be differentiable, and how would that work when using NDCG? IIRC SLSQP assumes f(x) is differentiable and approximates it with linearizations; I am not sure how we are able to use NDCG here.
2) Also, it seems the problem has changed from finding weights to finding the desired NDCG regrets or importances to begin with. How does one come up with those desired targets? It seems the desired importances would also need to be personalized based on the user and the videos.
3) What if we want some weights to be negative as well, to demote clickbait or low-quality content?
4) The regret-based approach: does it have any relation to scalarization? Essentially, it seems to me that we are taking the multi-objective function and scalarizing it: sum(lambda * MSE(1 - ndcg(task_rank, current_rank))). The part I don't understand is how this works or converges when NDCG itself is non-differentiable.