<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Reinforced]]></title><description><![CDATA[A newsletter/blog about Reinforcement Learning and Generative AI]]></description><link>https://www.reinforced.info</link><image><url>https://substackcdn.com/image/fetch/$s_!iFm-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1279623c-9a5c-4928-b19f-e5dface5928d_1280x1280.png</url><title>Reinforced</title><link>https://www.reinforced.info</link></image><generator>Substack</generator><lastBuildDate>Sun, 10 May 2026 11:49:41 GMT</lastBuildDate><atom:link href="https://www.reinforced.info/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alex Nikulkov]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[reinforced@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[reinforced@substack.com]]></itunes:email><itunes:name><![CDATA[Alex Nikulkov]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alex Nikulkov]]></itunes:author><googleplay:owner><![CDATA[reinforced@substack.com]]></googleplay:owner><googleplay:email><![CDATA[reinforced@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alex Nikulkov]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Positive Gradients, Negative Gradients]]></title><description><![CDATA[+ the Importance of Pre-Training Priors]]></description><link>https://www.reinforced.info/p/positive-gradients-negative-gradients</link><guid isPermaLink="false">https://www.reinforced.info/p/positive-gradients-negative-gradients</guid><dc:creator><![CDATA[Alex Nikulkov]]></dc:creator><pubDate>Fri, 19 Dec 2025 06:17:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vE3f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vE3f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vE3f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 424w, https://substackcdn.com/image/fetch/$s_!vE3f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 848w, https://substackcdn.com/image/fetch/$s_!vE3f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vE3f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vE3f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png" width="1456" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5375468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.reinforced.info/i/181963312?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vE3f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 424w, https://substackcdn.com/image/fetch/$s_!vE3f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 848w, https://substackcdn.com/image/fetch/$s_!vE3f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 1272w, https://substackcdn.com/image/fetch/$s_!vE3f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12ede7f3-2173-40a8-b46c-dd5be3fb0511_3200x1312.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In Reinforcement Learning (RL), particularly when fine-tuning Large Language Models (LLMs), we often treat <strong>positive and negative feedback</strong> as two sides of the same coin. Mathematically, the transition from positive to negative examples in the loss function is smooth. Whether you are maximizing the log-probability of a &#8220;good&#8221; response or minimizing the log-probability of a &#8220;bad&#8221; one, the gradient updates look qualitatively similar. However, <strong>their effects on the model couldn&#8217;t be more different</strong>.</p><p>Understanding this asymmetry explains why RL often leads to <strong>model collapse</strong> and why the <strong>prior established during pre-training</strong> deserves much more attention than it currently gets. The core difference lies in how the gradients affect the output distribution.</p><p>When we train on <strong>on-policy positive examples</strong>, we are reinforcing samples that already lie in high-density regions of the output distribution (since the model generated them). Applying a positive gradient here effectively says, &#8220;Do exactly this, but more.&#8221; This creates a &#8220;<strong>rich get richer</strong>&#8220; effect. The model pushes the peaks of the probability distribution even higher. Because probability must sum to 1, this mass is taken from the tail of the distribution. The result is a rapid reduction in entropy and a collapse in diversity. The model becomes extremely confident in its current path.</p><p><strong>Negative gradients</strong> operate differently. When we apply a negative gradient to an on-policy sample (which, again, is likely a high-probability error), we are chopping off the peak. The optimization process is forced to redistribute that probability mass elsewhere to maintain normalization. The loss function doesn&#8217;t specify <em>where</em> that mass should go, only that it cannot stay at the erroneous location. This forces the model to lift the lower-probability regions, naturally flattening the distribution and <strong>increasing the entropy</strong>. It&#8217;s important to note here that when the probability gets redistributed, it still generally remains within the support of the output distribution, with lower-prior regions receiving less of the redistribution.</p><p>The figure below illustrates the effects of positive (green) and negative (red) on-policy examples. 
<p>The figure below illustrates the effects of positive (green) and negative (red) on-policy examples. The positive examples locally &#8220;absorb&#8221; the probability density, while the negative examples &#8220;expel&#8221; it.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pb31!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4278da03-f57b-488a-bd2e-78d0331f4173_2480x3508.png" width="1456" height="2060" alt=""><figcaption class="image-caption">The effect of positive and negative examples on the probability density</figcaption></figure></div><h1>Evidence from the Literature</h1><p>The paper <em><a href="https://arxiv.org/pdf/2506.01347">&#8220;The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning&#8221;</a></em> investigates this exact phenomenon. The authors find that <strong>positive on-policy examples quickly collapse entropy</strong>. While this might sharpen the model&#8217;s best guess (improving pass@1), <strong>it destroys the diversity needed for test-time compute scaling</strong> (hurting pass@k for large k).</p><p>Conversely, <strong>negative examples effectively prune the &#8220;wrong&#8221; modes without collapsing the distribution</strong>. They maintain higher entropy, preserving the model&#8217;s ability to generate diverse candidate solutions, which is critical for difficult reasoning problems where the first guess isn&#8217;t always right.</p><p>Figure 5 below shows that <strong>Positive Sample Reinforcement</strong> (PSR, on-policy RL on positive examples only) <strong>collapses the entropy</strong> and yields only a small increase in accuracy, while <strong>Negative Sample Reinforcement</strong> (NSR, on-policy RL on negative examples only) <strong>avoids the entropy collapse</strong> and reaches accuracy levels similar to PPO/GRPO. The authors draw an analogy of <strong>PSR being Exploitation</strong> and <strong>NSR being Exploration</strong>: PSR reinforces discovered positive behaviors, while NSR shifts probability away from discovered negative behaviors, and the reallocation of this probability to other outputs causes some exploration.
An important caveat is that <strong>NSR redistributes probability within the support of the policy, so the extent of exploration is fairly limited</strong>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!B5Lh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957ccbfd-9868-4bd2-b99f-0fbf82f277b3_1640x1338.png" width="1456" height="1188" alt=""><figcaption class="image-caption">Figure 5 from <em><a href="https://arxiv.org/pdf/2506.01347">The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning</a></em></figcaption></figure></div>
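<p>To make the PSR/NSR split concrete, here is a minimal sketch (my own rendering, not the paper&#8217;s code) of how both reduce to the same REINFORCE-style loss with the training signal restricted to one sign of the verifier outcome:</p><pre><code># PSR / NSR as sign-restricted REINFORCE (a sketch, not the paper's code).
import torch

def psr_nsr_loss(seq_logprobs, correct, mode="nsr"):
    """seq_logprobs: (batch,) log pi(y|x) per response; correct: (batch,) bool."""
    rewards = correct.float() * 2.0 - 1.0        # +1 for correct, -1 for incorrect
    if mode == "psr":
        mask = correct.float()                   # train only on correct responses
    elif mode == "nsr":
        mask = 1.0 - correct.float()             # train only on incorrect responses
    else:                                        # "both": plain REINFORCE
        mask = torch.ones_like(rewards)
    return -(mask * rewards * seq_logprobs).mean()</code></pre>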
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5 from <em><a href="https://arxiv.org/pdf/2506.01347">The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning</a></em></figcaption></figure></div><p>This aligns with findings from <em><a href="https://arxiv.org/pdf/2504.13837">&#8220;Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?&#8221;</a></em>. This paper suggests that <strong>RL doesn&#8217;t teach the model </strong><em><strong>new</strong></em><strong> reasoning behaviors</strong>. Instead, <strong>it prioritizes useful patterns that already exist after pre-training</strong> (via positive gradients) and <strong>prunes incorrect ones</strong> (via negative gradients). Figure 1 illustrate this - <strong>RLVR training increases pass@1 rates, but reduces pass@k rates for large k&gt;64</strong>. Additionally, Table 2 shows that <strong>for AIME24 there isn&#8217;t a single problem which the base model couldn&#8217;t solve </strong>(with a large number of samples k=1024)<strong>, but the post-RLVR model can solve</strong>. 
Similarly, for MATH500 the post-RLVR model gains the ability to solve only 1% of problems that couldn&#8217;t be solved by the base model, but at the same time it loses the ability to solve 3.6% of problems that the pre-trained model could solve.</p>
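<p>Since the whole argument hinges on the pass@1 vs pass@k distinction, it&#8217;s worth recalling how pass@k is computed. Below is the standard unbiased estimator (the formula popularized by the Codex paper, not something introduced in the works above):</p><pre><code># Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -&gt; float:
    """n: samples drawn per problem, c: correct samples among them, k: budget."""
    if n - c &lt; k:
        return 1.0          # too few wrong samples to fill a failing batch of k
    prob_all_wrong = 1.0
    for i in range(k):      # stable product form of C(n-c, k) / C(n, k)
        prob_all_wrong *= (n - c - i) / (n - i)
    return 1.0 - prob_all_wrong

# e.g. 12 correct out of 1024 samples: pass@1 is ~1.2%, pass@64 is already ~54%
print(pass_at_k(1024, 12, 1), pass_at_k(1024, 12, 64))</code></pre>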
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!YKW7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01d95b4-e912-47b5-b9e9-39a527312b5d_1018x974.png" width="376" height="360" alt=""><figcaption class="image-caption">Figure 1 from <em><a href="https://arxiv.org/pdf/2504.13837">&#8220;Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?&#8221;</a></em></figcaption></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!S-bO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bc3085-7983-457c-8268-1f014d04704b_1084x468.png" width="422" height="182" alt=""><figcaption class="image-caption">Table 2 from <em><a href="https://arxiv.org/pdf/2504.13837">&#8220;Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?&#8221;</a></em></figcaption></figure></div><p>The bottom line is fairly clear: <strong>on-policy RLVR reliably increases accuracy (pass@1), but it reduces generation diversity, leading to a reduction in pass@k for large k</strong>. This training recipe can give us a reliable solver of simple problems, but it <strong>won&#8217;t produce a model that comes even close to pushing the frontier of knowledge</strong>.</p><h1>Training Example Sources: On-Policy vs Off-Policy</h1><p>The patterns described above apply only to on-policy training data. Off-policy training examples are substantially different.</p><p><strong>Off-policy positive examples</strong> are used for Supervised Fine-Tuning (SFT) and are generally known to improve both accuracy and diversity (unless we overfit by training for too many epochs). If we show the model a &#8220;Gold&#8221; solution to which it currently assigns low probability, we are <strong>forcing it to expand the support of its distribution</strong>. This <strong>increases diversity</strong> by teaching the model a new mode it hadn&#8217;t discovered on its own. The main downside of off-policy positive examples is that by themselves they don&#8217;t induce good generalization, because there is no pruning via negative on-policy examples.</p><p><strong>Off-policy negative examples</strong> are not well understood. My intuition here is that if you penalize a behavior which the model essentially never exhibits (low probability mass), the gradient should be negligible.
But such examples are a core part of some successful DPO-based training recipes, and their effect deserves a more thorough investigation.</p><p>We end up with the following priority list of types of training examples:</p><ol><li><p><strong>Negative On-Policy:</strong> High utility. Prunes errors, maintains diversity.</p></li><li><p><strong>Positive Off-Policy:</strong> High utility. Expands diversity, teaches new behaviors.</p></li><li><p><strong>Positive On-Policy:</strong> Mixed utility. Sharpens pass@1 but risks mode collapse.</p></li><li><p><strong>Negative Off-Policy:</strong> Unclear utility. Needs more research.</p></li></ol><h1>The Prior</h1><p>If on-policy RL is primarily a mechanism for reinforcing existing good behaviors (sharpening peaks) or pruning wrong ones (cutting peaks), then we are left with an uncomfortable conclusion: <strong>it can&#8217;t learn qualitatively new behaviors.</strong></p><p>The irony here is that <strong>a tight prior from pre-training is the very thing that makes RL possible for LLMs</strong> in the first place. After pre-training, <strong>most of the output probability is concentrated in just a few tokens</strong>, which effectively shrinks the action space to 2-5 options at each position, compared to the full unconstrained vocabulary size of ~130k tokens. <strong>It would be hopeless to start RL training from a uniform prior</strong> over a discrete action space with 130k elements, especially when you consider long trajectories with only terminal rewards.</p>
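<p>This concentration is easy to verify empirically. Here is a quick sketch (the model name below is just an example; any Hugging Face causal LM will do) that counts how many tokens are needed to cover 90% of the next-token probability mass:</p><pre><code># Measure the effective action space of a pre-trained LM (illustrative sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # example model; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]                # next-token logits
probs = torch.softmax(logits, dim=-1).sort(descending=True).values
support_90 = int((probs.cumsum(dim=0) &lt; 0.9).sum()) + 1   # tokens covering 90% mass
print(f"{support_90} of {probs.numel()} tokens hold 90% of next-token probability")</code></pre>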
<p>The effectiveness of the RL post-training pipeline is strictly bounded by the support of the prior produced by pre-training. To break these bounds, we cannot simply rely on standard LLM RL with high-temperature sampling. We have two main paths forward:</p><ol><li><p><strong>Fix the prior:</strong> We can encourage significantly <strong>more diversity during pre-training</strong>, ensuring the initial distribution extensively covers the space of responses. Synthetic data augmentation for pre-training could potentially help here, but we&#8217;d need to be careful to avoid regurgitating the data without substantially transforming it. Also, this introduces a <strong>risk of capability dilution</strong> if the output distribution includes too many low-quality outputs.</p></li><li><p><strong>Break out from the prior:</strong> A much more promising path, in my opinion. We need <strong>advanced exploration methods</strong> that go beyond the model&#8217;s current policy (true off-policy exploration) to expand the support of the output distribution, rather than just reshaping what is already there. The most useful thing would be to <strong>generate positive examples which are slightly off-policy to gradually expand the output distribution towards correct responses</strong>.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Bandits vs Reinforcement Learning from Human Feedback]]></title><description><![CDATA[Are single-step or multi-step models better suited for RLHF? Find out the main differences between them in this post]]></description><link>https://www.reinforced.info/p/bandits-vs-reinforcement-learning</link><guid isPermaLink="false">https://www.reinforced.info/p/bandits-vs-reinforcement-learning</guid><dc:creator><![CDATA[Alex Nikulkov]]></dc:creator><pubDate>Tue, 30 Apr 2024 14:01:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MVzv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7928c60e-278f-4680-83ba-b8b69e0b3350_1792x1024.webp" length="0" type="image/webp"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MVzv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7928c60e-278f-4680-83ba-b8b69e0b3350_1792x1024.webp" width="1456" height="832" alt=""></figure></div><p>Reinforcement Learning from Human Feedback (RLHF) has caused a surge of interest in Reinforcement Learning (RL) algorithms, specifically Proximal Policy Optimization (PPO) <a href="https://arxiv.org/abs/1707.06347">[1]</a>, popularized for RLHF by the InstructGPT <a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html">[2]</a> paper. The success of InstructGPT and its commercial successor ChatGPT has made PPO the de facto standard algorithm for LLM fine-tuning. But <strong>a recent paper <a href="https://arxiv.org/abs/2402.14740">[3]</a> questioned whether the complexity of PPO is necessary and showed that a simpler REINFORCE algorithm applied to a contextual bandit model works better</strong>. In this post, I&#8217;ll share the main ideas from this paper and discuss the differences between using contextual bandits and multi-step RL for LLM fine-tuning.</p><h1>REINFORCE paper</h1><p><a href="https://cohere.com/">Cohere</a> has recently published a paper &#8220;Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs&#8221; <a href="https://arxiv.org/abs/2402.14740">[3]</a>, in which they <strong>argue that REINFORCE should replace PPO as the default RLHF algorithm</strong>. Moreover, they <strong>suggest that multi-step RL algorithms aren&#8217;t necessary for LLM fine-tuning and simpler single-step contextual bandit algorithms can be used instead</strong>. Even though REINFORCE is usually used as a multi-step RL algorithm, this paper uses its simplified contextual bandit version.
Their arguments center on the fact that PPO was developed for different types of problems, like robotics and control, whose challenges are distinct from those of RLHF. Quoting from the paper:</p><blockquote><p>We note that PPO, as an approach, emphasizes stability across iterations, aiming to train an effective policy with the premise of small, stable updates. PPO was designed for a regime where off-policy gradient updates are large enough to introduce instability. This regime dominates traditional Deep-RL benchmarks [...] However, in this work, we posit that the setting of RLHF, which involves fine-tuning a pre-trained LLM, is lacking in these characteristics.</p></blockquote><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!8ptG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5544f45-2f02-4ead-8798-158ba35208a3_976x98.png" width="976" height="98" alt="The PPO objective"></figure></div>
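<p>For reference, here is a minimal code rendering of the clipped surrogate objective shown in the figure above (the textbook PPO-clip form, my transcription rather than the paper&#8217;s exact notation):</p><pre><code># Standard PPO clipped surrogate loss (textbook form).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # off-policy probability ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # maximize the pessimistic (min) surrogate, i.e. minimize its negation
    return -torch.min(ratio * advantages, clipped * advantages).mean()</code></pre>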
<p>They identify <strong>3 distinct components of PPO</strong> (see the PPO objective above) and <strong>show evidence that each of these components is unnecessary for RLHF of pre-trained LLMs</strong> (a stripped-down version of the resulting loss is sketched after the list). The 3 components are:</p><ol><li><p><strong>Clipped off-policy probability ratios</strong>. They found that clipping was applied &lt;5% of the time in their RLHF runs, so it&#8217;s rarely necessary, and that the algorithm worked better when clipping was turned off. This shows that large off-policy updates are uncommon when fine-tuning pre-trained LLMs.</p><ol><li><p>Moreover, they found that the probability ratios were close to 1 most of the time, so they removed the ratios from the loss without any impact on learning stability or quality. This switched the algorithm from being off-policy to a simpler on-policy PG algorithm.</p></li></ol></li><li><p><strong>Generalized Advantage Estimation (GAE) <a href="https://arxiv.org/abs/1506.02438">[4]</a> and separate value function</strong>. The advantage term in PPO is estimated by GAE, which bootstraps from the learned value function to reduce variance. GAE has a hyperparameter <code>lambda</code>, which controls the bias-variance tradeoff. It turns out that setting <code>lambda=1</code> results in the best performance. Under this setting the algorithm completely ignores the value function and just uses the reward model (highest variance, lowest bias).</p></li><li><p><strong>Multi-step RL modeling</strong>. In PPO it&#8217;s assumed that text generation is a token-by-token multi-step process. The paper shows that this is unnecessary and that a simpler bandit approach, which models the whole response as a single step, consistently outperforms the multi-step RL approach. This allows us to avoid dealing with partial responses and only consider full responses.</p></li></ol>
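<p>As promised, here is what remains once all three components are stripped away: sequence-level REINFORCE with one scalar advantage per full response (my sketch of the simplification, not code from the paper):</p><pre><code># Sequence-level REINFORCE: the loss left after removing PPO's three components.
import torch

def reinforce_loss(token_logprobs, mask, advantage):
    """token_logprobs, mask: (batch, seq); advantage: (batch,) one scalar per response."""
    seq_logprob = (token_logprobs * mask).sum(dim=-1)  # log pi(y|x) for the whole response
    return -(advantage.detach() * seq_logprob).mean()  # no ratios, no clipping, no critic</code></pre>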
<p>One of the core arguments about the ineffectiveness of PPO for LLM fine-tuning is related to its focus on reducing variance. <strong>It turns out that RLHF of LLMs is much less exposed to high gradient variance than classical deep RL environments, so it&#8217;s not worth increasing bias for the sake of reducing variance</strong>. The main reasons for the low gradient variance during RLHF of pre-trained LLMs are:</p><ol><li><p><strong>High quality of pre-trained LLMs</strong>. Most classical RL environments cannot pre-train the agents before starting RL training, so the initial policy is very poor and requires variance reduction to prevent destructively large gradient updates.</p></li><li><p><strong>Small size of effective action space</strong>. While the full action space size at each decoding step is equal to the number of tokens in the vocabulary (on the order of 10k-100k), pre-training leads to a strong concentration of probability mass in just a handful of tokens, making the action space effectively smaller. In their experiments, 60% of probability mass was concentrated in the top token and 90%+ in the top 16 tokens.</p></li></ol><h1>Comparison of RL and Bandits for LLMs</h1><p>Why do we even have a choice of whether to use RL or bandit models for LLMs? After all, nobody in their right mind would consider modeling a robot or a chess agent as a bandit environment. It&#8217;s because <strong>in LLMs the transition dynamics are very simple and deterministic</strong>. The state is modeled as the sequence of previous tokens and the action is the choice of the next token to decode. This leads to a simple state transition rule: append the action token to the list of previous tokens. <strong>Since no new information is revealed after each token, selecting the tokens one by one is equivalent to pre-committing to a sequence of tokens and executing them one by one</strong>. This does not prevent us from updating the token probabilities autoregressively by feeding each new token back into the model before generating the next token.</p>
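<p>The entire &#8220;environment&#8221; of text generation fits in a couple of lines (a deliberately trivial sketch to underline the point):</p><pre><code># The full "dynamics" of the text-generation MDP: a deterministic append.
def transition(state: list[int], action: int) -&gt; list[int]:
    """state: prompt + tokens generated so far; action: the next token id."""
    return state + [action]   # nothing stochastic, no new information revealed</code></pre>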
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a82110f4-160b-4585-9295-99a3aad75345_2580x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209956,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8UA4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82110f4-160b-4585-9295-99a3aad75345_2580x1056.png 424w, https://substackcdn.com/image/fetch/$s_!8UA4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82110f4-160b-4585-9295-99a3aad75345_2580x1056.png 848w, https://substackcdn.com/image/fetch/$s_!8UA4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82110f4-160b-4585-9295-99a3aad75345_2580x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!8UA4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82110f4-160b-4585-9295-99a3aad75345_2580x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To better understand the structural differences between RL and bandit approaches, let&#8217;s examine their <a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">Policy Gradient &nbsp;(PG, [5])</a> objectives more closely. PPO uses an off-policy PG expression, while REINFORCE uses an on-policy PG, but this distinction is orthogonal to using RL vs bandits, so I will use on-policy PG in both RL and bandit examples. 
<p>To better understand the structural differences between RL and bandit approaches, let&#8217;s examine their <a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">Policy Gradient (PG, [5])</a> objectives more closely. PPO uses an off-policy PG expression, while REINFORCE uses an on-policy PG, but this distinction is orthogonal to using RL vs bandits, so I will use on-policy PG in both the RL and bandit examples. I will strip away all the bells and whistles of the algorithms and consider the simplest reasonable RL and bandit formulations for LLM fine-tuning to illustrate the difference between them.</p><p>First, a simple RL objective for a single trajectory of length <em>T</em> looks like this:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{t=1}^T A(y_t|x,y_1,...,y_{t-1})\\ln \\pi (y_t|x,y_1,...,y_{t-1})&quot;,&quot;id&quot;:&quot;DSXVVEFYLU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Compare this to a bandit objective:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A(y,x) \\ln \\pi (y|x) = A(y,x) \\ln \\prod_{t=1}^T \\pi (y_t|x,y_1,...,y_{t-1})= \\sum_{t=1}^T A(y,x)\\ln \\pi (y_t|x,y_1,...,y_{t-1})&quot;,&quot;id&quot;:&quot;SJRCPQVVRE&quot;}" data-component-name="LatexBlockToDOM"></div><p>The objective functions are remarkably similar. The main difference is that multi-step RL uses a different advantage value at each step, estimated by an advantage function that takes partial responses as inputs, usually implemented via a learned value function. The bandit model, on the other hand, uses the same advantage value for each step, and the advantage function takes the full response as an input. In <a href="https://arxiv.org/abs/2402.14740">[3]</a> the bandit advantage is implemented through the REINFORCE Leave-One-Out (RLOO) algorithm, which removes the need for a learned value/baseline function: it generates multiple responses for the same prompt and uses the average reward model scores of the other responses as a baseline to calculate the advantage (see the sketch below). <strong>The core difference between RL and bandit losses is whether the advantage function takes full or partial responses as an input</strong>.</p>
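<p>A minimal sketch of the RLOO advantage (my rendering of the idea, not the paper&#8217;s code):</p><pre><code># Leave-one-out (RLOO) advantage for k responses sampled for the same prompt.
import torch

def rloo_advantage(rewards):
    """rewards: (k,) reward-model scores of k responses to one prompt."""
    k = rewards.numel()
    baseline = (rewards.sum() - rewards) / (k - 1)  # mean reward of the other k-1
    return rewards - baseline

print(rloo_advantage(torch.tensor([1.0, 0.0, 0.0, 1.0])))
# tensor([ 0.6667, -0.6667, -0.6667,  0.6667])</code></pre>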
<p><strong>The most obvious disadvantage of the bandit approach to LLMs is that it attributes the reward only to the full response, not to individual tokens.</strong> This is OK for vanilla RLHF, but <strong>it would not be efficient for more advanced approaches that attribute some reward values to individual tokens</strong>. Dense token-level feedback could be useful to reduce hallucinations and toxicity <a href="https://proceedings.neurips.cc/paper_files/paper/2023/file/b8c90b65739ae8417e61eadb521f63d5-Paper-Conference.pdf">[6]</a>, or to reward the model for making partial progress <a href="https://arxiv.org/pdf/2304.01904">[7]</a>, <a href="https://arxiv.org/pdf/2305.20050">[8]</a>.</p><p>The table below summarizes the pros and cons of using multi-step RL and bandits for LLM fine-tuning.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!qdz_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fdfe0b9-0e37-4e01-acae-13d726f03c34_1272x1224.png" alt="Table summarizing the pros and cons of multi-step RL and bandit formulations for LLM fine-tuning"></figure>
<p>It&#8217;s worth mentioning that the InstructGPT paper <a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html">[2]</a> has caused a good amount of confusion in the community about whether a bandit or an RL setup was used for RLHF fine-tuning. Without open-source code available, the community was left to guess how exactly RLHF was implemented in InstructGPT (and later in ChatGPT and GPT-4), not unlike religious scholars arguing over the meaning of passages from the Bible. My bet is on OpenAI having used multi-step RL and simply not being clear in describing it.</p><p>Evidence for InstructGPT using multi-step RL:</p><ol><li><p>The paper refers to a separate reward model and value model. This makes sense only in a multi-step RL world.</p></li><li><p>The paper mentions Generalized Advantage Estimation (GAE, [4]), which applies only to multi-step RL.</p></li></ol><p>Evidence for InstructGPT using bandits:</p><ol><li><p>A quote from the paper: &#8220;<em>The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt</em>&#8221;.</p></li><li><p>John Schulman (the main author of the PPO paper and one of the core contributors to InstructGPT) said in a <a href="https://www.talkrl.com/episodes/john-schulman/transcript">podcast interview</a> shortly after the release of the InstructGPT paper: &#8220;<em>...we've been just looking at a contextual bandit problem. ... you get a query and you output a response and then that response gets a reward. So if we had a multi-step process, such as a conversation where you can't assign a reward until the very end of the conversation ... You would probably have to ... train a Q function.
I think we'll have to start exploring this at some point soon, but so far we haven't</em>&#8221;.</p></li></ol><p>Framing LLM fine-tuning as a bandit problem opens the door to <strong>future applications of a rich set of bandit methods to LLMs</strong>. Contextual bandits are mathematically much simpler than RL, so a wider range of rigorous methods has been developed for them, while RL has often followed a more relaxed approach, with fewer conclusive proofs and a stronger reliance on empirical results. <strong>Some examples of bandit-based methods that could be applied to LLMs: offline learning <a href="https://openreview.net/forum?id=SJaP_-xAb">[9]</a>, offline evaluation <a href="https://proceedings.neurips.cc/paper_files/paper/2019/hash/d60678e8f2ba9c540798ebbde31177e8-Abstract.html">[10]</a>, and exploration <a href="https://www.jmlr.org/papers/v22/18-863.html">[11]</a>.</strong> Another development I expect is the <strong>customization of contextual bandit methods to the unique action space of LLMs: a variable-size discrete action space</strong>. LLMs generate tokens until the EOS token is produced (or the maximum generation length is reached). Each token can be modeled as a discrete choice along one dimension of the action space, but this leads to a variable dimensionality of the action space, depending on the number of generated tokens. The simple approach used now is to take the sequential product of individual token probabilities as the probability of the full response, but more specialized approaches could be useful.</p>
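<p>As a minimal illustration of that convention (a sketch with a hypothetical function name, not any library&#8217;s API): the probability the bandit policy assigns to a whole response is the product of per-token probabilities, which in log space is just a sum:</p><pre><code>import torch

def response_logprob(token_logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """log pi(y|x) = sum_t log pi(y_t | x, y_1..y_{t-1}).
    token_logits: [T, vocab_size] logits at each generated position;
    token_ids:    [T] the tokens that were actually sampled."""
    logps = torch.log_softmax(token_logits, dim=-1)     # [T, vocab_size]
    picked = logps.gather(-1, token_ids.unsqueeze(-1))  # [T, 1] log-prob of each sampled token
    return picked.squeeze(-1).sum()  # one scalar "action" log-probability for the bandit</code></pre>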
<h1>Recap</h1><ol><li><p>LLM text generation can be framed as a multi-step MDP and solved with Reinforcement Learning methods, or as a single-step MDP and solved with Contextual Bandit methods.</p></li><li><p>PPO applied to a multi-step MDP is the de-facto standard approach to RLHF, but a recent paper showed that the simpler REINFORCE algorithm, applied to a single-step contextual bandit MDP, outperforms PPO.</p></li><li><p>More research into Contextual Bandit methods for RLHF is likely to come out in the coming years, improving both training and evaluation.</p></li></ol><h1>References</h1><ol><li><p><a href="https://arxiv.org/abs/1707.06347">Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).</a></p></li><li><p><a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html">Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.</a></p></li><li><p><a href="https://arxiv.org/abs/2402.14740">Ahmadian, Arash, Chris Cremer, Matthias Gall&#233;, Marzieh Fadaee, Julia Kreutzer, Ahmet &#220;st&#252;n, and Sara Hooker. "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs." arXiv preprint arXiv:2402.14740 (2024).</a></p></li><li><p><a href="https://arxiv.org/abs/1506.02438">Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).</a></p></li><li><p><a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">Weng, Lilian. "Policy Gradient Algorithms." lilianweng.github.io (2018).</a></p></li><li><p><a href="https://proceedings.neurips.cc/paper_files/paper/2023/file/b8c90b65739ae8417e61eadb521f63d5-Paper-Conference.pdf">Wu, Zeqiu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. "Fine-grained human feedback gives better rewards for language model training." Advances in Neural Information Processing Systems 36 (2024).</a></p></li><li><p><a href="https://arxiv.org/pdf/2304.01904">Paul, Debjit, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. "Refiner: Reasoning feedback on intermediate representations." arXiv preprint arXiv:2304.01904 (2023).</a></p></li><li><p><a href="https://arxiv.org/pdf/2305.20050">Lightman, Hunter, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. "Let's Verify Step by Step." arXiv preprint arXiv:2305.20050 (2023).</a></p></li><li><p><a href="https://openreview.net/forum?id=SJaP_-xAb">Joachims, Thorsten, Adith Swaminathan, and Maarten De Rijke. "Deep learning with logged bandit feedback." In International Conference on Learning Representations. 2018.</a></p></li><li><p><a href="https://proceedings.neurips.cc/paper_files/paper/2019/hash/d60678e8f2ba9c540798ebbde31177e8-Abstract.html">Dud&#237;k, Miroslav, Dumitru Erhan, John Langford, and Lihong Li. "Doubly robust policy evaluation and optimization." Statistical Science 29, no. 4 (2014): 485-511.</a></p></li><li><p><a href="https://www.jmlr.org/papers/v22/18-863.html">Hao, Botao, Tor Lattimore, and Csaba Szepesvari. "Adaptive exploration in linear contextual bandit." In International Conference on Artificial Intelligence and Statistics, pp. 3536-3545.
PMLR, 2020.</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Reward Model Overoptimization: Root Causes and Mitigations]]></title><description><![CDATA[When I first ran an RLHF training job, I was surprised at how easily the reward model scores increased during the training process.]]></description><link>https://www.reinforced.info/p/reward-model-overoptimization</link><guid isPermaLink="false">https://www.reinforced.info/p/reward-model-overoptimization</guid><dc:creator><![CDATA[Alex Nikulkov]]></dc:creator><pubDate>Sun, 07 Apr 2024 02:43:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AbHS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd477dd0-a82b-4ffe-8d0a-2d2acaba4127_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!AbHS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd477dd0-a82b-4ffe-8d0a-2d2acaba4127_1024x1024.webp" alt="A comical illustration of a robot student in a classroom, cheating on a school exam with a cheat sheet held under the desk while the teacher and human students remain unaware."></figure><p>When I first ran an RLHF training job, I was surprised at how easily the reward model scores increased during training. It worked on the first attempt: no hyperparameter search, no deep analysis of network weights or gradients. It seemed almost too good to be true. Well, it was. While PPO easily pushed the reward model scores up to almost astronomical values, this didn&#8217;t translate into perceived improvements in text quality. Instead, the model devolved into generating gibberish, like empty outputs or a single emoji repeated hundreds of times.
My initial excitement turned into disappointment, as I was facing a case of <em><strong>reward model overoptimization</strong></em>.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!TQ4u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3774e2e8-8ff8-4ea4-a9f4-60a5a36baac0_912x858.png" alt="Training reward scores keep increasing while true quality peaks and then drops"><figcaption class="image-caption">Figure 1. An illustration of reward model overoptimization: training reward scores keep increasing, while the true quality peaks and then drops. Image credit: <a href="https://arxiv.org/pdf/2310.02743.pdf">[3]</a></figcaption></figure><p>Reward model overoptimization lives at the intersection of two well-known ML concepts: <strong>reward hacking</strong> and <strong>distribution shift</strong>. When combined, they create a perfect storm of deception during RLHF training.</p><p><strong>Reward hacking</strong> occurs when the objective specified for a Reinforcement Learning (RL) agent does not fully encapsulate the creator's intended outcome. Given the complexity of accurately codifying desired behaviors mathematically, it's common for system designers to opt for simpler reward functions, and <strong>RL agents often seek the path of least resistance, optimizing for the specified rewards in unintended ways</strong>. A famous hypothetical <a href="https://en.wikipedia.org/wiki/Instrumental_convergence#Paperclip_maximizer">paperclip maximizer</a> is given the goal of producing as many paperclips as possible (a reasonable goal to specify for an AI managing a paperclip factory). It might deduce that converting all human matter into paperclips is the most efficient strategy - not quite the behavior we&#8217;d want from a factory manager. A less gloomy <a href="https://openai.com/research/faulty-reward-functions">example</a> comes from OpenAI using RL to train an agent to play CoastRunners, a boat racing game. While humans tend to focus on speed and agility to outpace opponents, the RL agent discovered that endlessly circling to collect bonus items, without ever finishing the race, resulted in a score roughly 20% higher than human players typically achieve. Figure 2 shows an example of this behavior.
Victoria Krakovna maintains a <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml">list of many more examples of reward hacking</a> by RL and similar methods.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!UAuQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46ffb29-4676-4642-970b-bd8ccc3c6169_478x360.gif" alt="Boat racing agent circling to collect bonus items instead of finishing the race"><figcaption class="image-caption">Figure 2. Example of reward hacking by a Reinforcement Learning agent. Image credit: <a href="https://openai.com/research/faulty-reward-functions">OpenAI blog post</a></figcaption></figure><p>The second critical factor for overoptimization is <strong>distribution shift</strong>. In RLHF we use a learned reward model to provide feedback for the agent. The reward model is trained to approximate human preferences, but this approximation is always imperfect. <strong>The approximation might be good within the distribution of the reward model&#8217;s training data, but it starts breaking down as we move out of that distribution</strong>. Typically, in RLHF, we use the Supervised Fine-Tuned (SFT) checkpoint both to generate training data for the reward model and to initialize the LLM at the beginning of RLHF fine-tuning. In the first steps of RLHF fine-tuning, the LLM is still close to its initial state and its generations remain similar to the ones the reward model was trained on, making the reward scores generally trustworthy.
However, as the LLM undergoes further fine-tuning, it begins to produce outputs that diverge from the initial training data, forcing the reward model to rely increasingly on extrapolation, as shown in Figure 3.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!UIW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1fe078c-31a7-4270-81d1-0068bfd5ecbf_1787x1566.png" alt=""><figcaption class="image-caption">Figure 3. Conceptual image of distribution shift during RLHF fine-tuning</figcaption></figure><p><strong>A combination of reward hacking and distribution shift inevitably leads to overoptimization</strong>. It is easier for the RL agent to learn to exploit the inefficiencies of the reward model than to learn to genuinely improve text quality. Over time, real text quality starts breaking down while the reward model scores keep increasing. It&#8217;s easier to cheat on a test than to study for it, and well-executed cheating can earn a perfect score. <strong>This discrepancy highlights a fundamental challenge in RLHF: ensuring that the pursuit of high reward scores aligns with a genuine improvement in output quality.</strong></p><p>The most comprehensive study of RLHF overoptimization comes from an <a href="https://proceedings.mlr.press/v202/gao23h.html">OpenAI paper [1]</a>, which introduced a &#8220;gold reward model&#8221; framework: a large transformer model is treated as if it were the true source of human preferences, smaller reward models are trained on generations labeled by the &#8220;gold&#8221; model, and the scores of both the &#8220;gold&#8221; and learned reward models are tracked during RLHF fine-tuning. They find that <strong>RLHF (PPO) typically causes more overoptimization than best-of-n sampling</strong> (sketched below). Also, <strong>larger reward models and reward models trained on more data are less likely to be overoptimized</strong>, suggesting that the scale of the model and the size of its training data play crucial roles in mitigating the risks of overoptimization.</p>
<p>Having a good understanding of overoptimization, we can explore the <strong>most common strategies to counteract it</strong>. These methods <strong>mostly focus on limiting the distribution shift</strong>, as reward hacking is generally much harder to prevent without fundamental changes to the reward structure.</p><ol><li><p><strong>KL Regularization</strong>. All mainstream RLHF implementations use KL divergence from the SFT policy to regularize the training process and limit the extent of distribution shift. It&#8217;s typically implemented by setting a target value for the KL divergence and adaptively tuning the weight of the KL term in the reward to hit that target (see the sketch after this list). Some tuning might be required to find the optimal target KL value, one that strikes a good balance between preventing distribution shift and leaving the model enough room to fine-tune.</p></li><li><p><strong>Early Stopping</strong>. Another popular method is early stopping, or selecting an intermediate checkpoint. The idea is to stop training early enough that overoptimization hasn&#8217;t yet damaged the quality of the model. The OpenAI paper [1] shows that <strong>early stopping has a very similar effect to KL regularization</strong>, both on overoptimization and on the overall quality of the fine-tuned model. The main advantage of early stopping is that the training run finishes faster and uses fewer computational resources, which makes it a compelling way of fighting overoptimization, especially if you can find a good stopping policy. One practical option, suggested in the <a href="https://arxiv.org/pdf/2204.05862.pdf">Anthropic Helpful&amp;Harmless paper [2]</a>, is to use a separate &#8220;validation&#8221; reward model (trained on a held-out dataset) to score the generations during training; training is then stopped when the validation score peaks and starts decreasing.</p></li><li><p><strong>Uncertainty Quantification and Conservative Reward Models.</strong> The root cause of overoptimization is that, under distribution shift, the reward model makes wrong predictions with high confidence. <strong>Making the reward model output a confidence interval instead of a point prediction could help prevent reliance on overestimated scores.</strong> This requires epistemic uncertainty quantification methods, like ensembles or Bayesian neural networks. A recent <a href="https://arxiv.org/pdf/2310.02743.pdf">paper [3]</a> trains an ensemble of reward models and shows that using a <strong>conservative reward estimate</strong> (e.g. the lowest output of the ensemble members) prevents overoptimization and improves LLM quality.</p></li><li><p><strong>Constrained Optimization.</strong> Greed is the root of all evil, so we could try not being greedy and <strong>target a limited improvement to the reward scores</strong> instead of pushing them as high as they will go. A <a href="https://arxiv.org/pdf/2310.04373.pdf">paper [4]</a> proposes using constrained optimization to run multi-objective RLHF on several reward models simultaneously without causing overoptimization. As with KL regularization and early stopping, the main challenge is figuring out the right target for the reward score improvement.</p></li><li><p><strong>Offline Reinforcement Learning (RL).</strong> Learning from a limited set of logged data is a popular approach in RL. While typical PPO-based RLHF pipelines use online RL, <strong>the general idea of limiting the distribution shift applies to both Offline RL and overoptimization prevention</strong>. Papers like <a href="https://arxiv.org/abs/2206.11871">Implicit Language Q-Learning (ILQL) [5]</a> have applied Offline RL methods to RLHF, but without an explicit focus on overoptimization prevention. There might be an opportunity to apply popular Offline RL methods like <a href="https://arxiv.org/abs/2006.04779">Conservative Q-Learning (CQL) [6]</a> to RLHF.</p></li><li><p><strong>Improved Reward Model Generalization</strong>. The OpenAI paper [1] showed that using larger reward models and training them on more data makes these models less likely to be overoptimized. &#8220;More data&#8221; could mean one of two things: (a) more samples from the same distribution, or (b) a more diverse training data distribution. It seems reasonable to assume that training on more diverse datasets would make the reward model more generalizable, especially if some of the training data covers the regions into which RLHF fine-tuning is likely to push the LLM. <strong>This might be the most promising method for long-term improvements, since it relaxes the limit on how much the LLM can change during RLHF fine-tuning.</strong></p></li></ol>
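<p>As referenced in item 1, here is a minimal sketch of an adaptive KL controller in the style introduced in OpenAI&#8217;s &#8220;Fine-Tuning Language Models from Human Preferences&#8221; (Ziegler et al., 2019); the default hyperparameter values are illustrative assumptions, not recommendations:</p><pre><code>class AdaptiveKLController:
    """Nudge the KL weight so that the measured KL(policy || SFT policy)
    tracks a fixed target value over the course of training."""

    def __init__(self, init_kl_coef: float = 0.2, target_kl: float = 6.0, horizon: int = 10_000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> None:
        # proportional error, clipped to avoid violent coefficient swings
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + error * n_steps / self.horizon

# The penalty then enters the per-token reward, e.g.:
#   r_t = reward_model_score_t - kl_coef * (logp_policy_t - logp_sft_t)</code></pre>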
<p><strong>The first step in fighting overoptimization is detection</strong>, especially outside the controlled conditions of a &#8220;gold model&#8221; setup. Without a reliable benchmark for true quality, it can be very hard to tell whether reward scores are increasing due to genuine quality improvement or due to overoptimization. We can&#8217;t fight what we can&#8217;t see, so better ways of detecting overoptimization need to be developed and popularized. Two promising methods for enhancing detectability are:</p><ol><li><p><strong>Epistemic Uncertainty Measurement</strong>. An ensemble of reward models or a Bayesian neural network could be used to measure how confident the reward model is in its predictions. When confidence intervals get too wide, the model may be venturing into unfamiliar territory, signaling potential overoptimization (see the sketch after this list).</p></li><li><p><strong>Separate Reward Model for Evaluation</strong>. We can train a separate reward model to evaluate the quality of generated text during training. This model should be as different as possible from the main reward model used for RLHF training: it can be trained on separate data (a different split of a dataset, or a completely different dataset) and use different weight initialization and hyperparameter values. Such a model is unlikely to share the extrapolation errors of the main training reward model, so it can serve as an impartial evaluator.</p></li></ol>
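<p>A minimal sketch of the ensemble-based idea, which doubles as the conservative reward estimate from mitigation 3 above. This is an illustration, not code from [3], and the disagreement threshold is a made-up hyperparameter:</p><pre><code>import torch

def ensemble_reward(scores: torch.Tensor, disagreement_threshold: float = 1.0):
    """scores: [n_models] reward scores for one (prompt, response) pair,
    one score per independently trained reward model."""
    conservative = scores.min()               # pessimistic estimate, as in [3]
    uncertainty = scores.std(unbiased=False)  # spread as an epistemic-uncertainty proxy
    overoptimization_warning = bool(uncertainty > disagreement_threshold)
    return conservative, uncertainty, overoptimization_warning</code></pre>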
<p><strong>Takeaways</strong>:</p><ol><li><p>Reward model overoptimization happens during RLHF because RL finds a way to <strong>reward-hack an imperfect reward model</strong>, which crumbles under the induced <strong>distribution shift</strong>.</p></li><li><p>Many <strong>ways to minimize overoptimization</strong> have been developed, most of them focused on <strong>limiting the distribution shift</strong> away from the model that generated the reward model&#8217;s training data. However, <strong>improving the reward model&#8217;s generalization properties might be the most promising long-term direction</strong>.</p></li><li><p>Overoptimization is hard to detect. <strong>Special measures need to be taken to detect and measure it.</strong></p></li></ol><p>References:</p><ol><li><p><a href="https://proceedings.mlr.press/v202/gao23h.html">Gao, Leo, John Schulman, and Jacob Hilton. "Scaling laws for reward model overoptimization." In International Conference on Machine Learning, pp. 10835-10866. PMLR, 2023.</a></p></li><li><p><a href="https://arxiv.org/pdf/2204.05862.pdf">Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv preprint arXiv:2204.05862 (2022).</a></p></li><li><p><a href="https://arxiv.org/pdf/2310.02743.pdf">Coste, Thomas, Usman Anwar, Robert Kirk, and David Krueger. "Reward model ensembles help mitigate overoptimization." arXiv preprint arXiv:2310.02743 (2023).</a></p></li><li><p><a href="https://arxiv.org/pdf/2310.04373.pdf">Moskovitz, Ted, Aaditya K. Singh, D. J. Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, and Stephen McAleer. "Confronting reward model overoptimization with constrained RLHF." arXiv preprint arXiv:2310.04373 (2023).</a></p></li><li><p><a href="https://arxiv.org/abs/2206.11871">Snell, Charlie, Ilya Kostrikov, Yi Su, Mengjiao Yang, and Sergey Levine. "Offline RL for natural language generation with implicit language Q-learning." arXiv preprint arXiv:2206.11871 (2022).</a></p></li><li><p><a href="https://arxiv.org/abs/2006.04779">Kumar, Aviral, Aurick Zhou, George Tucker, and Sergey Levine. "Conservative Q-learning for offline reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 1179-1191.</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Reward Modeling for RLHF]]></title><description><![CDATA[An introduction to reward models]]></description><link>https://www.reinforced.info/p/reward-modeling-for-rlhf</link><guid isPermaLink="false">https://www.reinforced.info/p/reward-modeling-for-rlhf</guid><dc:creator><![CDATA[Alex Nikulkov]]></dc:creator><pubDate>Wed, 10 Jan 2024 06:05:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zQUN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe278c44c-94a6-4b8f-ad1d-46ec1d13b38b_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!zQUN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe278c44c-94a6-4b8f-ad1d-46ec1d13b38b_1024x1024.png" alt=""></figure>
15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Introduction</h2><p>Reward modeling - it&#8217;s an essential part of RLHF training pipelines, yet it doesn&#8217;t get even a fraction of the attention of other LLM topics like prompting, supervised fine-tuning and data collection. In this post we&#8217;ll take a closer look at reward modeling and shed some light on this dark corner of RLHF. All top-performing Large Language Models (LLMs) like <a href="https://openai.com/research/gpt-4">GPT-4</a> and <a href="https://arxiv.org/abs/2307.09288">Llama 2</a> used reward models as part of their training pipelines. A notable exception from this are recent LLMs trained with Direct Preference Optimization (<a href="https://arxiv.org/abs/2305.18290">DPO</a>), which doesn&#8217;t require a separate reward model for LLM training - more on that later.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yJia!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yJia!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 424w, https://substackcdn.com/image/fetch/$s_!yJia!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 848w, https://substackcdn.com/image/fetch/$s_!yJia!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 1272w, https://substackcdn.com/image/fetch/$s_!yJia!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yJia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png" width="614" height="358.8695054945055" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:851,&quot;width&quot;:1456,&quot;resizeWidth&quot;:614,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yJia!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 424w, 
https://substackcdn.com/image/fetch/$s_!yJia!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 848w, https://substackcdn.com/image/fetch/$s_!yJia!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 1272w, https://substackcdn.com/image/fetch/$s_!yJia!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4137d311-252f-46ef-8b31-56169c6d7873_1600x935.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Typical RLHF training pipeline. Step 2 is reward model training. Credit: <a href="https://arxiv.org/pdf/2203.02155.pdf">InstructGPT paper</a></figcaption></figure></div><p>Reward modeling is an intermediate step in the LLM training pipeline and the reward model usually isn&#8217;t published alongside the trained language model. Think of a reward model as the movie director - you can&#8217;t see it in the final product, yet its impact is undeniable. Quoting from the Llama 2 paper: &#8220;<em>We note that reward model accuracy is one of the most important proxies for the final performance of Llama 2-Chat</em>&#8221;. This will make sense once we understand the role which the reward model plays in the RLHF training pipeline. 
<p>From the perspective of classical Reinforcement Learning, reward models play a role similar to that of the critic in actor-critic algorithms. The main difference is that critics are usually trained online, in parallel with the actor, while reward models are usually trained offline and held fixed during LLM fine-tuning (or re-trained a handful of times if several rounds of human feedback collection are performed, as in the <a href="https://arxiv.org/abs/2307.09288">Llama 2 paper</a>).</p><p>So what is a reward model? It takes a prompt and a response as inputs and returns a single scalar: the predicted quality of the response. It is based on the same architecture as the language model (a transformer), but the unembedding (output) linear layer of size (Embed_dim, Vocab_size) is replaced by another linear layer of size (Embed_dim, 1), which outputs the scalar predicted reward (see Figure 2). Optionally, a special reward readout token can be used. The predicted reward is read out from the output of the reward head at the last token of the response. Since the reward is read out just once for the whole response, the model can only score complete responses, not individual tokens or partial responses. The reward model weights are initialized from a pre-trained (e.g. SFT) language model to speed up training.</p>
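<p>To make the architecture concrete, here is a minimal PyTorch sketch of this design. The backbone, its hidden size, and the calling convention are placeholder assumptions; a real implementation would wrap an existing pre-trained transformer rather than an abstract module.</p><pre><code class="language-python">import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (prompt + response) token sequence to a single scalar reward."""

    def __init__(self, backbone: nn.Module, embed_dim: int):
        super().__init__()
        # `backbone` stands in for a pre-trained transformer (e.g. the SFT
        # model minus its unembedding layer); it is assumed to return hidden
        # states of shape (batch, seq_len, embed_dim).
        self.backbone = backbone
        # The (Embed_dim, Vocab_size) unembedding layer is replaced by a
        # (Embed_dim, 1) linear reward head.
        self.reward_head = nn.Linear(embed_dim, 1)

    def forward(self, tokens: torch.Tensor, last_token_idx: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(tokens)                    # (batch, seq_len, embed_dim)
        batch_idx = torch.arange(tokens.size(0))
        # Read the reward out at the last token of each response.
        last_hidden = hidden[batch_idx, last_token_idx]   # (batch, embed_dim)
        # One scalar per sequence: the predicted quality of the whole response.
        return self.reward_head(last_hidden).squeeze(-1)  # (batch,)
</code></pre><p>Because only the last token&#8217;s hidden state feeds the head, the score necessarily applies to the response as a whole.</p>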
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Reward model architecture (with optional reward readout tokens). Credit: <a href="https://karpathy.ai/stateofgpt.pdf">&#8220;State of GPT&#8221;</a> talk by Andrej Karpathy</figcaption></figure></div><h2>Reward Model Applications</h2><p>To better understand the role of the reward model in RLHF training, let&#8217;s take a closer look at Proximal Policy Optimization (<a href="https://arxiv.org/abs/1707.06347">PPO</a>) - the de-facto standard RLHF algorithm. PPO is an online RL algorithm - it requires live interaction with the environment to close the state-&gt;action-&gt;reward loop. The environment takes action as input and returns the reward and next state. To truly align with human preferences, we could show the model&#8217;s responses in real time to human raters and use their ratings as reward for training (with a caveat that PPO needs scalar reward, so we&#8217;d need some processing to turn relative preferences into scalar rewards). But this would be logistically impractical, so <strong>the trained reward model is used as a proxy for the environment during PPO training</strong> and we pretend that the reward predicted by the reward model is the true quality of the response. This points to one of the biggest problems in RLHF - the trained reward model is an imperfect representation of true human preferences and it&#8217;s easily overoptimized by RL, which is known to exploit imperfections in reward model definition in the spirit of <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart&#8217;s law</a>. More on this at the end of the post (and even more in a separate upcoming post). Figure 3 shows how the reward model plugs into the PPO training loop.</p><p>Reward model has some similarity to model-based RL, but the similarity is superficial because in RLHF the reward model is used to score complete responses, while in model-based RL the models are used for multi-step trajectory planning. 
While the application of reward models to LLM tuning is a recent phenomenon, reward modeling for RLHF was first demonstrated at scale back in 2017 in an <a href="https://arxiv.org/abs/1706.03741">OpenAI paper</a> in application to classical RL environments like Atari.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M69c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M69c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 424w, https://substackcdn.com/image/fetch/$s_!M69c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 848w, https://substackcdn.com/image/fetch/$s_!M69c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 1272w, https://substackcdn.com/image/fetch/$s_!M69c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M69c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png" width="1456" height="306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M69c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 424w, https://substackcdn.com/image/fetch/$s_!M69c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 848w, https://substackcdn.com/image/fetch/$s_!M69c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 1272w, https://substackcdn.com/image/fetch/$s_!M69c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3f61-5573-4587-ae27-d24c43a62967_1600x336.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 3. 
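<p>The following sketch shows where the reward model sits in the loop of Figure 3. It is a simplified data-collection step only, not a full PPO implementation; the <code>policy</code>, <code>tokenizer</code>, and the assumption that <code>generate</code> returns just the new response tokens are all illustrative, and <code>reward_model</code> is the <code>RewardModel</code> from the earlier sketch.</p><pre><code class="language-python">import torch

def collect_rollouts(policy, reward_model, prompts, tokenizer):
    """One rollout-collection step with the reward model standing in
    for the environment."""
    rollouts = []
    for prompt in prompts:
        prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
        # The whole generated response is treated as a single "action"
        # (the bandit formulation discussed later in the post).
        response_ids = policy.generate(prompt_ids)  # assumed: new tokens only
        full_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        with torch.no_grad():
            # The reward model plays the environment's role: it maps
            # (prompt, response) to a scalar reward.
            reward = reward_model(full_ids, full_ids.size(-1) - 1)
        rollouts.append((prompt_ids, response_ids, reward))
    return rollouts  # fed into a standard PPO update
</code></pre>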
<p>Reward models have a few more important applications: (1) turning relative preferences into scalar rewards; (2) offline evaluation; (3) ranking. First, typical RLHF pipelines collect human feedback in the form of relative preferences (response A &gt; response B), while PPO requires a scalar reward. We can use the <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">Bradley-Terry model</a> to turn these relative preferences into continuous scalar rewards (more on this in the next section). Second, the reward model estimates the quality of a response, so it can be used for offline evaluation of language models. The main thing to be careful about here is data leakage: the reward model used for offline evaluation should be trained on a separate dataset with no overlap with the LLM training data. Finally, the reward model can be used to rank multiple LLM generations and choose the best one, sometimes known as &#8220;best-of-n&#8221; in the RLHF literature (e.g. the <a href="https://arxiv.org/pdf/2304.05302.pdf">RRHF paper</a>). The ranking can be done either at inference time (pros: no LLM fine-tuning required; cons: high inference compute cost, because multiple generations are required) or at training time (e.g. the <a href="https://arxiv.org/pdf/2304.06767.pdf">RAFT paper</a>).</p><h2>Reward Model Training</h2><p>The most popular method of training the reward model is based on the <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">Bradley-Terry model</a>, which assumes that each response has an unobserved intrinsic quality <em>u</em>, and that the probability of a rater preferring option <em>i</em> to option <em>j</em> is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Pr\\{i\\: \\succ \\:j\\}=\\frac{\\exp(u_i)}{\\exp(u_i) + \\exp(u_j)}=\\sigma(u_i-u_j)&quot;,&quot;id&quot;:&quot;MZAHGBKHET&quot;}" data-component-name="LatexBlockToDOM"></div><p>This probability can be incorporated into a log-likelihood loss to train the reward model on pairs of responses to the same prompt, labeled as better/worse. Note that in the Bradley-Terry model the qualities are defined only up to an additive constant, because <em>(u_i - u_j) = ((u_i + C) - (u_j + C))</em>. This means that an individual quality score isn&#8217;t interpretable by itself and only has meaning when compared to other scores. In practice, after the reward model is trained, its output can be normalized by adding a bias term so that the reward scores have zero mean on the training data. This isn&#8217;t strictly necessary, but normalizing the scores can help reduce variance during PPO training.</p><p>The Bradley-Terry model is closely related to the <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo rating system</a>, and reward model scores have the same properties as Elo scores (up to a scale/offset transformation). There are several more advanced implementation options for the Bradley-Terry model. In the <a href="https://arxiv.org/pdf/2203.02155.pdf">InstructGPT paper</a>, labelers are given K &gt; 2 responses and asked to rank the whole list, resulting in <em>(K choose 2)</em> pairwise comparisons, each of which is represented by a separate term in the loss function. This is reported to speed up data collection because each response can be compared to several other responses independently.</p>
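<p>Concretely, the pairwise loss can be sketched as follows. This is a minimal version assuming the <code>RewardModel</code> from earlier; the construction of the <em>(K choose 2)</em> pairs is left out, and the <code>margin</code> argument (my naming) anticipates the Llama 2 variant discussed next.</p><pre><code class="language-python">import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_ids, rejected_ids,
                       chosen_last, rejected_last, margin=0.0):
    """Negative log-likelihood of observed preferences under Bradley-Terry.

    chosen/rejected are (prompt + response) token batches for the preferred
    and dispreferred responses to the same prompts; *_last are the indices
    of each response's final token.
    """
    r_chosen = reward_model(chosen_ids, chosen_last)       # (batch,)
    r_rejected = reward_model(rejected_ids, rejected_last) # (batch,)
    # Pr{chosen > rejected} = sigmoid(r_chosen - r_rejected), so maximizing
    # its log-likelihood means minimizing -logsigmoid of the difference.
    # margin > 0 gives the Llama-2-style confidence margin.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
</code></pre>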
In <a href="https://arxiv.org/pdf/2307.09288.pdf">Llama 2 paper</a>, confidence-based margin is added to the loss. The intuition here is that the difference between the reward model scores of better and worse options should be higher for pairs in which we have high confidence in their relative order, and can be lower if we&#8217;re not sure which of the options is really better.</p><p>An alternative way to train a reward model could use a regression loss with scalar reward labels. For example, if we want to train an LLM to write tweets which get as many likes as possible, we could use the number of likes on a tweet as the reward label and train a reward model to predict this continuous label using an MSE loss. Another possible option is to use the outputs of existing classifiers as reward scores. For example, in <a href="https://arxiv.org/pdf/2212.09611.pdf">this paper</a> an existing aesthetic classifier was used as a reward model for PPO training. Similarly, existing classifiers can be used as reward models to fine-tune the models to reduce hallucinations, improve conciseness, etc. Using feedback from classifier as a reward isn&#8217;t technically RLHF, but it&#8217;s close enough in spirit and in implementation details.</p><h2>RLHF without Reward Models</h2><p>But do we REALLY need the reward model for RLHF? The authors of <a href="https://arxiv.org/abs/2305.18290">DPO paper</a> show an attractive way to do something that looks very similar to RLHF (the debate is still on about whether DPO matches the &#8220;RL&#8221; part of RLHF), but without a separate reward model. Instead, they use a clever mathematical trick to derive the implicit reward model scores based on the LLM token likelihoods. This trick allows them to derive a loss which looks like a mix between reward model and language model losses. The jury is still out on whether DPO is the future of RLHF, but we start seeing more and more state-of-the-art LLMs like <a href="https://arxiv.org/abs/2310.16944">Zephyr</a> and <a href="https://arxiv.org/abs/2312.15166">Solar</a> trained with DPO instead of PPO. The DPO vs PPO debate deserves more space than I can afford here, but Nathan Lambert has several very insightful blog posts about DPO: <a href="https://www.interconnects.ai/p/the-dpo-debate">blog post 1</a>, <a href="https://www.interconnects.ai/p/rlhf-progress-scaling-dpo-to-70b">blog post 2</a> - I highly recommend them to anyone who&#8217;s interested in this topic.</p><h2>Open Questions</h2><p>Reward modeling for RLHF appeared relatively recently and has received less attention than language model training, so there are still plenty of open questions left. The most interesting ones, in my opinion, are:</p><ol><li><p><strong>Credit assignment</strong>. The reward model evaluates complete responses, but they are actually generated token-by-token. The current PPO implementation used in RLHF side-steps this problem by using a &#8220;bandit&#8221; formulation, in which the whole response is considered to be a single action by the agent. If we could assign granular rewards to individual tokens (or sub-sequences) in the response, we could use more powerful RL methods to perform forward-looking planning during token generation.</p></li><li><p><strong>Preventing overoptimization</strong>. Naively applying PPO to a trained reward model will result in reward model scores increasing significantly during training, while the quality of the text will actually degrade. This is a result of overoptimizing to an imperfect learned reward model. 
<h2>Open Questions</h2><p>Reward modeling for RLHF appeared relatively recently and has received less attention than language model training, so plenty of open questions remain. The most interesting ones, in my opinion, are:</p><ol><li><p><strong>Credit assignment</strong>. The reward model evaluates complete responses, but responses are generated token-by-token. The PPO implementation currently used in RLHF side-steps this problem with a &#8220;bandit&#8221; formulation, in which the whole response is treated as a single action by the agent. If we could assign granular rewards to individual tokens (or sub-sequences) in the response, we could use more powerful RL methods to perform forward-looking planning during token generation.</p></li><li><p><strong>Preventing overoptimization</strong>. Naively applying PPO to a trained reward model will drive reward model scores up significantly during training while the quality of the text actually degrades, a result of overoptimizing an imperfect learned reward model. Preventing this overoptimization is one of the biggest open questions in RLHF right now. A wide range of approaches has been proposed, but it&#8217;s unclear which works best: regularization (a KL divergence penalty, shown in the sketch after this list, or supervised/unsupervised language modeling losses), model ensembles, constrained optimization, early stopping. This is a deep and interesting question, deserving a dedicated post. Stay tuned!</p></li><li><p><strong>Beyond Bradley-Terry</strong>. The ubiquitous Bradley-Terry model makes several simplifying assumptions: (1) pairwise preferences are determined by underlying single-item qualities; (2) those qualities are scalars (no multidimensional preferences); (3) pairwise preferences are stochastic, with probabilities equal to the sigmoid of the difference in qualities. While these simple assumptions are a good place to start, <a href="https://books.google.com/books?hl=en&amp;lr=&amp;id=SHvzzuCnuv8C&amp;oi=fnd&amp;pg=PP2&amp;dq=Thinking,+Fast+and+Slow&amp;ots=NUohUG0kDF&amp;sig=6CZcjaOSz0RDLJt-9R7epRcUAs0#v=onepage&amp;q=Thinking%2C%20Fast%20and%20Slow&amp;f=false">Behavioral Economics has conclusively shown that human preferences are complex</a> and can include effects like anchoring, framing, and loss aversion that contradict the Bradley-Terry model. A few recent papers, such as <a href="https://arxiv.org/abs/2310.12036">IPO</a> and <a href="https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf">HALO</a>, have relaxed these assumptions and shown promising results.</p></li><li><p><strong>Evaluation</strong>. How do we know if a reward model is good enough? The most common metric is pairwise accuracy: the fraction of response pairs in which the better option receives a higher reward model score than the worse one. This is useful, but it misses important aspects like the magnitude of the score difference. It&#8217;s also unclear what kind of evaluation metric could tell us specifically whether a reward model is suitable as the reward for PPO training.</p></li></ol>
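<p>As promised in point 2 above, here is a minimal sketch of the most common regularizer: penalizing the sequence-level KL divergence from the reference (SFT) policy, as in InstructGPT-style pipelines. The coefficient value and function name are illustrative, and practical implementations often apply the penalty per token rather than per sequence.</p><pre><code class="language-python">def penalized_reward(rm_score, policy_logps, ref_logps, kl_coef=0.02):
    """Reward-model score minus a KL penalty toward the reference policy.

    policy_logps / ref_logps: summed log-probabilities of the sampled
    response under the tuned policy and the frozen reference model.
    """
    # Sample-based estimate of KL(policy || reference) for this response.
    kl_estimate = policy_logps - ref_logps
    # Penalizing the KL keeps PPO from drifting into regions where the
    # learned reward model is no longer trustworthy.
    return rm_score - kl_coef * kl_estimate
</code></pre>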
<h2>References</h2><ol><li><p><a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html">Ouyang, Long, et al. &#8220;Training language models to follow instructions with human feedback.&#8221; <em>Advances in Neural Information Processing Systems</em> 35 (2022): 27730-27744.</a></p></li><li><p><a href="https://arxiv.org/abs/2307.09288">Touvron, Hugo, et al. &#8220;Llama 2: Open foundation and fine-tuned chat models.&#8221; <em>arXiv preprint arXiv:2307.09288</em> (2023).</a></p></li><li><p><a href="https://arxiv.org/abs/2305.18290">Rafailov, Rafael, et al. &#8220;Direct preference optimization: Your language model is secretly a reward model.&#8221; <em>arXiv preprint arXiv:2305.18290</em> (2023).</a></p></li><li><p><a href="https://proceedings.neurips.cc/paper_files/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html">Christiano, Paul F., et al. &#8220;Deep reinforcement learning from human preferences.&#8221; <em>Advances in Neural Information Processing Systems</em> 30 (2017).</a></p></li><li><p><a href="https://arxiv.org/abs/2310.16944">Tunstall, Lewis, et al. &#8220;Zephyr: Direct distillation of LM alignment.&#8221; <em>arXiv preprint arXiv:2310.16944</em> (2023).</a></p></li><li><p><a href="https://arxiv.org/abs/2312.15166">Kim, Dahyun, et al. &#8220;SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling.&#8221; <em>arXiv preprint arXiv:2312.15166</em> (2023).</a></p></li><li><p><a href="https://arxiv.org/abs/2304.06767">Dong, Hanze, et al. &#8220;RAFT: Reward ranked finetuning for generative foundation model alignment.&#8221; <em>arXiv preprint arXiv:2304.06767</em> (2023).</a></p></li><li><p><a href="https://arxiv.org/abs/2304.05302">Yuan, Hongyi, et al. &#8220;RRHF: Rank Responses to Align Language Models with Human Feedback.&#8221; <em>Thirty-seventh Conference on Neural Information Processing Systems</em> (2023).</a></p></li><li><p><a href="https://arxiv.org/abs/2212.09611">Hao, Yaru, et al. &#8220;Optimizing prompts for text-to-image generation.&#8221; <em>arXiv preprint arXiv:2212.09611</em> (2022).</a></p></li><li><p><a href="https://books.google.com/books?hl=en&amp;lr=&amp;id=SHvzzuCnuv8C&amp;oi=fnd&amp;pg=PP2&amp;dq=Thinking,+Fast+and+Slow&amp;ots=NUohUG0kDF&amp;sig=6CZcjaOSz0RDLJt-9R7epRcUAs0#v=onepage&amp;q=Thinking%2C%20Fast%20and%20Slow&amp;f=false">Kahneman, D. <em>Thinking, Fast and Slow</em>. Macmillan, 2011.</a></p></li><li><p><a href="https://arxiv.org/abs/2310.12036">Azar, Mohammad Gheshlaghi, et al. &#8220;A general theoretical paradigm to understand learning from human preferences.&#8221; <em>arXiv preprint arXiv:2310.12036</em> (2023).</a></p></li></ol>
]]></content:encoded></item><item><title><![CDATA[Hello World]]></title><description><![CDATA[Welcome to Reinforced!]]></description><link>https://www.reinforced.info/p/hello-world</link><guid isPermaLink="false">https://www.reinforced.info/p/hello-world</guid><dc:creator><![CDATA[Alex Nikulkov]]></dc:creator><pubDate>Sun, 07 Jan 2024 06:23:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GUm7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a68ec49-87ac-4d28-b777-dd40433fa52b_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Welcome to Reinforced!</h2><p>If you want to learn more about Reinforcement Learning, Generative AI, and Reinforcement Learning from Human Feedback (RLHF), you&#8217;ve come to the right place. In this blog/newsletter I will help you improve your understanding of how Reinforcement Learning powers the ongoing Generative AI revolution. You can expect deep dives into the most common RLHF methods and components, overviews of new research papers, and more.</p><p>First post coming soon!</p><figure><figcaption class="image-caption">Illustration: a robot typing on a typewriter</figcaption></figure>
]]></content:encoded></item></channel></rss>