When I first ran an RLHF training job, I was surprised at how easily the reward model scores increased during the training process.