Bandits vs Reinforcement Learning from Human Feedback
Are single-step or multi-step models better suited for RLHF? This post covers the main differences between them.
Apr 30, 2024 • Alex Nikulkov
Reward Model Overoptimization: Root Causes and Mitigations
When I first ran an RLHF training job, I was surprised at how easily the reward model scores increased during the training process.
Apr 7, 2024 • Alex Nikulkov
January 2024
Reward Modeling for RLHF
An introduction to reward models
Jan 10, 2024 • Alex Nikulkov
Hello World
Welcome to Reinforced!
Jan 7, 2024 • Alex Nikulkov