Discussion about this post

MatthewK:

I’m expecting real RL to make a comeback as the limitations of contextual bandits become clear (e.g., RL can’t teach models to do new things they didn’t already do with some probability).

I hope you return to make more posts!

Daniel Popescu / ⧉ Pluralisk:

This article comes at the perfect time; it's genuinely insightful to see someone question the PPO dogma. Many of us have been wondering whether the algorithmic overhead was truly justified for LLM fine-tuning, or whether we were just trying to fit a square peg into a robotics-shaped hole.
