Discussion about this post

Neural Foundry

The "rich get richer" framing for positive on-policy gradients clicked for me. I've noticed in some training runs that early performance improvemnts on pass@1 look great initially, but then diversity absolutely tanks and the model starts repeating the same solutions even when they're wrong. The negative gradient discussion helps explain why NSR might be underused in practice, people naturally gravitate toward reinforcing what works rather than pruning what dosn't. One thing I'd be curious about is whether theres a sweet spot ratio between positive and negative examples that maintains enough exploitation for quick gains but keeps exploration alive.

Zhenyu Liao

Read through the whole document and it's so well organized and insightful, as always. Thanks, Alex!

Merry Christmas!

