
Results 8 comments of gauravpandeyamu

That's a good catch. While the kl1 estimator is a "decent" estimator of the KL divergence, its gradient is not the "correct" estimator of the gradient of the KL divergence. I suspect that...
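To make this concrete, here is a short derivation (a sketch, assuming on-policy sampling from $\pi_\theta$ and that the sampled tokens are treated as constants when differentiating):

$$
\begin{aligned}
\mathrm{KL}(\pi_\theta \,\|\, \pi_{ref}) &= E_{x\sim\pi_\theta}\left[\log\frac{\pi_\theta(x)}{\pi_{ref}(x)}\right], \qquad k_1(x) = \log\frac{\pi_\theta(x)}{\pi_{ref}(x)},\\
\nabla_\theta\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{ref}) &= E_{x\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(x)\, \log\frac{\pi_\theta(x)}{\pi_{ref}(x)}\right] + \underbrace{E_{x\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(x)\right]}_{=\,0},\\
E_{x\sim\pi_\theta}\left[\nabla_\theta k_1(x)\right] &= E_{x\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(x)\right] = 0.
\end{aligned}
$$

So differentiating the kl1 estimate while holding the samples fixed gives a gradient whose expectation is zero, not $\nabla_\theta\,\mathrm{KL}$.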

Interesting. I agree that it might work if added to the loss directly. I have modified my [local fork](https://github.com/gauravpandeyamu/open-instruct/blob/main/open_instruct/grpo_vllm_thread_ray_gtrl.py#L1172C46-L1172C58) with the following changes: # kl loss should be computed without...
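For reference, the general pattern being discussed looks roughly like the sketch below. This is not the actual diff in the fork; the names `pg_loss`, `logprobs`, `ref_logprobs`, `mask`, and `kl_coef` are placeholders I am assuming here.

```python
import torch

def policy_loss_with_kl(pg_loss: torch.Tensor,
                        logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        mask: torch.Tensor,
                        kl_coef: float = 0.05) -> torch.Tensor:
    """Add a per-token KL penalty directly to the policy-gradient loss,
    instead of folding it into the reward.

    logprobs / ref_logprobs: log-probs of the sampled tokens under the
    current and reference policies; mask selects response tokens.
    """
    # k3-style estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, always >= 0
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    kl_loss = (kl * mask).sum() / mask.sum().clamp(min=1)
    return pg_loss + kl_coef * kl_loss
```

Whether the gradient of this extra term matches the gradient of the intended divergence depends on which quantities are detached, which is exactly the point being discussed in the thread.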

Since RLVR looks only at the final answer, completely ignoring the generated text, the model can be motivated to generate text that deviates from the reference policy as long...

Ahh, yes. Now, the graphs of kl1, kl2, kl3 and kl4 make perfect sense. As for why the kl1 estimator is a bad estimator, and how multiplying by the ratio fixes...
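For context, the standard per-token estimators behind those curves can be written as in the sketch below (following Schulman's "Approximating KL Divergence" note; open-instruct's exact kl4 variant is not reproduced here, and the function name is my own).

```python
import torch

def kl_estimators(logprob: torch.Tensor, ref_logprob: torch.Tensor):
    """Per-token estimators of KL(pi_theta || pi_ref) from log-probs of
    tokens sampled from pi_theta."""
    log_ratio = logprob - ref_logprob        # log(pi_theta / pi_ref)
    ratio = torch.exp(-log_ratio)            # pi_ref / pi_theta
    kl1 = log_ratio                          # unbiased, high variance
    kl2 = 0.5 * log_ratio ** 2               # biased, lower variance
    kl3 = ratio - 1.0 + log_ratio            # unbiased, low variance, always >= 0
    return kl1, kl2, kl3
```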

> > While the kl1 estimator is a "decent" estimator of the KL divergence, its gradient is not the "correct" estimator of the gradient of the KL divergence.
>
> If I am...

Also worth noting that $E_{\pi_t} \left[\frac{\pi_\theta}{\pi_{ref}} -\log \frac{\pi_\theta}{\pi_{ref}} - 1\right]$ is a valid divergence, and the kl3 estimator and its gradient are unbiased estimators of this divergence and its gradient...
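A quick way to see that this is a valid divergence (a sketch, using only that $f(x) = x - \log x - 1 \ge 0$ for all $x > 0$, with equality iff $x = 1$):

$$
E_{\pi_t}\left[\frac{\pi_\theta}{\pi_{ref}} - \log\frac{\pi_\theta}{\pi_{ref}} - 1\right] \;\ge\; 0,
$$

with equality iff $\pi_\theta = \pi_{ref}$ on the support of $\pi_t$. And because the sampling distribution $\pi_t$ does not depend on $\theta$, the sample average of the integrand and its gradient are unbiased for this expectation and its gradient.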

I agree that kl2 is a more principled choice for optimization. There are works that explore f-divergences in the PPO objective, e.g. https://arxiv.org/pdf/2309.16240