
Results 8 comments of gauravpandeyamu

That's a good catch. While the kl1 estimator is a "decent" estimator of the KL divergence, its gradient is not the "correct" estimator of the gradient of the KL divergence. I suspect that...
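To make this concrete, here is a short derivation (a sketch, assuming on-policy sampling from $\pi_\theta$ and that the sampled tokens are treated as constants when differentiating):

$$
\begin{aligned}
\mathrm{KL}(\pi_\theta \,\|\, \pi_{ref}) &= E_{x\sim\pi_\theta}\left[\log\frac{\pi_\theta(x)}{\pi_{ref}(x)}\right], \qquad k_1(x) = \log\frac{\pi_\theta(x)}{\pi_{ref}(x)},\\
\nabla_\theta\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{ref}) &= E_{x\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(x)\, \log\frac{\pi_\theta(x)}{\pi_{ref}(x)}\right] + \underbrace{E_{x\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(x)\right]}_{=\,0},\\
E_{x\sim\pi_\theta}\left[\nabla_\theta k_1(x)\right] &= E_{x\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(x)\right] = 0.
\end{aligned}
$$

So differentiating the kl1 estimate while holding the samples fixed gives a gradient whose expectation is zero, not $\nabla_\theta\,\mathrm{KL}$.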

Interesting. I agree that it might work if added to the loss directly. I have modified my [local fork](https://github.com/gauravpandeyamu/open-instruct/blob/main/open_instruct/grpo_vllm_thread_ray_gtrl.py#L1172C46-L1172C58) with the following changes: # kl loss should be computed without...
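For reference, the general pattern being discussed looks roughly like the sketch below. This is not the actual diff in the fork; the names `pg_loss`, `logprobs`, `ref_logprobs`, `mask`, and `kl_coef` are placeholders I am assuming here.

```python
import torch

def policy_loss_with_kl(pg_loss: torch.Tensor,
                        logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        mask: torch.Tensor,
                        kl_coef: float = 0.05) -> torch.Tensor:
    """Add a per-token KL penalty directly to the policy-gradient loss,
    instead of folding it into the reward.

    logprobs / ref_logprobs: log-probs of the sampled tokens under the
    current and reference policies; mask selects response tokens.
    """
    # k3-style estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, always >= 0
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    kl_loss = (kl * mask).sum() / mask.sum().clamp(min=1)
    return pg_loss + kl_coef * kl_loss
```

Whether the gradient of this extra term matches the gradient of the intended divergence depends on which quantities are detached, which is exactly the point being discussed in the thread.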

Since RLVR looks only at the final answer, completely ignoring the generated text, the model can be motivated to generate text that deviates from the reference policy as long...

Ahh, yes. Now, the graphs of kl1, kl2, kl3 and kl4 make perfect sense. As for why the kl1 estimator is a bad estimator, and how multiplying by the ratio fixes...
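For context, the standard per-token estimators behind those curves can be written as in the sketch below (following Schulman's "Approximating KL Divergence" note; open-instruct's exact kl4 variant is not reproduced here, and the function name is my own).

```python
import torch

def kl_estimators(logprob: torch.Tensor, ref_logprob: torch.Tensor):
    """Per-token estimators of KL(pi_theta || pi_ref) from log-probs of
    tokens sampled from pi_theta."""
    log_ratio = logprob - ref_logprob        # log(pi_theta / pi_ref)
    ratio = torch.exp(-log_ratio)            # pi_ref / pi_theta
    kl1 = log_ratio                          # unbiased, high variance
    kl2 = 0.5 * log_ratio ** 2               # biased, lower variance
    kl3 = ratio - 1.0 + log_ratio            # unbiased, low variance, always >= 0
    return kl1, kl2, kl3
```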

> > While the kl1 estimator is a "decent" estimator of the KL divergence, its gradient is not the "correct" estimator of the gradient of the KL divergence.
>
> If I am...

Also worth noting that $E_{\pi_t} \left[\frac{\pi_\theta}{\pi_{ref}} -\log \frac{\pi_\theta}{\pi_{ref}} - 1\right]$ is a valid divergence, and the kl3 estimator and its gradient are unbiased estimators of this divergence and its gradient...
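A quick way to see that this is a valid divergence (a sketch, using only that $f(x) = x - \log x - 1 \ge 0$ for all $x > 0$, with equality iff $x = 1$):

$$
E_{\pi_t}\left[\frac{\pi_\theta}{\pi_{ref}} - \log\frac{\pi_\theta}{\pi_{ref}} - 1\right] \;\ge\; 0,
$$

with equality iff $\pi_\theta = \pi_{ref}$ on the support of $\pi_t$. And because the sampling distribution $\pi_t$ does not depend on $\theta$, the sample average of the integrand and its gradient are unbiased for this expectation and its gradient.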

I agree that kl2 is a more principled choice for optimization. There are works that explore f-divergences in the PPO objective, e.g. https://arxiv.org/pdf/2309.16240