trl
Why is kl = nan during GRPO training?
The question is described here: https://github.com/huggingface/open-r1/issues/704. Can somebody help me?
The `beta` parameter of `GRPOConfig` (see the documentation) must be explicitly set to a non-zero value. Only then is the reference model loaded and the KL value reported.
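For example, here is a minimal sketch of the config change, assuming a TRL version whose `GRPOConfig` accepts `beta`. The `output_dir` name and the `0.04` coefficient are placeholder values I picked for illustration, not values taken from the issue:

```python
from trl import GRPOConfig

# Assumption: any non-zero beta enables the KL term; 0.04 is only an
# illustrative value, tune it for your own run.
config = GRPOConfig(
    output_dir="grpo-output",  # hypothetical output directory
    beta=0.04,                 # non-zero, so the reference model is loaded
)

print(config.beta)
```

Pass this object as `args=config` when you construct `GRPOTrainer`. If `kl` still shows up as nan after that, the cause is probably something else.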