Is the training on-policy?
Each step conducts a rollout, so I guess the training is on-policy.
I guess so.
Is it possible to make it off-policy?
@XianglongTan
What kind of off-policy training would you like to have? Trajectories collected in one iteration can be split into multiple mini-batches, which is a mild form of off-policy training; this is easy to do since it is already supported by verl. In a broader sense, collecting data with any other policy (an out-of-date policy, a more powerful model, ...) and using that data for training is also off-policy. Our framework places no constraints here, but making good use of such off-policy data may require more advanced algorithms.
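To make the first case concrete, here is a minimal, framework-agnostic sketch (not verl's actual API; `policy.update` and `rollout_buffer` are hypothetical names) of splitting one iteration's rollout into mini-batches and reusing it for several gradient steps, which makes the later updates slightly off-policy with respect to the data:

```python
import random

def train_iteration(policy, rollout_buffer, num_mini_batches=4, num_epochs=1):
    """Reuse one iteration's rollout data across several mini-batch updates.

    After the first optimizer step the parameters differ from the behavior
    policy that collected the data, so later mini-batches are mildly
    off-policy; PPO-style importance ratios / clipping are typically used
    to correct for this.
    """
    indices = list(range(len(rollout_buffer)))
    for _ in range(num_epochs):                      # optionally sweep the data more than once
        random.shuffle(indices)
        batch_size = len(indices) // num_mini_batches
        for i in range(num_mini_batches):
            batch_idx = indices[i * batch_size:(i + 1) * batch_size]
            mini_batch = [rollout_buffer[j] for j in batch_idx]
            # Hypothetical hook: compute a clipped surrogate loss against the
            # stored behavior log-probs and take one optimizer step.
            policy.update(mini_batch)
```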