Is the training on-policy?
Each step conducts a rollout, so I guess the training is on-policy.
I guess so.
Is it possible to make it off-policy?
@XianglongTan
What kind of off-policy training would you like to have? Trajectories collected in one iteration can be split into multiple mini-batches, which is a mild form of off-policy training; this is easy to do since it is already supported by verl. In a broader sense, collecting data with any other policy (an out-of-date policy, a more powerful model, ...) and using that data for training is also off-policy. Our framework places no constraints here, but making good use of such off-policy data may require more advanced algorithms.
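To make the first case concrete, here is a minimal, framework-agnostic sketch (not verl's actual API; `policy.update` and `rollout_buffer` are hypothetical names) of splitting one iteration's rollout into mini-batches and reusing it for several gradient steps, which makes the later updates slightly off-policy with respect to the data:

```python
import random

def train_iteration(policy, rollout_buffer, num_mini_batches=4, num_epochs=1):
    """Reuse one iteration's rollout data across several mini-batch updates.

    After the first optimizer step the parameters differ from the behavior
    policy that collected the data, so later mini-batches are mildly
    off-policy; PPO-style importance ratios / clipping are typically used
    to correct for this.
    """
    indices = list(range(len(rollout_buffer)))
    for _ in range(num_epochs):                      # optionally sweep the data more than once
        random.shuffle(indices)
        batch_size = len(indices) // num_mini_batches
        for i in range(num_mini_batches):
            batch_idx = indices[i * batch_size:(i + 1) * batch_size]
            mini_batch = [rollout_buffer[j] for j in batch_idx]
            # Hypothetical hook: compute a clipped surrogate loss against the
            # stored behavior log-probs and take one optimizer step.
            policy.update(mini_batch)
```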