verl
Support Flow-GRPO and MixGRPO
Feature request
Add support for Flow-GRPO and MixGRPO.
Motivation
An increasing number of papers on multimodal generative models are exploring Reinforcement Learning (RL) instead of Direct Preference Optimization (DPO).
Your contribution
TODO