verl
[recipe] Fix FlowRL actor to pure implementation
Summary
This PR refactors the FlowRL actor by removing CISPO-specific features and reducing it to a pure FlowRL trajectory balance objective with importance weight clipping.
Changes
Removed
- Ablation study code: Deleted the `compute_flowrl_cispo_clip_ablation` function and the environment variable switching logic
Modified
- Function rename: `compute_flowrl_cispo_clip` → `compute_flowrl` to better reflect the pure implementation
- Simplified masking: Now uses `response_mask` directly without additional condition-based filtering
- Cleaner metrics: Keeps essential metrics (log_prob, log_z, importance_weight, PPO KL, reference KL)
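The simplified masking can be pictured as a plain masked mean over response tokens. The following is a minimal sketch in framework-free Python, assuming `response_mask` is a 0/1 sequence aligned with the per-token values. The helper name `masked_mean` and the choice of a mean reduction are illustrative assumptions, not taken from the PR.

```python
from typing import Sequence


def masked_mean(values: Sequence[float], response_mask: Sequence[int]) -> float:
    """Average `values` over positions where `response_mask` is 1.

    Hypothetical helper: the real actor operates on tensors, and its exact
    reduction is not shown in the PR description.
    """
    if len(values) != len(response_mask):
        raise ValueError("values and response_mask must have the same length")
    total = sum(v * m for v, m in zip(values, response_mask))
    count = sum(response_mask)
    # Guard against an all-padding row rather than dividing by zero.
    return total / count if count > 0 else 0.0


# Per-token log-probs for one sequence; the last two positions are padding.
token_log_probs = [-0.5, -1.5, -1.0, 0.0, 0.0]
response_mask = [1, 1, 1, 0, 0]
seq_log_prob = masked_mean(token_log_probs, response_mask)
```

Only the mask decides which tokens contribute, so no extra condition-based filter is involved in the reduction.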
Kept
- Core FlowRL objective: Trajectory balance loss `L = E[w * (log Z + log p_θ - β*R - log p_ref)²]`
- Importance weight clipping: Maintains stability with `max=10` clipping
- Log partition function (log Z): Projection network for estimating partition function
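To make the retained objective concrete, here is a rough sketch of the trajectory balance loss on sequence-level scalars in framework-free Python. It rests on assumptions the PR description does not confirm: the importance weight `w` is taken to be the ratio between the current and old policy probabilities, clipped from above at 10, and the function name `flowrl_tb_loss` and its argument names are hypothetical. In an autograd framework the weight would typically be treated as a constant, which is also an assumption here.

```python
import math
from typing import Sequence


def flowrl_tb_loss(
    log_z: Sequence[float],         # log Z estimate per sequence
    log_prob: Sequence[float],      # log p_theta per sequence
    old_log_prob: Sequence[float],  # old-policy log-prob (assumed source of w)
    ref_log_prob: Sequence[float],  # log p_ref per sequence
    reward: Sequence[float],        # R per sequence
    beta: float,                    # reward scale (beta in the formula)
    max_weight: float = 10.0,       # upper clip on the importance weight
) -> float:
    """Mean of w * (log Z + log p_theta - beta * R - log p_ref)^2."""
    losses = []
    for lz, lp, lo, lr, r in zip(log_z, log_prob, old_log_prob, ref_log_prob, reward):
        # Importance weight, clipped from above only (assumption: ratio to old policy).
        w = min(math.exp(lp - lo), max_weight)
        residual = lz + lp - beta * r - lr
        losses.append(w * residual ** 2)
    return sum(losses) / len(losses)


# Two toy sequences: the first has zero residual, the second hits the clip.
loss = flowrl_tb_loss(
    log_z=[0.5, 1.0],
    log_prob=[-1.0, 0.0],
    old_log_prob=[-1.0, -5.0],
    ref_log_prob=[-1.5, 0.0],
    reward=[1.0, 0.0],
    beta=1.0,
)
```

In the second toy sequence the unclipped ratio is `exp(5)`, so the clip caps its contribution at `10 * 1.0² = 10`; without the cap, that single sample would dominate the batch average.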