[recipe] Fix FlowRL actor to pure implementation

Open Xuekai-Zhu opened this issue 1 month ago • 0 comments

Summary

This PR refactors the FlowRL actor implementation by removing CISPO-specific features and simplifying it to a pure FlowRL trajectory balance objective with importance weight clipping.

Changes

Removed

  • Ablation study code: Deleted compute_flowrl_cispo_clip_ablation function and environment variable switching logic

Modified

  • Function rename: compute_flowrl_cispo_clip → compute_flowrl to better reflect the pure implementation
  • Simplified masking: Now uses response_mask directly without additional condition-based filtering
  • Cleaner metrics: Keeps essential metrics (log_prob, log_z, importance_weight, PPO KL, reference KL)

Kept

  • Core FlowRL objective: Trajectory balance loss L = E[w * (log Z + log p_θ - β*R - log p_ref)²]
  • Importance weight clipping: Maintains stability with max=10 clipping
  • Log partition function (log Z): Projection network for estimating partition function
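
For readers unfamiliar with the objective, the kept pieces can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: the function names flowrl_loss_sketch and masked_mean, the argument layout, the length-normalized (masked mean) log-probabilities, the sequence-level importance weight, and the default beta are all assumptions made for illustration. Only the loss form, the use of response_mask, and the max=10 clip come from the description above.

```python
import math


def masked_mean(values, mask):
    """Average the entries of `values` whose mask entry is 1 (hypothetical helper)."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / max(count, 1)


def flowrl_loss_sketch(log_prob, old_log_prob, ref_log_prob, reward, log_z,
                       response_mask, beta=1.0, clip_max=10.0):
    """Trajectory balance loss w * (log Z + log p_theta - beta*R - log p_ref)^2.

    Per-token inputs are lists of lists (one inner list per response);
    `reward` and `log_z` hold one scalar per response.
    Assumption: log-probs are averaged over response tokens, and the
    importance weight is computed once per sequence.
    """
    losses = []
    for lp, old_lp, ref_lp, r, lz, mask in zip(
        log_prob, old_log_prob, ref_log_prob, reward, log_z, response_mask
    ):
        seq_lp = masked_mean(lp, mask)       # log p_theta, response tokens only
        seq_old = masked_mean(old_lp, mask)  # log-prob under the rollout policy
        seq_ref = masked_mean(ref_lp, mask)  # log p_ref

        # Importance weight, clipped from above for stability.
        # In an autograd framework this weight would be treated as a constant.
        w = min(math.exp(seq_lp - seq_old), clip_max)

        residual = lz + seq_lp - beta * r - seq_ref
        losses.append(w * residual ** 2)
    return sum(losses) / len(losses)


# Usage: one two-token response whose second token is padding.
loss = flowrl_loss_sketch(
    log_prob=[[-1.0, -9.0]],
    old_log_prob=[[-1.0, -3.0]],
    ref_log_prob=[[-2.0, -7.0]],
    reward=[0.5],
    log_z=[0.0],
    response_mask=[[1, 0]],
    beta=2.0,
)
print(loss)
```

Because the mask is applied directly, the padded second token does not affect any of the three averaged log-probabilities, which mirrors the "uses response_mask directly" change listed under Modified.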

Xuekai-Zhu, Dec 03 '25 06:12