[recipe] Fix FlowRL actor to pure implementation

Open Xuekai-Zhu opened this issue 1 month ago • 0 comments

Summary

This PR refactors the FlowRL actor by removing CISPO-specific features, leaving a pure FlowRL trajectory balance objective with importance weight clipping.

Changes

Removed

- Ablation study code: Deleted the compute_flowrl_cispo_clip_ablation function and the environment variable switching logic

Modified
- Function rename: compute_flowrl_cispo_clip → compute_flowrl to better reflect the pure implementation
- Simplified masking: Now uses response_mask directly without additional condition-based filtering
- Cleaner metrics: Keeps essential metrics (log_prob, log_z, importance_weight, PPO KL, reference KL)

Kept
- Core FlowRL objective: Trajectory balance loss L = E[w * (log Z + log p_θ - β*R - log p_ref)²]
- Importance weight clipping: Clips importance weights at a maximum of 10 to maintain stability
- Log partition function (log Z): Projection network for estimating partition function
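
For reviewers who want a concrete picture of the objective listed under Kept, here is a minimal pure-Python sketch. It is not the code in this PR. The function name flowrl_loss_sketch, the masked_mean helper, the argument names, and the beta default are hypothetical. It also assumes that log p_θ and log p_ref are averaged over the tokens selected by response_mask, that R is one scalar per sequence, and that w is the ratio of current to old policy probabilities over the response tokens; the description above does not spell these details out.

```python
import math


def masked_mean(values, mask):
    """Average `values` over the positions where `mask` is 1."""
    kept = sum(mask)
    return sum(v * m for v, m in zip(values, mask)) / max(kept, 1)


def flowrl_loss_sketch(log_z, log_prob, old_log_prob, ref_log_prob,
                       reward, response_mask, beta=1.0, w_max=10.0):
    """Illustrative L = E[w * (log Z + log p_theta - beta*R - log p_ref)^2].

    log_z, reward: one float per sequence.
    log_prob, old_log_prob, ref_log_prob, response_mask: one list of
    per-token values per sequence (mask entries are 0 or 1).
    """
    losses = []
    for i, mask in enumerate(response_mask):
        # Assumption: token log-probs are length-normalized with the mask.
        avg_log_prob = masked_mean(log_prob[i], mask)
        avg_ref_log_prob = masked_mean(ref_log_prob[i], mask)

        # Trajectory balance residual from the formula under "Kept".
        residual = (log_z[i] + avg_log_prob
                    - beta * reward[i] - avg_ref_log_prob)

        # Assumption: w is the current/old policy ratio over response tokens.
        log_ratio = sum((new - old) * m for new, old, m
                        in zip(log_prob[i], old_log_prob[i], mask))
        # Clamping in log space is the same as clamping w at w_max and
        # avoids overflow in exp. In an autograd framework, w would
        # typically be treated as a constant (no gradient through it).
        weight = math.exp(min(log_ratio, math.log(w_max)))

        losses.append(weight * residual ** 2)
    return sum(losses) / len(losses)


# Toy batch with one sequence; the third token is padding (mask 0),
# so its out-of-range value 99.0 does not affect the result.
loss = flowrl_loss_sketch(
    log_z=[1.0],
    log_prob=[[-1.0, -1.0, 99.0]],
    old_log_prob=[[-1.0, -1.0, 0.0]],
    ref_log_prob=[[-2.0, -2.0, 0.0]],
    reward=[0.5],
    response_mask=[[1, 1, 0]],
    beta=2.0,
)
```

In the toy call the current and old policies agree on the unmasked tokens, so the weight is 1 and the loss is just the squared residual. Making the old log-probs much smaller drives the ratio up until the max=10 clip takes over.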

Dec 03 '25 06:12 Xuekai-Zhu
