
feat: force on-policy ratio to 1

Open · yfw opened this pull request 1 month ago • 1 comment

What does this PR do?

Adds a force_on_policy_ratio configuration option that forces the PPO probability ratio to 1.0, for strictly on-policy training.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
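A minimal example of enabling the flag. The `loss_fn` placement and field name come from this PR's summary; the surrounding section names and values are illustrative assumptions based on the modified GRPO example configs:

```yaml
loss_fn:
  # Requires train_global_batch_size == num_prompts_per_step * num_generations_per_prompt
  force_on_policy_ratio: true

grpo:                             # section layout assumed for illustration
  num_prompts_per_step: 32
  num_generations_per_prompt: 8

policy:
  train_global_batch_size: 256    # must equal 32 * 8 when the flag is enabled
```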

Before your PR is "Ready for review"

Pre checks:

  • [ ] Make sure you have read and followed the Contributor guidelines.
  • [ ] Did you write any new necessary tests?
  • [ ] Did you run the unit and functional tests locally? Visit our Testing Guide for how to run tests.
  • [ ] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added force_on_policy_ratio configuration option to enforce on-policy PPO behavior.
    • Introduced probability ratio min/max metrics to track policy optimization dynamics.
  • Chores

    • Updated configuration files and tests to support the new on-policy ratio control option.


yfw · Nov 17 '25 18:11

📝 Walkthrough

Walkthrough

A new configuration option force_on_policy_ratio (default false) is added across configuration files and implemented in core loss and training logic. When enabled, the PPO ratio is forced to 1.0 with the prerequisite that train_global_batch_size equals num_prompts_per_step * num_generations_per_prompt. Implementation includes initialization validation, min/max metrics tracking for probability ratios, and unit tests.
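The core mechanism can be sketched as follows. This is a minimal stand-alone illustration, not the actual ClippedPGLoss code: the function name, the `ratio_clip` bound, and the use of plain Python lists instead of tensors are all assumptions made for clarity.

```python
import math

def clipped_pg_loss(logprobs, old_logprobs, advantages,
                    ratio_clip=0.2, force_on_policy_ratio=False):
    """Per-token clipped policy-gradient loss (illustrative sketch)."""
    losses = []
    for lp, old_lp, adv in zip(logprobs, old_logprobs, advantages):
        if force_on_policy_ratio:
            # Skip exp(logprob difference) entirely: the ratio is pinned to
            # exactly 1.0, which is only valid when every sample in the batch
            # was generated by the current policy (strictly on-policy data).
            ratio = 1.0
        else:
            ratio = math.exp(lp - old_lp)
        # PPO clipping: bound the ratio, then take the pessimistic objective.
        clamped = min(max(ratio, 1.0 - ratio_clip), 1.0 + ratio_clip)
        losses.append(-min(ratio * adv, clamped * adv))
    return sum(losses) / len(losses)
```

With `force_on_policy_ratio=True` the result reduces to `-mean(advantages)`, independent of any numerical drift between the training and inference log-probabilities, which is the point of the flag.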

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration files**<br>`examples/configs/grpo_math_1B.yaml`, `examples/configs/vlm_grpo_3B.yaml`, `examples/configs/vlm_grpo_3B_megatron.yaml`, `examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml` | Added a `force_on_policy_ratio: false` field under `loss_fn`, with a comment noting the prerequisite constraint on batch sizes. |
| **Core loss function logic**<br>`nemo_rl/algorithms/loss_functions.py` | Introduced the `force_on_policy_ratio` toggle in `ClippedPGLoss` to force the PPO ratio to 1.0 when enabled. Added min/max tracking metrics for probability ratios (`probs_ratio_min`, `probs_ratio_max`, `probs_ratio_clamped_min`, `probs_ratio_clamped_max`) with infinite-value filtering. Extended `SequencePackingLossWrapper` metric aggregation logic. Added a `math` import for safe infinity handling. |
| **GRPO training logic**<br>`nemo_rl/algorithms/grpo.py` | Added validation after loss-function initialization to ensure `force_on_policy_ratio` is only enabled when `train_global_batch_size` equals `num_prompts_per_step * num_generations_per_prompt`. Extended metrics aggregation in both `grpo_train` and `async_grpo_train` to compute min/max for ratio-related keys with infinite-value filtering. |
| **Unit tests**<br>`tests/unit/algorithms/test_grpo.py`, `tests/unit/algorithms/test_loss_functions.py`, `tests/unit/algorithms/test_sequence_packing_gradients.py`, `tests/unit/models/policy/test_dtensor_worker.py`, `tests/unit/models/policy/test_megatron_worker.py` | Added `force_on_policy_ratio: False` to the `ClippedPGLossFn` configuration in test fixtures. Added a new unit test, `test_clipped_pg_loss_force_on_policy_ratio`, validating ratio forcing and metrics propagation. |
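The batch-size prerequisite described above could be enforced with a check of roughly this shape. This is a hypothetical sketch: the function name, dict-based config access, and error message are assumptions, and the actual validation in `nemo_rl/algorithms/grpo.py` may differ.

```python
def validate_on_policy_config(loss_cfg: dict, grpo_cfg: dict, policy_cfg: dict) -> None:
    """Reject configs where forcing ratio=1.0 would silently go off-policy.

    If the global train batch is smaller than one full generation step
    (prompts per step x generations per prompt), some rollouts are consumed
    in a later optimizer step, after the policy has already moved, so the
    true probability ratio would no longer be 1.0.
    """
    if not loss_cfg.get("force_on_policy_ratio", False):
        return  # flag disabled: no constraint to enforce
    expected = grpo_cfg["num_prompts_per_step"] * grpo_cfg["num_generations_per_prompt"]
    actual = policy_cfg["train_global_batch_size"]
    if actual != expected:
        raise ValueError(
            f"force_on_policy_ratio requires train_global_batch_size ({actual}) "
            f"== num_prompts_per_step * num_generations_per_prompt ({expected})"
        )
```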

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • nemo_rl/algorithms/loss_functions.py: Contains the core logic for force_on_policy_ratio enforcement and new metrics tracking; requires careful review of the ratio computation path, infinite-value handling, and metric aggregation logic.
  • nemo_rl/algorithms/grpo.py: Validation logic and metrics aggregation changes across two training functions; verify constraint checking is properly applied in both code paths.
  • tests/unit/algorithms/test_loss_functions.py: New test validates forced ratio behavior; verify synthetic data setup and expected loss computation are correct.
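The min/max aggregation with infinite-value filtering mentioned above might look like the following. This is a simplified stand-alone sketch over Python floats; the real implementation aggregates tensor metrics across microbatches and ranks, and the function name is an assumption.

```python
import math

def aggregate_ratio_extrema(values):
    """Return (min, max) over values, ignoring +/-inf sentinels.

    A microbatch that contributes no valid tokens can report +inf for its
    running min and -inf for its running max; those sentinels must be
    dropped before aggregating, or they would poison the global extrema.
    """
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return None, None  # no valid data in any microbatch
    return min(finite), max(finite)
```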

Possibly related PRs

  • NVIDIA-NeMo/RL#1382: Modifies ClippedPGLossConfig typing and related loss function infrastructure, directly overlapping with configuration and initialization logic in this PR.

Suggested reviewers

  • parthchadha
  • ashors1

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00%, below the required 80.00% threshold. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Test Results For Major Changes | ⚠️ Warning | The PR introduces a significant GRPO algorithm feature but lacks documented test results and validation evidence in the description. | Add a test results section documenting unit test passes, new test outcomes, regression testing with the flag disabled, and performance metrics confirming the feature works without degradation. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: force on-policy ratio to 1' directly summarizes the main change across all modified files: adding a `force_on_policy_ratio` configuration option to force PPO ratios to 1.0. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


coderabbitai[bot] · Dec 02 '25 07:12