feat: force on-policy ratio to 1
What does this PR do?
Adds a `force_on_policy_ratio` option to `ClippedPGLoss` that forces the PPO probability ratio to 1.0 for strictly on-policy GRPO training, with the prerequisite that `train_global_batch_size` equals `num_prompts_per_step * num_generations_per_prompt`.
Issues
List issues that this PR closes (syntax):
Usage
- Enable `force_on_policy_ratio` in the `loss_fn` section of a GRPO config; a hedged example follows below.
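A minimal sketch of enabling the option, assuming the top-level `grpo`, `policy`, and `loss_fn` sections used in `examples/configs/grpo_math_1B.yaml`; the surrounding values are illustrative:

```yaml
grpo:
  num_prompts_per_step: 32          # illustrative values
  num_generations_per_prompt: 16

policy:
  train_global_batch_size: 512      # must equal 32 * 16 when the flag is enabled

loss_fn:
  # Force the PPO probability ratio to exactly 1.0 (strictly on-policy updates).
  force_on_policy_ratio: true
```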
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
- [ ] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.
Additional Information
- ...
Summary by CodeRabbit
- New Features
  - Added `force_on_policy_ratio` configuration option to enforce on-policy PPO behavior.
  - Introduced probability ratio min/max metrics to track policy optimization dynamics.
- Chores
  - Updated configuration files and tests to support the new on-policy ratio control option.
📝 Walkthrough
A new configuration option `force_on_policy_ratio` (default `false`) is added across configuration files and implemented in the core loss and training logic. When enabled, the PPO ratio is forced to 1.0, with the prerequisite that `train_global_batch_size` equals `num_prompts_per_step * num_generations_per_prompt`. The implementation includes initialization-time validation, min/max metrics tracking for probability ratios, and unit tests.
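A minimal sketch of one way such a flag could be realized in a clipped policy-gradient loss; the function name, signature, and the detach-based trick below are illustrative assumptions, not the actual `ClippedPGLoss` code:

```python
import torch

def clipped_pg_loss(
    curr_logprobs: torch.Tensor,   # log-probs under the current policy
    prev_logprobs: torch.Tensor,   # log-probs recorded at generation time
    advantages: torch.Tensor,
    ratio_clip: float = 0.2,
    force_on_policy_ratio: bool = False,
) -> torch.Tensor:
    if force_on_policy_ratio:
        # Strictly on-policy: force the ratio's *value* to exactly 1.0 while keeping
        # the log-prob gradient, i.e. exp(logp - logp.detach()) == 1 numerically,
        # so the update reduces to a plain policy-gradient step.
        ratio = torch.exp(curr_logprobs - curr_logprobs.detach())
    else:
        ratio = torch.exp(curr_logprobs - prev_logprobs)

    clamped = torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip)
    # Standard PPO clipped surrogate, negated for minimization.
    return -torch.minimum(ratio * advantages, clamped * advantages).mean()
```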
Changes
| Cohort / File(s) | Summary |
|---|---|
| **Configuration files**<br>`examples/configs/grpo_math_1B.yaml`, `examples/configs/vlm_grpo_3B.yaml`, `examples/configs/vlm_grpo_3B_megatron.yaml`, `examples/penguin/grpo_dapo17k_bytedtsinghua_qwen3_4binstruct_nf.yaml` | Added `force_on_policy_ratio: false` configuration field under `loss_fn`, with a comment noting the prerequisite constraint on batch sizes. |
| **Core loss function logic**<br>`nemo_rl/algorithms/loss_functions.py` | Introduced the `force_on_policy_ratio` toggle in `ClippedPGLoss` to force the PPO ratio to 1.0 when enabled. Added min/max tracking metrics for probability ratios (`probs_ratio_min`, `probs_ratio_max`, `probs_ratio_clamped_min`, `probs_ratio_clamped_max`) with infinite-value filtering. Extended `SequencePackingLossWrapper` metric aggregation logic. Added a `math` import for safe infinity handling. |
| **GRPO training logic**<br>`nemo_rl/algorithms/grpo.py` | Added validation after loss-function initialization to ensure `force_on_policy_ratio` is only enabled when `train_global_batch_size` equals `num_prompts_per_step * num_generations_per_prompt` (a hedged validation sketch appears after this table). Extended metrics aggregation in both `grpo_train` and `async_grpo_train` to compute min/max for ratio-related keys with infinite-value filtering. |
| **Unit tests**<br>`tests/unit/algorithms/test_grpo.py`, `tests/unit/algorithms/test_loss_functions.py`, `tests/unit/algorithms/test_sequence_packing_gradients.py`, `tests/unit/models/policy/test_dtensor_worker.py`, `tests/unit/models/policy/test_megatron_worker.py` | Added `force_on_policy_ratio: False` to the `ClippedPGLossFn` configuration in test fixtures. Added a new unit test `test_clipped_pg_loss_force_on_policy_ratio` validating ratio forcing and metrics propagation. |
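A hedged sketch of the prerequisite check described in the GRPO row above; the config keys mirror the YAML fields named in this PR, but the function itself and the dict-based config access are illustrative:

```python
def check_force_on_policy_ratio(loss_cfg: dict, grpo_cfg: dict, policy_cfg: dict) -> None:
    """Illustrative check: forcing the ratio to 1.0 only makes sense when every
    rollout generated in a step is consumed by exactly one optimizer step."""
    if not loss_cfg.get("force_on_policy_ratio", False):
        return
    expected = grpo_cfg["num_prompts_per_step"] * grpo_cfg["num_generations_per_prompt"]
    actual = policy_cfg["train_global_batch_size"]
    if actual != expected:
        raise ValueError(
            "force_on_policy_ratio=True requires train_global_batch_size "
            f"({actual}) to equal num_prompts_per_step * num_generations_per_prompt ({expected})."
        )
```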
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
- `nemo_rl/algorithms/loss_functions.py`: Contains the core logic for `force_on_policy_ratio` enforcement and the new metrics tracking; requires careful review of the ratio computation path, infinite-value handling, and metric aggregation logic (an illustrative aggregation sketch follows this list).
- `nemo_rl/algorithms/grpo.py`: Validation logic and metrics aggregation changes across two training functions; verify the constraint check is properly applied in both code paths.
- `tests/unit/algorithms/test_loss_functions.py`: The new test validates the forced-ratio behavior; verify the synthetic data setup and expected loss computation are correct.
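A minimal sketch of the kind of min/max aggregation with infinite-value filtering described above; the metric key names come from this PR, while the helper itself and the treatment of missing values are assumptions:

```python
import math

def aggregate_ratio_extrema(per_microbatch: list[dict[str, float]]) -> dict[str, float]:
    """Illustrative reduction of per-microbatch ratio metrics into step-level min/max,
    skipping infinite or absent values."""
    out: dict[str, float] = {}
    for key in ("probs_ratio_min", "probs_ratio_clamped_min"):
        vals = [m[key] for m in per_microbatch if key in m and math.isfinite(m[key])]
        out[key] = min(vals) if vals else float("nan")
    for key in ("probs_ratio_max", "probs_ratio_clamped_max"):
        vals = [m[key] for m in per_microbatch if key in m and math.isfinite(m[key])]
        out[key] = max(vals) if vals else float("nan")
    return out
```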
Possibly related PRs
- NVIDIA-NeMo/RL#1382: Modifies ClippedPGLossConfig typing and related loss function infrastructure, directly overlapping with configuration and initialization logic in this PR.
Suggested reviewers
- parthchadha
- ashors1
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Test Results For Major Changes | ⚠️ Warning | PR introduces significant GRPO algorithm feature but lacks documented test results and validation evidence in description. | Add test results section documenting unit test passes, new test outcomes, regression testing with flag disabled, and performance metrics confirming feature works without degradation. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: force on-policy ratio to 1' directly summarizes the main change across all modified files - adding a force_on_policy_ratio configuration option to force PPO ratios to 1.0. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
✨ Finishing touches
- [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
- [ ] Commit unit tests in branch `yifu/force_on_policy`