fix: Add helpful error message if prepare_for_*() not called
#1141
Adds checks to train, get_logprobs, and generate in MegatronPolicyWorker to raise a clear error if the model is not prepared for GPU execution.
Please let me know if my approach or fix needs any improvements. I'm open to feedback and happy to make changes based on suggestions. Thank you!
Summary by CodeRabbit
New Features
- Adds a readiness check requiring explicit preparation before training or inference, improving safety and clarity.
- Provides consistent, user-friendly error messages if operations are attempted before preparation.

Bug Fixes
- Prevents accidental GPU execution before the model is properly prepared, reducing crashes and undefined behavior during training and generation.
📝 Walkthrough
Adds a per-instance readiness flag to MegatronPolicyWorker. Methods train, get_logprobs, and generate now raise RuntimeError if called before preparation. The flag is set during prepare_for_training and prepare_for_lp_inference. This enforces an explicit preparation step before GPU-bound execution.
Changes
| Cohort / File(s) | Summary |
|---|---|
| Readiness gating for GPU execution<br>`nemo_rl/models/policy/megatron_policy_worker.py` | Introduced `is_prepared` flag (default `False`). Added guards in `train`, `get_logprobs`, and `generate` to raise `RuntimeError` if not prepared. Set `is_prepared = True` in `prepare_for_training` and `prepare_for_lp_inference`. Centralized the error message for unprepared invocation. |
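The gating pattern summarized above can be sketched as follows. This is a minimal illustration of the flag-and-guard approach, not the actual `MegatronPolicyWorker` code: method bodies are stubbed, and the error message text is an assumption.

```python
class MegatronPolicyWorker:
    """Minimal sketch of the readiness-gating pattern (not the real class)."""

    # Centralized message so all three guarded methods report consistently.
    _NOT_PREPARED_MSG = (
        "Model is not prepared for GPU execution. Call prepare_for_training() "
        "or prepare_for_lp_inference() first."
    )

    def __init__(self):
        # Defaults to False until an explicit prepare_for_*() call.
        self.is_prepared = False

    def _check_prepared(self):
        if not self.is_prepared:
            raise RuntimeError(self._NOT_PREPARED_MSG)

    def prepare_for_training(self):
        # Real code would also move weights/optimizer state onto the GPU here.
        self.is_prepared = True

    def prepare_for_lp_inference(self):
        # Real code would stage the model for log-prob inference here.
        self.is_prepared = True

    def train(self):
        self._check_prepared()
        # ... GPU-bound training step ...

    def get_logprobs(self):
        self._check_prepared()
        # ... GPU-bound log-prob computation ...

    def generate(self):
        self._check_prepared()
        # ... GPU-bound generation ...
```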
Sequence Diagram(s)
sequenceDiagram
autonumber
participant C as Caller
participant W as MegatronPolicyWorker
Note over W: is_prepared defaults to False
C->>W: prepare_for_training() / prepare_for_lp_inference()
activate W
W->>W: set is_prepared = True
deactivate W
alt Prepared
C->>W: train()/get_logprobs()/generate()
W-->>C: proceed with GPU-bound execution
else Not prepared
C->>W: train()/get_logprobs()/generate()
W-->>C: RuntimeError("Model must be prepared before execution")
end
Note over W: Guards enforce explicit preparation before use
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Test Results For Major Changes | ✅ Passed | The PR adds readiness guards and a boolean flag to MegatronPolicyWorker to raise clear errors if training/inference methods are called before preparation. This is a behavioral safety check that does not alter numerics, convergence, or steady-state performance when properly prepared, and only adds early-exit errors otherwise. The PR description does not include test results, but given the scope, this qualifies as a minor change under the check’s criteria. Therefore, test results are not required for this PR to pass the check. |
| Title Check | ✅ Passed | The title clearly and concisely captures the primary change—adding an explicit error message when prepare_for_*() is not called—using straightforward language that aligns with the PR’s objective of enforcing preparation steps before GPU execution. |
@terrykong I've updated the changes as per your suggestions, please let me know if this needs improvement.
Hi @Aniketsy. Right now it looks like once prepare is called, the flag is never set back to false, so you can still hit the situation where execution fails without this nice message because the parameters were offloaded.
Do you see any way to do this without introducing a complex state machine?
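One lightweight option, sketched below, would be to clear the same boolean wherever parameters leave the GPU, so no state machine beyond the existing flag is needed. The `offload_to_cpu` hook here is hypothetical (the real worker's offload method may be named differently), and this is a suggestion, not code from the PR:

```python
class Worker:
    """Sketch: one boolean tracks readiness; offload paths clear it."""

    def __init__(self):
        self.is_prepared = False

    def prepare_for_training(self):
        # ... move params/optimizer state onto the GPU ...
        self.is_prepared = True

    def offload_to_cpu(self):  # hypothetical offload hook
        # ... move params off the GPU ...
        # Clearing the flag means a later train()/generate() fails loudly
        # with the helpful message instead of crashing on offloaded params.
        self.is_prepared = False

    def train(self):
        if not self.is_prepared:
            raise RuntimeError("Model must be prepared before execution")
        # ... GPU-bound training step ...
```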
@terrykong Sorry, I missed the notification. I’ll look into this again. I’ll also explore if there’s a way to address this without introducing a complex state machine.