Fix: forbid repeated deepspeed.initialize on training objects
Previously closed PR: https://github.com/microsoft/DeepSpeed/pull/6848
Partially Fixes: https://github.com/microsoft/DeepSpeed/issues/6772 https://github.com/microsoft/DeepSpeed/issues/6771 https://github.com/microsoft/DeepSpeed/issues/6770 by forbidding repeated initialization.
What is changed:
- Mark `model`, `optimizer` and `lr_scheduler` passed as arguments to `deepspeed.initialize` with the flag `ds_is_inited = True`.
- Mark `engine`, `engine.optimizer` and `engine.lr_scheduler` in the return values of `deepspeed.initialize` with the flag `ds_is_inited = True`.
- When `deepspeed.initialize` is called, raise an exception if `ds_is_inited == True` is detected on the input `model`, `optimizer` or `lr_scheduler`.
Expected Behavior:
Forbid repeated `deepspeed.initialize` invocations on `model`, `optimizer` and `lr_scheduler` objects.
@microsoft-github-policy-service agree
This fix still interferes with some existing unit tests. Let me double-check before we proceed.
The unit tests in `tests/unit/runtime/test_ds_initialize.py` all passed.
The PR is ready for review @tjruwase.
I am not able to check other unit tests due to GPU memory constraint.
Thanks for the review @tjruwase.
- I added handling for callable types for `optimizer` and `lr_scheduler`: only objects are marked with `is_ds_inited`, not callables, since callables are stateless and their reuse should be allowed.
Regarding:

> I think this call should be moved into `_mark_trainobjs_initialized()`
I think `_assert_trainobjs_not_inited` should still be kept separate from `_mark_trainobjs_initialized`, because `_mark_trainobjs_initialized` is also called on the wrapped model and optimizers before `deepspeed.initialize` returns. The wrapped models may already have `is_ds_inited` set to `True`, since `DeepSpeedEngine` passes all model flags through on the wrapper.
If we still want to keep `_assert_trainobjs_not_inited` inside `_mark_trainobjs_initialized`, we can do any of the three:
- add a flag to `_mark_trainobjs_initialized` to indicate whether the `not_inited` check should be performed
- add/check the flag using `__dict__` rather than `setattr`/`getattr`
- check "inited" using type information as well, i.e. for `DeepSpeedEngine` types we directly return `inited == True` instead of checking for flags
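As a toy illustration of why the `__dict__` option matters (stub classes; the delegation here only mimics how an engine wrapper can forward attribute lookups to the wrapped module):

```python
class Module:
    pass


class EngineStub:
    # Minimal stand-in for an engine that delegates unknown attributes
    # to its wrapped module, as DeepSpeedEngine effectively does.
    def __init__(self, module):
        self.module = module

    def __getattr__(self, name):
        return getattr(self.module, name)


m = Module()
m.is_ds_inited = True          # flag set on the inner model
engine = EngineStub(m)

# getattr() sees the inner flag through delegation...
assert getattr(engine, "is_ds_inited", False) is True
# ...but the wrapper's own __dict__ does not contain it.
assert "is_ds_inited" not in engine.__dict__
```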
> If we still want to keep `_assert_trainobjs_not_inited` inside `_mark_trainobjs_initialized`, we can do either of the three:
Thanks for raising this important design issue. Below are my thoughts and/or clarifications.
- DeepSpeed is designed to optimize PyTorch, and currently does this by wrapping PyTorch constructs, e.g., nn.Module, optim.Optimizer, optim.lr_scheduler, etc.
- To avoid nesting issues, DeepSpeed should not wrap DeepSpeed data structures, such as DeepSpeedEngine, DeepSpeedOptimizer, DeepSpeedDataLoader, etc.
Based on the above, I am aligned with your 3rd suggestion (below) of leveraging type information to simplify this PR.
> 3. check "inited" using type information as well, i.e. for `DeepSpeedEngine` types we directly return `inited == True` instead of checking for flags.
What do you think?
Got it. Yeah, option 3 is technically what should be implemented instead of letting flags just fly around. Will get the update out by EOD.
@tjruwase Thanks for the suggestion. Please check the updated PR.
@tjruwase Hi Olatunji, are there any further modifications that you'd like me to do?
@traincheck-team - could you look into the failures on the nv-torch-latest-v100 test? We've stabilized the CI and these look to be real failures.
@loadams Thanks for getting back so quickly and I apologize for the late reply.
I inspected the 9 tests that failed. It appears to me that these tests failed because they initialized deepspeed using the same model multiple times, which is the exact behavior that this PR is trying to forbid. The PR itself is not buggy.
For example:
- in `tests/unit/checkpoint/test_other_optimizer.TestOtherOptimizerCheckpoint::test_checkpoint_unfused_optimizer` at https://github.com/microsoft/DeepSpeed/blob/c08e69f21238f15bfe0e3779170fefa2f75d4c7e/tests/unit/checkpoint/test_other_optimizer.py#L61-L69, two `checkpoint_correctness_verification` calls were made with the same `models` argument.
- in `tests/unit/runtime/half_precision/onebit/test_onebit.TestZeroOneAdamCheckpointing::test` at https://github.com/microsoft/DeepSpeed/blob/c08e69f21238f15bfe0e3779170fefa2f75d4c7e/tests/unit/runtime/half_precision/onebit/test_onebit.py#L589-L617, two `deepspeed.initialize` calls were made with exactly the same `model` parameter.
As far as I see there are two solutions:
- Modify the unit tests to use separate model copies. You may not want to risk doing this, but I feel it is the "right" thing to do.
- **Optionally suppress the exceptions via env or config vars.** This might be an easier and more practical solution.
FYI @traincheck-team - there still appear to be some unit test errors if you could investigate.
Thanks for the heads up. Will do that soon.
FYI @traincheck-team - you'll need to sign off on the DCO requirement now that we've changed the GH organization/structure. Let me know if you have any questions on that.
@loadams I tried to amend the six problematic commits using git rebase, but since there were previous merges from main into this branch, it caused conflicts and made things messy (e.g. see the requested reviewers of this PR as code owners).
To fix this, I plan to create a new local branch from main, reapply the necessary changes, and then force-push it to replace this branch. This should give us a clean history while preserving the required commits.
Let me know if you have any concerns or if there’s a better way to proceed!
@loadams Hi Logan, I apologize for the late reply. I’ve reviewed the 9 unit test failures in the recent workflow run: https://github.com/deepspeedai/DeepSpeed/actions/runs/13205140637/job/36866442471.
My understanding is that these failures are caused by repeated model initialization, which this PR now detects and raises exceptions for.
Evidence:
- Inspecting the affected locations confirms that the exception is raised precisely where repeated initialization occurs.
Impacted Tests:
These failed unit tests span 3 files: `test_onebit.py:249`, `test_onebit.py:616`, `test_onebit.py:997`, `test_other_optimizer.py:68`, `test_other_optimizer.py:107` and `test_zero_optimizer.py:365`.
Suggested Fix:
The fix is straightforward and manageable: ensure a new model object is created before the second initialization, which does not require modifying any critical logic. I suggest handling this in a separate PR to fix these tests.
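A sketch of that test-side fix (the `initialize` below is a dummy stand-in for `deepspeed.initialize` with this PR's guard): clone or rebuild the model before each call instead of reusing the already-initialized one.

```python
import copy


def initialize(model):
    # Dummy stand-in for deepspeed.initialize with this PR's guard.
    if getattr(model, "ds_is_inited", False):
        raise ValueError("repeated deepspeed.initialize on the same model")
    model.ds_is_inited = True
    return model


class ToyModel:
    pass


base = ToyModel()
# Each initialize() call gets its own copy of the un-initialized model,
# so neither call trips the repeated-initialization guard.
engine_a = initialize(copy.deepcopy(base))
engine_b = initialize(copy.deepcopy(base))
```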
If there are any additional failures beyond these 9, please let me know.