
Fix: forbid repeated deepspeed.initialize on training objects

Open traincheck-team opened this issue 1 year ago • 15 comments

Previously closed PR: https://github.com/microsoft/DeepSpeed/pull/6848

Partially Fixes: https://github.com/microsoft/DeepSpeed/issues/6772 https://github.com/microsoft/DeepSpeed/issues/6771 https://github.com/microsoft/DeepSpeed/issues/6770 by forbidding repeated initialization.

What changed:

  1. Mark the 'model', 'optimizer' and 'lr_scheduler' arguments of deepspeed.initialize with the flag ds_is_inited = True.
  2. Mark the 'engine', 'engine.optimizer' and 'engine.lr_scheduler' return values of deepspeed.initialize with the flag ds_is_inited = True.
  3. When deepspeed.initialize is called, raise an exception if ds_is_inited == True is detected on the input model, optimizer or lr_scheduler.

Expected Behavior: Forbid repeated deepspeed.initialize invocations on model, optimizer and lr_scheduler objects.
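The three steps above can be sketched as follows. This is an illustrative reduction, not DeepSpeed's actual code: `Engine` and `RepeatedInitError` are stand-ins for the real `DeepSpeedEngine` and whatever exception type the PR raises.

```python
# Illustrative sketch of the repeated-initialization guard; Engine and
# RepeatedInitError are stand-ins, not DeepSpeed's real classes.
class RepeatedInitError(RuntimeError):
    pass

class Engine:
    """Stand-in for DeepSpeedEngine: wraps the training objects."""
    def __init__(self, model, optimizer, lr_scheduler):
        self.module = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler

def initialize(model, optimizer=None, lr_scheduler=None):
    # Step 3: reject any input object that was initialized before.
    for obj in (model, optimizer, lr_scheduler):
        if obj is not None and getattr(obj, "ds_is_inited", False):
            raise RepeatedInitError(
                f"{type(obj).__name__} was already passed to deepspeed.initialize")
    engine = Engine(model, optimizer, lr_scheduler)
    # Steps 1 and 2: mark the inputs and the returned wrappers.
    for obj in (model, optimizer, lr_scheduler,
                engine, engine.optimizer, engine.lr_scheduler):
        if obj is not None:
            obj.ds_is_inited = True
    return engine
```

A second `initialize(model)` call on the same object then fails fast instead of silently re-wrapping it.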

traincheck-team avatar Dec 16 '24 00:12 traincheck-team

@microsoft-github-policy-service agree

traincheck-team avatar Dec 16 '24 00:12 traincheck-team

This fix still conflicts with some existing unit tests. Let me double-check before we proceed.

traincheck-team avatar Dec 16 '24 20:12 traincheck-team

The unit tests in tests/unit/runtime/test_ds_initialize.py all pass. The PR is ready for review @tjruwase.

I am not able to run the other unit tests due to GPU memory constraints.

traincheck-team avatar Dec 16 '24 21:12 traincheck-team

Thanks for the review @tjruwase.

  • I added handling for callable types for optimizer and lr_scheduler: only objects are marked with is_ds_inited, not callables, since callables are not stateful and their reuse should be allowed.

Regarding,

I think this call should be moved into _mark_trainobjs_initialized()

I think _assert_trainobjs_not_inited should still be kept separate from _mark_trainobjs_initialized, because _mark_trainobjs_initialized is also called on the wrapped model and optimizer before deepspeed.initialize returns. The wrapped model may already have is_ds_inited set to True, since DeepSpeedEngine passes all model attributes through on the wrapper.

If we still want to keep _assert_trainobjs_not_inited inside _mark_trainobjs_initialized, we can do one of the following three things:

  1. add a flag to _mark_trainobjs_initialized indicating whether the not_inited check should be performed
  2. add/check the flag via __dict__ rather than setattr/getattr
  3. check whether an object is initialized using type information as well, i.e. for DeepSpeedEngine types directly return inited == True instead of checking flags.
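Options 2 and 3 can be sketched as follows. `Wrapper` below mimics how an engine's `__getattr__` forwards unknown attribute reads to the wrapped model, which is exactly why a plain getattr check "sees" the inner model's flag; the names are illustrative.

```python
class Model:
    pass

class Wrapper:
    """Mimics DeepSpeedEngine's attribute passthrough (illustrative only)."""
    def __init__(self, module):
        self.module = module
    def __getattr__(self, name):  # only called when normal lookup fails
        return getattr(self.module, name)

m = Model()
m.is_ds_inited = True
w = Wrapper(m)

# A plain getattr leaks the inner model's flag through the wrapper:
assert getattr(w, "is_ds_inited", False) is True
# Option 2: checking __dict__ inspects only the wrapper's own attributes:
assert "is_ds_inited" not in w.__dict__

# Option 3: use type information instead of (or in addition to) flags:
def is_inited(obj):
    return isinstance(obj, Wrapper) or getattr(obj, "is_ds_inited", False)

assert is_inited(w) and is_inited(m)
```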

traincheck-team avatar Dec 19 '24 18:12 traincheck-team

If we still want to keep _assert_trainobjs_not_inited inside _mark_trainobjs_initialized, we can do one of the following three things:

Thanks for raising this important design issue. Below are my thoughts and/or clarifications.

  1. DeepSpeed is designed to optimize PyTorch, and currently does this by wrapping PyTorch constructs, e.g., nn.Module, optim.Optimizer, optim.lr_scheduler, etc.
  2. To avoid nesting issues, DeepSpeed should not wrap DeepSpeed data structures, such as DeepSpeedEngine, DeepSpeedOptimizer, DeepSpeedDataLoader, etc.

Based on the above, I am aligned with your 3rd suggestion (below) of leveraging type information to simplify this PR.

3. check whether an object is initialized using type information as well, i.e. for DeepSpeedEngine types directly return inited == True instead of checking flags.

What do you think?

tjruwase avatar Dec 28 '24 16:12 tjruwase

Got it. Yeah, option 3 is technically what should be implemented instead of letting flags fly around. Will get the update out by EOD.

traincheck-team avatar Dec 28 '24 17:12 traincheck-team

@tjruwase Thanks for the suggestion. Please check the updated PR.

traincheck-team avatar Dec 30 '24 02:12 traincheck-team

@tjruwase Hi Olatunji, are there any further modifications that you'd like me to do?

traincheck-team avatar Jan 13 '25 21:01 traincheck-team

@tjruwase Hi Olatunji, are there any further modifications that you'd like me to do?

@traincheck-team - could you look into the failures on the nv-torch-latest-v100 test? We've stabilized the CI and these look to be real failures.

loadams avatar Jan 13 '25 22:01 loadams

@loadams Thanks for getting back so quickly and I apologize for the late reply.

I inspected the 9 tests that failed. It appears these tests fail because they initialize deepspeed with the same model multiple times, which is exactly the behavior this PR forbids. The PR itself is not buggy.

For example:

  1. in tests/unit/checkpoint/test_other_optimizer.TestOtherOptimizerCheckpoint::test_checkpoint_unfused_optimizer at https://github.com/microsoft/DeepSpeed/blob/c08e69f21238f15bfe0e3779170fefa2f75d4c7e/tests/unit/checkpoint/test_other_optimizer.py#L61-L69, two checkpoint_correctness_verification calls were invoked with the same models argument.
  2. in tests/unit/runtime/half_precision/onebit/test_onebit.TestZeroOneAdamCheckpointing::test at https://github.com/microsoft/DeepSpeed/blob/c08e69f21238f15bfe0e3779170fefa2f75d4c7e/tests/unit/runtime/half_precision/onebit/test_onebit.py#L589-L617, two deepspeed.initialize calls were invoked with exactly the same model parameter.

As far as I can see, there are two solutions:

  1. Modify the unit tests to use separate model copies. You may not want to risk doing this, but I feel it is the "right" thing to do.
  2. **Optionally suppress the exceptions via env or config vars.** This might be an easier and more practical solution.
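Solution 2 could look roughly like this; the variable name `DS_ALLOW_REPEATED_INIT` is purely hypothetical, not an existing DeepSpeed setting.

```python
import os

def _repeated_init_check_enabled():
    """Hypothetical opt-out: setting DS_ALLOW_REPEATED_INIT=1 suppresses the
    repeated-initialization exception, e.g. for legacy tests that
    deliberately re-initialize the same model."""
    return os.environ.get("DS_ALLOW_REPEATED_INIT", "0") != "1"

def assert_not_inited(obj):
    if not _repeated_init_check_enabled():
        return  # user explicitly opted out of the check
    if getattr(obj, "is_ds_inited", False):
        raise RuntimeError("object was already passed to deepspeed.initialize")
```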

traincheck-team avatar Jan 16 '25 21:01 traincheck-team

FYI @traincheck-team - there still appear to be some unit test errors if you could investigate.

loadams avatar Jan 31 '25 20:01 loadams

Thanks for the heads up. Will do that soon.


traincheck-team avatar Feb 01 '25 14:02 traincheck-team

FYI @traincheck-team - you'll need to sign off on the DCO requirement now that we've changed the GH organization/structure. Let me know if you have any questions on that.

loadams avatar Feb 07 '25 17:02 loadams

@loadams I tried to amend the six problematic commits using git rebase, but since there were previous merges from main into this branch, it caused conflicts and made things messy (e.g., note the code owners now listed as requested reviewers on this PR).

To fix this, I plan to create a new local branch from main, reapply the necessary changes, and then force-push it to replace this branch. This should give us a clean history while preserving the required commits.

Let me know if you have any concerns or if there’s a better way to proceed!

traincheck-team avatar Feb 09 '25 05:02 traincheck-team

@loadams Hi Logan, I apologize for the late reply. I’ve reviewed the 9 unit test failures in the recent workflow run: https://github.com/deepspeedai/DeepSpeed/actions/runs/13205140637/job/36866442471.

My understanding is that these failures are caused by repeated model initialization, which this PR now detects and raises exceptions for.

Evidence:

  • Inspecting the affected locations confirms that the exception is raised precisely where repeated initialization occurs.

Impacted Tests:

These failed unit tests span 3 files:

  • test_onebit.py:249, test_onebit.py:616, test_onebit.py:997,
  • test_other_optimizer.py:68, test_other_optimizer.py:107 and
  • test_zero_optimizer.py:365.

Suggested Fix:

The fix is straightforward and manageable: ensure a new model object is created before the second initialization, which does not require modifying any critical logic. I suggest handling these test fixes in a separate PR.
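The fix can be sketched with stand-ins for the real test helpers; `SimpleModel` and `checked_initialize` below are illustrative, not the actual test code or DeepSpeed API.

```python
class SimpleModel:
    """Stand-in for the model classes used in the affected tests."""
    pass

def checked_initialize(model):
    """Stand-in for deepspeed.initialize with this PR's guard."""
    if getattr(model, "ds_is_inited", False):
        raise RuntimeError("repeated deepspeed.initialize on the same model")
    model.ds_is_inited = True
    return model

# Before: reusing one instance across two rounds now raises the guard error.
# After: constructing a fresh model per round passes.
engine1 = checked_initialize(SimpleModel())
engine2 = checked_initialize(SimpleModel())  # new object, no flag set yet
```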

If there are any additional failures beyond these 9, please let me know.

traincheck-team avatar Feb 09 '25 06:02 traincheck-team