Safeguarding against usage of uninitialized DistributedManager

Open akshaysubr opened this issue 1 year ago • 6 comments

Modulus Pull Request

Description

closes #474

Should be merged in after #469

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.
  • [x] The CHANGELOG.md is up to date with these changes.
  • [x] An issue is linked to this pull request.

Dependencies

None

akshaysubr avatar Apr 25 '24 06:04 akshaysubr

/blossom-ci

akshaysubr avatar Apr 25 '24 22:04 akshaysubr

/blossom-ci

mnabian avatar Apr 27 '24 00:04 mnabian

/blossom-ci

akshaysubr avatar Apr 29 '24 20:04 akshaysubr

/blossom-ci

akshaysubr avatar May 01 '24 05:05 akshaysubr

What it does makes sense. Is this fix meant to safeguard against the case where the DM is initialized but an uninitialized distributed group is requested?

azrael417 avatar May 02 '24 06:05 azrael417

@azrael417 Not quite. This PR safeguards against using the manager before DistributedManager.initialize() has been called. There was a bug in CorrDiff where this happened silently, causing a multi-GPU job to behave like independent single-GPU jobs, since that is the default.
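To illustrate the failure mode and the kind of guard being discussed, here is a minimal sketch using a toy class. `ToyDistributedManager` and `UninitializedManagerError` are hypothetical names invented for this example; this is not the Modulus implementation, only an assumption about what such a safeguard could look like: constructing the manager raises unless `initialize()` has been called first, instead of silently falling back to single-process defaults.

```python
class UninitializedManagerError(RuntimeError):
    """Raised when the toy manager is used before initialize()."""


class ToyDistributedManager:
    # Hypothetical stand-in for illustration only; not the Modulus class.
    # Class-level state is shared by every instance, singleton-style.
    _initialized = False
    _rank = 0        # single-process default
    _world_size = 1  # single-process default

    def __init__(self):
        # The safeguard: refuse to hand out a manager that would silently
        # report the single-process defaults above.
        if not type(self)._initialized:
            raise UninitializedManagerError(
                "ToyDistributedManager.initialize() must be called before use"
            )

    @classmethod
    def initialize(cls, rank=0, world_size=1):
        # A real manager would read these from the launcher environment;
        # here they are passed in explicitly to keep the sketch standalone.
        cls._rank = rank
        cls._world_size = world_size
        cls._initialized = True

    @property
    def rank(self):
        return type(self)._rank

    @property
    def world_size(self):
        return type(self)._world_size


# Using the manager too early now fails loudly instead of silently.
try:
    ToyDistributedManager()
except UninitializedManagerError as err:
    print(f"caught: {err}")

# After initialization, instances report the configured values.
ToyDistributedManager.initialize(rank=1, world_size=4)
dm = ToyDistributedManager()
print(dm.rank, dm.world_size)
```

Without the check in `__init__`, the early construction would succeed and report rank 0 of 1 on every process, which matches the "independent single-GPU jobs" symptom described above.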

akshaysubr avatar May 08 '24 17:05 akshaysubr

/blossom-ci

akshaysubr avatar May 29 '24 00:05 akshaysubr

/blossom-ci

akshaysubr avatar May 29 '24 03:05 akshaysubr