
Expose all torch.distributed.init_process_group parameters in the DistributedManager

Open akshaysubr opened this issue 1 year ago • 4 comments

Modulus Pull Request

Description

Adds kwargs to DistributedManager.initialize that are passed down to torch.distributed.init_process_group. Also adds a test that specifically checks that the timeout parameter gets passed down to torch.
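A minimal sketch of the forwarding pattern described here, not the actual PR code. The `initialize` function below is a hypothetical stand-in for DistributedManager.initialize, and the process-group initializer is injected as an argument (an assumption made so the sketch runs without torch or a real process group). A mock records the call so the forwarded `timeout` can be inspected, roughly the kind of check the description mentions.

```python
from datetime import timedelta
from unittest import mock


def initialize(init_process_group, rank, world_size, **kwargs):
    """Hypothetical stand-in for DistributedManager.initialize.

    `init_process_group` is injected so the sketch runs without torch;
    real code would call torch.distributed.init_process_group instead.
    """
    # Extra keyword arguments are forwarded unchanged.
    init_process_group(rank=rank, world_size=world_size, **kwargs)


# The mock records how it was called so the forwarded values can be checked.
fake_init = mock.Mock()
initialize(fake_init, rank=0, world_size=1, timeout=timedelta(minutes=5))

forwarded = fake_init.call_args.kwargs
print(forwarded["timeout"])
```

In a real test, the same idea would likely be applied by patching torch.distributed.init_process_group rather than injecting a callable.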

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.
  • [x] The CHANGELOG.md is up to date with these changes.
  • [ ] An issue is linked to this pull request.

Dependencies

None

akshaysubr avatar Oct 12 '24 00:10 akshaysubr

How does it behave if you pass a kwarg which has already been passed explicitly, for example rank or world_size? Will that overwrite the previous one?

azrael417 avatar Oct 14 '24 05:10 azrael417

That's a good point. Maybe we should pop the explicitly specified kwargs out before passing them down?

akshaysubr avatar Oct 16 '24 15:10 akshaysubr
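The exchange above can be illustrated with a small sketch. It assumes the explicit values and the user kwargs are expanded into the same call; whether the PR does exactly this is not confirmed in the thread. Under that assumption a duplicate key is not overwritten: Python raises a TypeError. `init_process_group` here is a simplified stand-in, not the torch function, and `initialize_naive` and `initialize_popping` are hypothetical names.

```python
def init_process_group(backend=None, rank=-1, world_size=-1, **kwargs):
    """Simplified stand-in for torch.distributed.init_process_group."""
    return {"backend": backend, "rank": rank, "world_size": world_size, **kwargs}


def initialize_naive(**kwargs):
    # Assumed to be determined internally, e.g. from the environment.
    rank, world_size = 0, 1
    # A user-supplied "rank" in kwargs collides with the explicit one.
    return init_process_group(rank=rank, world_size=world_size, **kwargs)


def initialize_popping(**kwargs):
    rank, world_size = 0, 1
    # Drop keys that are set explicitly so they cannot collide.
    for key in ("rank", "world_size"):
        kwargs.pop(key, None)
    return init_process_group(rank=rank, world_size=world_size, **kwargs)


try:
    initialize_naive(rank=3)
except TypeError as err:
    print("naive:", err)

print("popping:", initialize_popping(rank=3, timeout=30))
```

Popping silently discards the user's value, so an alternative design would be to raise a clear error when a reserved key is passed.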

@akshaysubr do we want to merge this PR before the release?

mnabian avatar Nov 06 '24 01:11 mnabian

@mnabian Yes, we should merge this before the release. I think this is a fairly low-risk PR, but it exposes certain mechanisms for more advanced usage. I think we can merge this as is and add other functionality that comes up in subsequent PRs.

akshaysubr avatar Nov 09 '24 01:11 akshaysubr