Expose all torch.distributed.init_process_group parameters in the DistributedManager

Open akshaysubr opened this issue 1 year ago • 4 comments

Modulus Pull Request

Description

Adds kwargs to DistributedManager.initialize that are passed down to torch.distributed.init_process_group. Also adds a test that specifically checks that the timeout parameter gets passed down to torch.
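
A minimal sketch of the pass-through idea, not the actual Modulus code. `init_process_group_stub` is a hypothetical stand-in for torch.distributed.init_process_group so the snippet runs without torch or a process group, and `initialize` here is a simplified, assumed signature rather than the real DistributedManager.initialize.

```python
from datetime import timedelta


def init_process_group_stub(backend=None, init_method=None, timeout=None,
                            world_size=-1, rank=-1, **other):
    """Hypothetical stand-in for torch.distributed.init_process_group.

    It only records what it was called with so the forwarding can be inspected.
    """
    return {
        "backend": backend,
        "init_method": init_method,
        "timeout": timeout,
        "world_size": world_size,
        "rank": rank,
        **other,
    }


def initialize(rank, world_size, backend="nccl",
               init_fn=init_process_group_stub, **kwargs):
    """Assumed, simplified initializer: forwards any extra kwargs unchanged."""
    return init_fn(backend=backend, rank=rank, world_size=world_size, **kwargs)


# The extra `timeout` keyword reaches the underlying init function untouched.
received = initialize(rank=0, world_size=1, timeout=timedelta(seconds=30))
print(received["timeout"])
```

A test along the lines described above would assert that the `timeout` seen by the underlying call equals the one given to `initialize`.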

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[x] The documentation is up to date with these changes.
[x] The CHANGELOG.md is up to date with these changes.
[ ] An issue is linked to this pull request.

Dependencies

None

Oct 12 '24 00:10 akshaysubr

How does it behave if you pass a kwarg which has already been passed explicitly, for example rank or world_size? Will that overwrite the previous one?

Oct 14 '24 05:10 azrael417

That's a good point. Maybe we should pop the explicitly specified kwargs out before passing them down?
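
A rough sketch of what popping could look like, under assumptions: `init_stub`, `initialize_naive`, `initialize_filtered`, and `RESERVED_KEYS` are hypothetical names for illustration, not the code in this PR. In plain Python, supplying the same keyword both explicitly and through `**` raises a TypeError at the call site rather than overwriting the earlier value, so the filtered variant drops the reserved keys first.

```python
# Hypothetical set of keys the manager sets explicitly itself.
RESERVED_KEYS = ("rank", "world_size")


def init_stub(**received):
    """Hypothetical stand-in for the underlying init call; echoes its kwargs."""
    return received


def initialize_naive(rank, world_size, extra):
    # If `extra` contains "rank" or "world_size", this call raises TypeError
    # ("got multiple values for keyword argument ...").
    return init_stub(rank=rank, world_size=world_size, **extra)


def initialize_filtered(rank, world_size, extra):
    # Copy so the caller's dict is not mutated, then drop the reserved keys
    # so the explicitly determined values always win.
    extra = dict(extra)
    for key in RESERVED_KEYS:
        extra.pop(key, None)
    return init_stub(rank=rank, world_size=world_size, **extra)


user_kwargs = {"rank": 5, "timeout": 30}

try:
    initialize_naive(0, 1, user_kwargs)
    naive_raised = False
except TypeError:
    naive_raised = True

filtered = initialize_filtered(0, 1, user_kwargs)
print(naive_raised, filtered)
```

Silently dropping a user-supplied key can hide mistakes, so raising a clear error or emitting a warning instead of popping is another option worth considering.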

Oct 16 '24 15:10 akshaysubr

@akshaysubr do we want to merge this PR before the release?

Nov 06 '24 01:11 mnabian

@mnabian Yes, we should merge this before the release. I think this is a fairly low-risk PR, and it exposes certain mechanisms for more advanced usage. I think we can merge it as is and add other functionality that comes up in subsequent PRs.

Nov 09 '24 01:11 akshaysubr