[RFC] Implement gloo abort for graceful shutdown

Open Aidyn-A opened this issue 1 year ago • 10 comments

In https://github.com/pytorch/pytorch/issues/130345 it was requested to implement ProcessGroupGloo.shutdown() for faster recovery from distributed rank failures. This PR is a first step toward a proper shutdown. The second step would be using gloo::abort() within PyTorch's ProcessGroupGloo.

Aidyn-A avatar Sep 25 '24 19:09 Aidyn-A

Sorry for the delay. Are you able to add a test for this change?

c-p-i-o avatar Nov 01 '24 21:11 c-p-i-o

Ignore the CI breakage for now. I'm trying to revive the CI for this repository.

c-p-i-o avatar Nov 01 '24 21:11 c-p-i-o

> Sorry for the delay. Are you able to add a test for this change?

Sure, I will add a test and resolve the merge conflicts soon.

Aidyn-A avatar Nov 05 '24 09:11 Aidyn-A

Hey @c-p-i-o how does the PR look to you? Do you think it is ready to merge? Please let me know if you have any comments.

Aidyn-A avatar Nov 15 '24 14:11 Aidyn-A

@c-p-i-o has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Dec 03 '24 18:12 facebook-github-bot

Hey @c-p-i-o can you please let me know what tests are failing? Also what kind of linter is used? Would just clang-format be enough to resolve lint errors?

Aidyn-A avatar Dec 05 '24 14:12 Aidyn-A

> Hey @c-p-i-o can you please let me know what tests are failing? Also what kind of linter is used? Would just clang-format be enough to resolve lint errors?

Sorry for the delay here.

  1. CLANGFORMAT errors.
  2. Some internal CI failed on this diff. Re-running internal CI and will report back.

c-p-i-o avatar Jan 07 '25 23:01 c-p-i-o

Grr. Some failures are internal to Meta when they try to build this change. [Screenshot 2025-01-07 at 4 49 54 PM] Let me see if I can address these on the internal side.

c-p-i-o avatar Jan 08 '25 00:01 c-p-i-o

> Grr. Some failures are internal to Meta when they try to build this change. [Screenshot 2025-01-07 at 4 49 54 PM] Let me see if I can address these on the internal side.

I do not see any relation to the current PR in this message. How do I reproduce it locally?

Aidyn-A avatar Jan 17 '25 14:01 Aidyn-A

I ended up on this PR while reviewing some nvidia framework docs.

I am wondering what is blocking this, and whether the previously mentioned issues still exist and are blockers here.

pramodk avatar Jul 21 '25 11:07 pramodk