[RFC] Implement gloo abort for graceful shutdown
In https://github.com/pytorch/pytorch/issues/130345 it was requested to implement a ProcessGroupGloo.shutdown() for faster recovery from distributed rank failures. This PR is a first step toward accomplishing a proper shutdown. The second step would be implementing gloo::abort() within PyTorch's ProcessGroupGloo.
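To illustrate the motivation, here is a minimal stdlib-only sketch of the pattern the proposed abort would enable. It does not use PyTorch or gloo; `FakeProcessGroup`, `allreduce_wait`, and `abort` are hypothetical stand-ins. The point is that a rank blocked in a collective normally hangs until the timeout when a peer dies, while an abort lets a watchdog wake it immediately so the job can tear down and restart faster.

```python
import threading

class FakeProcessGroup:
    """Hypothetical stand-in for ProcessGroupGloo; all names are illustrative."""
    def __init__(self):
        self._event = threading.Event()
        self.aborted = False

    def allreduce_wait(self, timeout=None):
        # Blocks, as a real collective would, until peers arrive or abort() fires.
        self._event.wait(timeout)
        if self.aborted:
            raise RuntimeError("process group aborted")

    def abort(self):
        # Sketch of the proposed behavior: unblock any pending waits
        # and mark the group as dead so callers can fail fast.
        self.aborted = True
        self._event.set()

pg = FakeProcessGroup()
# A watchdog detects a failed peer and aborts instead of waiting out the timeout.
threading.Timer(0.1, pg.abort).start()
try:
    pg.allreduce_wait(timeout=30)  # without abort, this would block ~30 s
except RuntimeError as e:
    print(f"recovered early: {e}")
```

This is only a sketch of the recovery pattern under the assumptions above, not the actual gloo or ProcessGroupGloo implementation.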
Sorry for the delay. Are you able to add a test for this change?
Ignore the CI breakage for now. I'm trying to revive the CI for this repository.
> Sorry for the delay. Are you able to add a test for this change?
Sure, I will add a test and resolve the merge conflicts soon.
Hey @c-p-i-o how does the PR look to you? Do you think it is ready to merge? Please let me know if you have any comments.
@c-p-i-o has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hey @c-p-i-o can you please let me know what tests are failing? Also what kind of linter is used? Would just clang-format be enough to resolve lint errors?
Sorry for the delay here.
- clang-format errors.
- Some internal CI failed on this diff. Re-running internal CI and will report back.
Grr. Some failures are internal to Meta when they try to build this change.
Let me see if I can address these on the internal side.
I do not see how this message relates to the current PR. How do I reproduce it locally?
I ended up on this PR while reviewing some NVIDIA framework docs.
I am wondering what is blocking this, and whether the previously mentioned issues still exist and remain a blocker here.