[BE] split seq_id to collective_seq_id and p2p_seq_id
Stack from ghstack (oldest at bottom):
- #126581
- -> #125727
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync.
Resolves issue: https://github.com/pytorch/pytorch/issues/125173
Test Plan: Unit tests.
Reviewers:
Subscribers:
Tasks:
Tags:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125727
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:white_check_mark: No Failures
As of commit 8a9f70abf24d15edee74ca2107fdf6ecbabe4a5d with merge base cd3a71f754a2248bcfe500de7c9860bd7d2002bf ():
:green_heart: Looks good so far! There are no failures yet. :green_heart:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
this mostly lgtm, i didn't review carefully
one thought is whether we want to (correctly) bump version major, or, just keep old seq_id and add new ones and later deprecate it.
I think its worth checking if anyone's gonna be broken by this, but since not many users have developed scripts yet its probably fine to just bump the major and drop the old key so we have less baggage. anyone else have a thought on that?
one thought is whether we want to (correctly) bump version major, or, just keep old seq_id and add new ones and later deprecate it.
@wconstab - do you have a preference? Shall we keep the old seq_id?
The only one thing I'd have to fix up is the unit tests and keep the old field. But just wanted to make sure if that work is worth it.
How do I find out I won't break anyone?
I'd just bump the major version number. We are still iterating on what needs to go into this and the fix for any manual scripts is easy.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX TeamAdvanced Debugging
Check the merge workflow status
here
Merge failed
Reason: New commits were pushed while merging. Please rerun the merge command.
Details for Dev Infra team
Raised by workflow job
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX TeamAdvanced Debugging
Check the merge workflow status
here