pytorch icon indicating copy to clipboard operation
pytorch copied to clipboard

[BE] split seq_id to collective_seq_id and p2p_seq_id

Open c-p-i-o opened this issue 1 year ago • 4 comments

Stack from ghstack (oldest at bottom):

  • #126581
  • -> #125727

Summary: Split out seq_id into collective_seq_id and p2p_seq_id. The main idea here is that collectives that go to all machines should have identical collective_seq_id and therefore it makes it easier to spot if one of machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync.

Resolves issue: https://github.com/pytorch/pytorch/issues/125173

Test Plan: Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

c-p-i-o avatar May 08 '24 00:05 c-p-i-o

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125727

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 8a9f70abf24d15edee74ca2107fdf6ecbabe4a5d with merge base cd3a71f754a2248bcfe500de7c9860bd7d2002bf (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 08 '24 00:05 pytorch-bot[bot]

this mostly lgtm, i didn't review carefully

one thought is whether we want to (correctly) bump version major, or, just keep old seq_id and add new ones and later deprecate it.

I think its worth checking if anyone's gonna be broken by this, but since not many users have developed scripts yet its probably fine to just bump the major and drop the old key so we have less baggage. anyone else have a thought on that?

wconstab avatar May 08 '24 20:05 wconstab

one thought is whether we want to (correctly) bump version major, or, just keep old seq_id and add new ones and later deprecate it.

@wconstab - do you have a preference? Shall we keep the old seq_id? The only one thing I'd have to fix up is the unit tests and keep the old field. But just wanted to make sure if that work is worth it. How do I find out I won't break anyone?

c-p-i-o avatar May 13 '24 23:05 c-p-i-o

I'd just bump the major version number. We are still iterating on what needs to go into this and the fix for any manual scripts is easy.

zdevito avatar May 14 '24 18:05 zdevito

@pytorchbot merge

c-p-i-o avatar May 20 '24 23:05 c-p-i-o

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging Check the merge workflow status here

pytorchmergebot avatar May 20 '24 23:05 pytorchmergebot

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team Raised by workflow job

pytorchmergebot avatar May 21 '24 00:05 pytorchmergebot

@pytorchbot merge

c-p-i-o avatar May 21 '24 00:05 c-p-i-o

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging Check the merge workflow status here

pytorchmergebot avatar May 21 '24 00:05 pytorchmergebot