[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed

Open yf225 opened this issue 1 year ago • 20 comments

test_replicate_with_compiler.py and test_fully_shard_compile.py require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM86) GPUs and has bf16 support).
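
To illustrate the hardware constraint, here is a minimal sketch of how a test could be gated on bf16 support. It assumes that native bf16 needs CUDA compute capability 8.0 or newer (a T4 reports 7.5). The helper names `supports_bf16` and `current_capability` and the class `Bf16GatedTest` are hypothetical and are not taken from the test files named above.

```python
import unittest

# Assumption: native bf16 needs compute capability 8.0 (SM80) or newer.
MIN_BF16_CAPABILITY = (8, 0)


def supports_bf16(capability):
    """Return True if a (major, minor) compute capability meets the bf16 minimum."""
    return tuple(capability) >= MIN_BF16_CAPABILITY


def current_capability():
    """Best-effort query of the local GPU; returns None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


_CAPABILITY = current_capability()


# Hypothetical test class: skipped entirely on runners without a bf16-capable GPU.
@unittest.skipUnless(
    _CAPABILITY is not None and supports_bf16(_CAPABILITY),
    "requires a GPU with bf16 support",
)
class Bf16GatedTest(unittest.TestCase):
    def test_capability_is_sufficient(self):
        self.assertTrue(supports_bf16(_CAPABILITY))
```

With this gate, `supports_bf16((7, 5))` is false, which is why a T4 runner would skip the test rather than run it.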

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Stack from ghstack (oldest at bottom):

  • #137763
  • #135273
  • #137161
  • -> #138178

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

yf225 avatar Oct 17 '24 06:10 yf225

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138178

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 79212878d98c44902593d5be7a327b8bce28c07a with merge base 1f349eed61e21787611e2f1830581d230ceefd6b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Oct 17 '24 06:10 pytorch-bot[bot]

@pytorchbot merge

yf225 avatar Oct 17 '24 08:10 yf225

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging Check the merge workflow status here

pytorchmergebot avatar Oct 17 '24 08:10 pytorchmergebot

@pytorchbot revert -m 'Sorry for reverting your change, but the new tests are failing inductor distributed jobs' -c nosignal

distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_compile_backward_only GH job link HUD commit link

Let me add ciflow/inductor to the PR to get that signal.

huydhn avatar Oct 17 '24 17:10 huydhn

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Oct 17 '24 17:10 pytorchmergebot

@yf225 your PR has been successfully reverted.

pytorchmergebot avatar Oct 17 '24 17:10 pytorchmergebot

Depends on https://github.com/pytorch/pytorch/pull/138174

yf225 avatar Oct 17 '24 19:10 yf225

@pytorchbot merge

kwen2501 avatar Oct 18 '24 00:10 kwen2501

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Oct 18 '24 00:10 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

pytorchmergebot avatar Oct 18 '24 01:10 pytorchmergebot

@pytorchbot merge

yf225 avatar Oct 18 '24 01:10 yf225

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Oct 18 '24 01:10 pytorchmergebot

@pytorchbot merge -f "skipping the only stuck CI job inductor_torchbench_smoketest_perf which is not testing this path"

yf225 avatar Oct 18 '24 04:10 yf225

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command. For more information, see the pytorch-bot wiki.

pytorchmergebot avatar Oct 18 '24 04:10 pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

pytorchmergebot avatar Oct 18 '24 04:10 pytorchmergebot

@pytorchbot revert -m 'because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too'

yf225 avatar Oct 18 '24 17:10 yf225

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.
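
The failure above is what an argument parser reports when a required option is missing. As a minimal sketch, the following reproduces that behaviour with Python's `argparse`. It is inferred only from the usage message shown here and is not the bot's actual implementation; the `--message` long option and the helper names `build_revert_parser` and `parse_revert` are assumptions.

```python
import argparse

# Classification values copied from the usage message above.
CLASSIFICATIONS = ["nosignal", "ignoredsignal", "landrace", "weird", "ghfirst"]


def build_revert_parser():
    """Hypothetical parser mirroring `@pytorchbot revert -m MESSAGE -c {...}`."""
    parser = argparse.ArgumentParser(prog="@pytorchbot revert", add_help=False)
    # Assumption: the long form of -m is --message.
    parser.add_argument("-m", "--message", required=True)
    parser.add_argument(
        "-c", "--classification", required=True, choices=CLASSIFICATIONS
    )
    return parser


def parse_revert(args):
    """Return the parsed namespace, or None if argparse rejects the command."""
    try:
        return build_revert_parser().parse_args(args)
    except SystemExit:
        # argparse prints the usage and error to stderr, then exits.
        return None


# Missing -c: rejected, as in the failed command above.
rejected = parse_revert(["-m", "needs revert"])
# With -c weird: accepted, as in the retried command.
accepted = parse_revert(["-m", "needs revert", "-c", "weird"])
```

Here `rejected` is `None` because `-c/--classification` is required, while `accepted.classification` is `"weird"`.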

pytorch-bot[bot] avatar Oct 18 '24 17:10 pytorch-bot[bot]

@pytorchbot revert -m 'because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too' -c weird

yf225 avatar Oct 18 '24 17:10 yf225

@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot avatar Oct 18 '24 17:10 pytorchmergebot

@yf225 your PR has been successfully reverted.

pytorchmergebot avatar Oct 18 '24 17:10 pytorchmergebot

@pytorchbot merge

yf225 avatar Oct 20 '24 04:10 yf225

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Oct 20 '24 04:10 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)

Details for Dev Infra team Raised by workflow job

pytorchmergebot avatar Oct 20 '24 04:10 pytorchmergebot

@pytorchbot merge

yf225 avatar Oct 20 '24 06:10 yf225

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Oct 20 '24 06:10 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)

Details for Dev Infra team Raised by workflow job

pytorchmergebot avatar Oct 20 '24 07:10 pytorchmergebot

@pytorchbot merge

yf225 avatar Oct 20 '24 08:10 yf225

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot avatar Oct 20 '24 08:10 pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu)

Details for Dev Infra team Raised by workflow job

pytorchmergebot avatar Oct 20 '24 09:10 pytorchmergebot

@pytorchbot merge -f "fixed the failing test, other tests are confirmed working by previous CI runs"

yf225 avatar Oct 20 '24 19:10 yf225