pytorch / pytorch
[CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed
test_replicate_with_compiler.py and test_fully_shard_compile.py require bf16, so they need to run within the test_inductor_distributed job (which uses A10G (SM80) machines and therefore supports bf16).
This unblocks migrating the remaining distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are now the only ones blocking that migration.
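The bf16 constraint above comes down to a GPU compute-capability check: bf16 needs Ampere (SM80) or newer, so an A10G (SM80) qualifies while a T4 (SM75) does not. A minimal sketch of that gate (hypothetical helper, not the actual CI code — in practice PyTorch tests can use `torch.cuda.is_bf16_supported()`):

```python
# Hypothetical capability gate illustrating why these tests need A10G (SM80)
# machines rather than T4 (SM75) ones: bf16 requires Ampere or newer.

def supports_bf16(compute_capability: tuple[int, int]) -> bool:
    """Return True if a CUDA (major, minor) compute capability supports bf16."""
    major, _minor = compute_capability
    return major >= 8  # Ampere (SM80) and newer

# A10G is SM80 -> bf16 tests can run; T4 is SM75 -> they must be skipped.
print(supports_bf16((8, 0)))  # True
print(supports_bf16((7, 5)))  # False
```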
Stack from ghstack (oldest at bottom):
- #137763
- #135273
- #137161
- -> #138178
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/138178
:x: 1 New Failure
As of commit 79212878d98c44902593d5be7a327b8bce28c07a with merge base 1f349eed61e21787611e2f1830581d230ceefd6b:
NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Advanced Debugging
Check the merge workflow status
here
@pytorchbot revert -m 'Sorry for reverting your change, but the new tests are failing inductor distributed jobs' -c nosignal
distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_compile_backward_only
Let me add ciflow/inductor to the PR to get those signals.
@pytorchbot successfully started a revert job. Check the current status here. Questions? Feedback? Please reach out to the PyTorch DevX Team
@yf225 your PR has been successfully reverted.
Depends on https://github.com/pytorch/pytorch/pull/138174
@pytorchbot merge
Merge started
Merge failed
Reason: 1 job has failed: inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu)
Details for Dev Infra team
Raised by workflow job
@pytorchbot merge
Merge started
@pytorchbot merge -f "skipping the only stuck CI job inductor_torchbench_smoketest_perf which is not testing this path"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command. For more information, see the pytorch-bot wiki.
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
@pytorchbot revert -m 'because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too'
❌ 🤖 pytorchbot command failed:
@pytorchbot revert: error: the following arguments are required: -c/--classification
usage: @pytorchbot revert -m MESSAGE -c
{nosignal,ignoredsignal,landrace,weird,ghfirst}
Try @pytorchbot --help for more info.
@pytorchbot revert -m 'because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too' -c weird
@pytorchbot successfully started a revert job.
@yf225 your PR has been successfully reverted.
@pytorchbot merge
Merge started
Merge failed
Reason: 1 job has failed: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)
@pytorchbot merge
Merge started
Merge failed
Reason: 1 job has failed: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100)
@pytorchbot merge
Merge started
Merge failed
Reason: 1 job has failed: periodic / linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu)
@pytorchbot merge -f "fixed the failing test, other tests are confirmed working by previous CI runs"