S390x ci periodic tests
Periodically run the test suite on s390x.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125401
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:x: 3 New Failures, 1 Unrelated Failure
As of commit e3474d4bd72e20dda8a59a64522837b09db540b8 with merge base 73c10a04f635c00c6a198763d5e498b5f256f15d:
NEW FAILURES - The following jobs have failed:
- periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1) (gh)
- pull / linux-focal-py3.12-clang10 / test (dynamo, 2, 3, lf.linux.2xlarge) (gh)
  test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False
- pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 2, 3, linux.2xlarge) (gh)
  test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Please seek CI approval before scheduling CIFlow labels
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased s390x_ci_periodic_tests onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout s390x_ci_periodic_tests && git pull --rebase)
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.
this one's for dev infra, I fear...
@malfet @huydhn, could you please take a look at this PR? And if it gets merged, where can the results of the new workflow's periodic tests be seen?
The build took 60 minutes. The tests, split into 10 shards, took 150, 110, 65, 55, 115, 85, 50, 70, 105, and 110 minutes. There are currently 20 runners, used for everything: nightly binaries, CI runs, and possibly these tests as well.
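To put those timings in terms of runner capacity, here is a rough sketch that totals them up. It assumes (this is not stated in the thread) that each shard occupies one runner for its full duration and that all shards start together once the build finishes. The function names are made up for illustration.

```python
# Timings quoted in the comment above, in minutes.
BUILD_MINUTES = 60
SHARD_MINUTES = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]
TOTAL_RUNNERS = 20  # shared with nightly binaries and regular CI runs


def runner_minutes(build, shards):
    """Total runner time consumed by one periodic run (build plus all shards)."""
    return build + sum(shards)


def wall_clock_minutes(build, shards):
    """Elapsed time for one run.

    Assumption: every shard starts right after the build and runs
    concurrently on its own runner, so the slowest shard dominates.
    """
    return build + max(shards)


def peak_pool_fraction(shards, total_runners):
    """Share of the runner pool busy while all shards run at once."""
    return len(shards) / total_runners


if __name__ == "__main__":
    print("runner-minutes per run:", runner_minutes(BUILD_MINUTES, SHARD_MINUTES))
    print("wall-clock minutes per run:", wall_clock_minutes(BUILD_MINUTES, SHARD_MINUTES))
    print("peak share of pool:", peak_pool_fraction(SHARD_MINUTES, TOTAL_RUNNERS))
```

Under those assumptions, one periodic run costs 975 runner-minutes (about 16 runner-hours), finishes in roughly 210 minutes, and occupies half of the 20 runners while all shards are active. If fewer runners are free, the shards would queue and the wall-clock time would be longer.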
As for _linux-build.yml vs _linux-build-s390x.yml, the main difference between them is that the s390x version doesn't have the "docker-image" output.
https://github.com/pytorch/pytorch/blob/main/.github/workflows/_linux-build.yml#L133-L151
It's skipped on s390x because that calculation requires AWS access, which s390x runners don't have. Here's where AWS is accessed: https://github.com/pytorch/test-infra/blob/main/.github/actions/calculate-docker-image/action.yml#L94-L102
In addition, since s390x runners don't have AWS access, data is passed between the build and test workers via GHA. I think it was mentioned that using AWS is preferable. Would that be a problem in this case?
I have not yet managed to skip this step on s390x without failing the build. I can't find the logs now, but I can try to reproduce it.
As for _linux-test.yml vs _linux-test-s390x.yml, there are not many differences; I will try porting all the important parts to the original version.
I'm getting similar failures related to numpy 2.0.0rc1 here: https://github.com/pytorch/pytorch/pull/134952 https://github.com/pytorch/pytorch/actions/runs/10827942887/job/30042209373?pr=134952
So I guess it's not related to my changes.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased s390x_ci_periodic_tests onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout s390x_ci_periodic_tests && git pull --rebase)
Answer to the capacity question: https://github.com/pytorch/pytorch/pull/125399#issuecomment-2345746062