S390x ci periodic tests

Open AlekseiNikiforovIBM opened this issue 9 months ago • 5 comments

Periodically run testsuite for s390x

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

AlekseiNikiforovIBM avatar May 02 '24 15:05 AlekseiNikiforovIBM

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125401

Note: Links to docs will display an error until the docs builds have been completed.

:x: 3 New Failures, 1 Unrelated Failure

As of commit e3474d4bd72e20dda8a59a64522837b09db540b8 with merge base 73c10a04f635c00c6a198763d5e498b5f256f15d (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 02 '24 15:05 pytorch-bot[bot]

Please seek CI approval before scheduling CIFlow labels

pytorch-bot[bot] avatar May 03 '24 17:05 pytorch-bot[bot]

@pytorchbot rebase

AlekseiNikiforovIBM avatar May 14 '24 08:05 AlekseiNikiforovIBM

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar May 14 '24 08:05 pytorchmergebot

Successfully rebased s390x_ci_periodic_tests onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout s390x_ci_periodic_tests && git pull --rebase)

pytorchmergebot avatar May 14 '24 08:05 pytorchmergebot

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions[bot] avatar Jul 29 '24 15:07 github-actions[bot]

this one's for dev infra, I fear...

ezyang avatar Aug 05 '24 20:08 ezyang

@malfet @huydhn, could you please take a look at this PR? And if it gets merged, where can the results of the new workflow's periodic tests be seen?

AlekseiNikiforovIBM avatar Sep 02 '24 14:09 AlekseiNikiforovIBM

The build took 60 minutes. The tests took 150, 110, 65, 55, 115, 85, 50, 70, 105, and 110 minutes (split into 10 shards).

There are currently 20 runners, used for everything: nightly binaries, CI runs, and possibly these tests as well.
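To put those timings in terms of runner capacity, here is a rough sketch of the arithmetic. The function names are made up for illustration, and the wall-clock figure assumes all 10 shards start in parallel as soon as the build finishes, which may not hold if runners are busy with other jobs.

```python
# Timings reported above, in minutes.
BUILD_MINUTES = 60
SHARD_MINUTES = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]


def runner_minutes(build, shards):
    """Total runner time consumed by one periodic run (build plus all shards)."""
    return build + sum(shards)


def wall_clock_minutes(build, shards):
    """Elapsed time for one run, ASSUMING every shard starts right after the build."""
    return build + max(shards)


if __name__ == "__main__":
    total = runner_minutes(BUILD_MINUTES, SHARD_MINUTES)
    elapsed = wall_clock_minutes(BUILD_MINUTES, SHARD_MINUTES)
    print(f"runner time per run: {total} min ({total / 60:.2f} h)")
    print(f"wall clock per run (parallel shards): {elapsed} min")
```

Under that assumption, one run costs 975 runner-minutes in total and finishes in about 210 minutes, occupying up to 10 of the 20 runners during the test phase.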

As for _linux-build.yml vs _linux-build-s390x.yml, the main difference between them is that the s390x version doesn't have the "docker-image" output.

https://github.com/pytorch/pytorch/blob/main/.github/workflows/_linux-build.yml#L133-L151

It's skipped on s390x because that calculation requires AWS access, which s390x runners don't have. This is where AWS is accessed: https://github.com/pytorch/test-infra/blob/main/.github/actions/calculate-docker-image/action.yml#L94-L102

In addition, since s390x runners don't have AWS access, data is passed between build and test workers via GHA. I think it was mentioned that using AWS is preferable. Would that be a problem in this case?

I have not yet managed to skip this step on s390x without failing the build. I can't find the logs now, but I can try to reproduce it again.

As for _linux-test.yml vs _linux-test-s390x.yml, there are not many differences. I will try porting all the important parts to the original version.

AlekseiNikiforovIBM avatar Sep 02 '24 15:09 AlekseiNikiforovIBM

I'm getting similar failures related to numpy 2.0.0rc1 here: https://github.com/pytorch/pytorch/pull/134952 https://github.com/pytorch/pytorch/actions/runs/10827942887/job/30042209373?pr=134952

So I guess it's not related to my changes.

AlekseiNikiforovIBM avatar Sep 12 '24 12:09 AlekseiNikiforovIBM

@pytorchbot rebase

huydhn avatar Sep 12 '24 16:09 huydhn

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot avatar Sep 12 '24 16:09 pytorchmergebot

Successfully rebased s390x_ci_periodic_tests onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout s390x_ci_periodic_tests && git pull --rebase)

pytorchmergebot avatar Sep 12 '24 16:09 pytorchmergebot

The capacity question is answered here: https://github.com/pytorch/pytorch/pull/125399#issuecomment-2345746062

huydhn avatar Sep 20 '24 17:09 huydhn