[CI] Experiment with a newer CUDA driver
:link: Helpful Links
- :test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/96904
- :page_facing_up: Preview Python docs built from this PR
- :page_facing_up: Preview C++ docs built from this PR
- :question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours
Note: Links to docs will display an error until the docs builds have been completed.
:x: 2 Failures
As of commit 685ef74501ff308d2356a5cc19750193919aa921:
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Tried to rebase and push PR #96904, but it was already up to date
LGTM! Do you plan to land this eventually once all tests pass? I have been thinking about doing this for a while to see if it helps reduce flakiness on G5 runners with their crashing "No CUDA GPU" issues. We can have this running in trunk for a while and monitor it.
I plan to rebase a few more times to see if it solves the flaky accuracy issue in TIMM.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
https://hud.pytorch.org/pr/96904#12071246663 shows "RuntimeError: No CUDA GPUs are available"
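For anyone hitting that error locally, a minimal sanity check of the runner's CUDA setup might look like the sketch below (illustrative only; the function name `check_cuda_runner` and the script are hypothetical and not part of this PR or the CI):

```python
# Illustrative sanity check for a CI runner's CUDA setup (not part of this PR).
# Assumes a CUDA-enabled PyTorch build.
import torch

def check_cuda_runner():
    if not torch.cuda.is_available():
        # This is the state that later surfaces as
        # "RuntimeError: No CUDA GPUs are available" inside the tests.
        raise RuntimeError("No CUDA GPUs are available on this runner")
    print(f"visible GPUs: {torch.cuda.device_count()}")
    print(f"device 0: {torch.cuda.get_device_name(0)}")

if __name__ == "__main__":
    check_cuda_runner()
```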
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
I was hoping this could resolve the flaky accuracy failure on TIMM models, but dla102 still fails once in a while (https://github.com/pytorch/pytorch/actions/runs/4450740075/jobs/7817003045).
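For reference, one way to probe that kind of nondeterministic accuracy drift locally is to rerun the same model on fixed inputs and compare outputs. This is only a rough sketch, assuming `timm` is installed and a CUDA GPU is available; it is not the actual benchmark harness used in CI:

```python
# Rough repro sketch for nondeterministic output drift on dla102 (not the CI harness).
# Assumes `timm` is installed and a CUDA device is available.
import timm
import torch

def run_once(x):
    torch.manual_seed(0)  # identical random init on every run
    model = timm.create_model("dla102", pretrained=False).cuda().eval()
    with torch.no_grad():
        return model(x.cuda()).cpu()

x = torch.randn(2, 3, 224, 224)
baseline = run_once(x)
for i in range(5):
    diff = (run_once(x) - baseline).abs().max().item()
    print(f"run {i}: max abs diff vs baseline = {diff:.3e}")
```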
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)
@malfet @seemethere, I was hoping this driver upgrade could help fix the flaky test, but it doesn't. On the other hand, the new driver seems stable enough. Should I merge this PR, or will you do a more proper upgrade?
Let's get this merged. I want to see if it helps with the flaky driver crash on the runner. I'll monitor trunk to make sure it doesn't cause any issues.