pytorch [CI] Experiment with a newer CUDA driver

Stack from ghstack (oldest at bottom):

-> #96904

Mar 15 '23 23:03 desertfire

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/96904

:page_facing_up: Preview Python docs built from this PR
:page_facing_up: Preview C++ docs built from this PR
:question: Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

:x: 2 Failures

As of commit 685ef74501ff308d2356a5cc19750193919aa921:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Mar 15 '23 23:03 pytorch-bot[bot]

@pytorchbot rebase

Mar 16 '23 12:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 16 '23 13:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 16 '23 13:03 pytorchmergebot

@pytorchbot rebase

Mar 16 '23 16:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 16 '23 16:03 pytorchmergebot

Tried to rebase and push PR #96904, but it was already up to date

Mar 16 '23 16:03 pytorchmergebot

LGTM! Do you plan to commit this eventually once all tests pass? I have been thinking about doing this for a while to see if it helps reduce flakiness on the G5 runners, which have been crashing with "No CUDA GPU" issues. We can have this running in trunk for a while and monitor it.

I plan to rebase a few more times to see if it solves the flaky accuracy issue in TIMM.

Mar 16 '23 17:03 desertfire

@pytorchbot rebase

Mar 16 '23 18:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 16 '23 18:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 16 '23 18:03 pytorchmergebot

@pytorchbot rebase

Mar 16 '23 22:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 16 '23 22:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 16 '23 22:03 pytorchmergebot

@pytorchbot rebase

Mar 17 '23 04:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 17 '23 04:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 17 '23 04:03 pytorchmergebot

https://hud.pytorch.org/pr/96904#12071246663 shows "RuntimeError: No CUDA GPUs are available"

Mar 17 '23 12:03 desertfire

@pytorchbot rebase

Mar 17 '23 12:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 17 '23 12:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 17 '23 12:03 pytorchmergebot

@pytorchbot rebase

Mar 17 '23 18:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 17 '23 18:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 17 '23 19:03 pytorchmergebot

I was hoping this would resolve the flaky accuracy failure on TIMM models, but dla102 still fails once in a while (https://github.com/pytorch/pytorch/actions/runs/4450740075/jobs/7817003045).

Mar 18 '23 12:03 desertfire

@pytorchbot rebase

Mar 18 '23 12:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

Mar 18 '23 12:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

Mar 18 '23 12:03 pytorchmergebot

@malfet @seemethere, I was hoping this driver would help fix the flaky test, but it doesn't. On the other hand, the new driver seems stable enough. Should I merge this PR, or will you guys do a more proper upgrade?

Mar 20 '23 23:03 desertfire

Let's get this merged. I want to see if this helps with the flaky driver crash on the runner. I'll monitor trunk to make sure that it doesn't cause any issues.

Mar 23 '23 06:03 huydhn
