[CI] Experiment with a newer CUDA driver

Open desertfire opened this issue 1 year ago • 21 comments

Stack from ghstack (oldest at bottom):

  • -> #96904

desertfire avatar Mar 15 '23 23:03 desertfire

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/96904

Note: Links to docs will display an error until the docs builds have been completed.

:x: 2 Failures

As of commit 685ef74501ff308d2356a5cc19750193919aa921:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Mar 15 '23 23:03 pytorch-bot[bot]

@pytorchbot rebase

desertfire avatar Mar 16 '23 12:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 16 '23 13:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 16 '23 13:03 pytorchmergebot

@pytorchbot rebase

desertfire avatar Mar 16 '23 16:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 16 '23 16:03 pytorchmergebot

Tried to rebase and push PR #96904, but it was already up to date

pytorchmergebot avatar Mar 16 '23 16:03 pytorchmergebot

LGTM! Do you plan to commit this eventually once all tests pass? I have been thinking about doing this for a while to see if it helps reduce flakiness on G5 runners with their crashing "No CUDA GPU" issues. We can have this running in trunk for a while and monitor.

I plan to rebase a few more times to see if it solves the flaky accuracy issue in TIMM.

desertfire avatar Mar 16 '23 17:03 desertfire

@pytorchbot rebase

desertfire avatar Mar 16 '23 18:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 16 '23 18:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 16 '23 18:03 pytorchmergebot

@pytorchbot rebase

desertfire avatar Mar 16 '23 22:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 16 '23 22:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 16 '23 22:03 pytorchmergebot

@pytorchbot rebase

desertfire avatar Mar 17 '23 04:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 17 '23 04:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 17 '23 04:03 pytorchmergebot

https://hud.pytorch.org/pr/96904#12071246663 shows "RuntimeError: No CUDA GPUs are available"

desertfire avatar Mar 17 '23 12:03 desertfire

@pytorchbot rebase

desertfire avatar Mar 17 '23 12:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 17 '23 12:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 17 '23 12:03 pytorchmergebot

@pytorchbot rebase

desertfire avatar Mar 17 '23 18:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 17 '23 18:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 17 '23 19:03 pytorchmergebot

I was hoping this would resolve the flaky accuracy failure on TIMM models, but dla102 still fails once in a while (https://github.com/pytorch/pytorch/actions/runs/4450740075/jobs/7817003045).

desertfire avatar Mar 18 '23 12:03 desertfire

@pytorchbot rebase

desertfire avatar Mar 18 '23 12:03 desertfire

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot avatar Mar 18 '23 12:03 pytorchmergebot

Successfully rebased gh/desertfire/102/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96904)

pytorchmergebot avatar Mar 18 '23 12:03 pytorchmergebot

@malfet @seemethere, I was hoping this driver would help fix the flaky test, but it doesn't. On the other hand, the new driver seems stable enough. Should I merge this PR, or will you do a more proper upgrade?

desertfire avatar Mar 20 '23 23:03 desertfire

Let's get this merged. I want to see if this helps with the flaky driver crash on the runner. I'll monitor trunk to make sure that it doesn't cause any issues.

huydhn avatar Mar 23 '23 06:03 huydhn