
[RLlib] Bump torch versions and CUDA toolkit

Open ArturNiederfahrenhorst opened this issue 2 years ago • 2 comments

Why are these changes needed?

After running a series of tests, we have detected a throughput regression in PyTorch, primarily in our APPO release test. It is linked to CUDA toolkit versions != 10.x (i.e., CUDA toolkit 11.x and therefore cuDNN 8.x). Using the torch profiler, we tracked the regression down to two cuDNN operations that take significantly more time with CUDA toolkit 11.x and cuDNN 8.x.
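For reference, a minimal sketch of the kind of torch profiler run that surfaces this. The model, input shapes, and iteration count below are purely illustrative stand-ins, not the actual APPO/RLlib code:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative stand-in for the RLlib model; the real APPO networks differ.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 256),
).cuda()
inputs = torch.randn(1024, 256, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        model(inputs)

# Sorting by CUDA time makes slow cuDNN/CUDA kernels stand out in the table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```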

The release notes of the cuDNN 8.x.y subversions list a number of performance regressions relative to cuDNN 7.6.x as "known issues". It is possible that some of these vaguely formulated issues apply to our networks, and it is also possible that we are hitting undocumented issues with cuDNN. Torch does not offer deeper insight into CUDA or cuDNN than what the profiler output above shows, so we are limited to comparing our observed version numbers against the release notes.

My conclusion: on newer hardware, performance with CUDA toolkit 10.x/cuDNN 7.x and with 11.x/cuDNN 8.x is comparable (with CUDA toolkit 10.x and cuDNN 7.x still giving the best benchmark results). On older hardware, the legacy CUDA toolkit 10.x and cuDNN 7.x outperform by a wide margin. Without being able to profile every algorithm combination cuDNN selects, it seems we have to accept that the computations cuDNN 7.x performs on our specific networks are faster than those of 8.x, due to a regression in cuDNN's algorithm selection and/or execution time. As stated above, this effect varies with the GPU used.
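As a quick sanity check when comparing environments, the CUDA toolkit and cuDNN versions bundled with a given torch wheel can be read directly from torch (standard torch APIs; the commented values are only examples):

```python
import torch

# The versions reported here are the ones compiled into the torch wheel,
# which determine whether the cuDNN 7.x or 8.x code paths are used.
print("torch:", torch.__version__)
print("CUDA toolkit:", torch.version.cuda)        # e.g. "10.2" or "11.3"
print("cuDNN:", torch.backends.cudnn.version())   # e.g. 7605 or 8200
```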

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [x] Release tests
    • [ ] This PR is not tested :(

ArturNiederfahrenhorst · Aug 01 '22 14:08

This PR covers both places where a change would be effective:

  1. The Docker ML requirements, which affect all GPU tests.
  2. The app config, which only affects RLlib.

ArturNiederfahrenhorst · Aug 08 '22 13:08

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] · Sep 08 '22 17:09

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] · Oct 29 '22 12:10

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] · Nov 12 '22 18:11