
Decommission CUDA 11.3

Open atalman opened this issue 3 years ago • 3 comments

As a PyTorch Dev Infra developer, I want to deprecate CUDA 11.3 support and builds in CI/CD. We are considering the following actions:

  • Remove pull and trunk jobs: https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=11.3
  • Remove Binary builds for 11_3: https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=11_3
  • Remove nightly builds for 11_3: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=11.3
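
The HUD links above filter job names on two spellings of the version, `11.3` and `11_3`. As a rough sketch of how the affected jobs could be collected in one pass, the snippet below matches both spellings. The job names are hypothetical examples and the helper `jobs_to_remove` is illustrative only; neither is taken from the actual workflow definitions.

```python
import re

# Matches "11.3" or "11_3" as a standalone version token,
# so strings such as "11.30" or "111_3" are not matched.
CUDA_113 = re.compile(r"(?<!\d)11[._]3(?!\d)")


def jobs_to_remove(job_names):
    """Return the job names that mention CUDA 11.3 in either spelling."""
    return [name for name in job_names if CUDA_113.search(name)]


# Hypothetical job names, for illustration only.
example_jobs = [
    "linux-bionic-cuda11.3-py3.7 / build",
    "manywheel-py3_8-cuda11_3-build",
    "linux-bionic-cuda11.6-py3.10 / build",
    "conda-py3_9-cuda11_6-build",
]

for name in jobs_to_remove(example_jobs):
    print(name)
```

Matching on a version token rather than a plain substring avoids accidentally selecting jobs whose names merely contain the same digits.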

The stable CUDA version will be 11.6 for PyTorch release 1.13.

Issue: Removing CUDA 11.3 completely creates a gap between CUDA 11.4, which is used internally, and CUDA 11.6, which is planned for use in CI. We may observe more failures in internal tests when merging GH1 code.

Hence, the following solutions are possible:

  1. Update and hope nothing breaks (and roll back to 11.3 if it does)
  2. Keep the 11.3 base tests as trunk- and pull-only jobs
  3. Update 10.2 to 11.4 (no need for nightly builds, just CI)

I lean toward solution 2, followed by solution 3, since it provides the coverage we need for GH1 failures while removing 11.3 from the official release. We can implement solution 3 as a better-engineering task in a follow-up.

cc @ptrblck @malfet @seemethere @ngimel

atalman, Sep 06 '22 21:09

Option 2 seems the best from my POV

seemethere, Sep 07 '22 17:09

Option 2 will create a problem, as the 11.3 compiler is super slow. I propose we just kill 11.3 and watch for an increase in the number of CUDA-related regressions when importing changes. If it spikes, then we should spin up 11.4 CI-only builds to mimic internal behaviour.

malfet, Sep 08 '22 21:09

Spinning up 11.4 CI-only builds seems like a project of its own. We usually spend around 2-3 weeks upgrading CUDA: https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD However, to secure the internal build, we should probably do that.

If we remove 11.3 completely now, we risk breaking the system through the use of new data types or new methods that are compatible with 11.6 but not with 11.4. Such breakages are often not easy to analyze and can be costly to debug each time.

As we migrate to the latest CUDA versions (11.7, 11.8, etc.), the gap between the internal and external versions will increase, and we will surely hit this issue, if not now, then in 6 months or so.

I think we are better off migrating from 11.3 directly to 11.4 for the two CI jobs: not removing 11.3, but replacing 11.3 with 11.4. I think this would be the safest and easiest way forward.

atalman, Sep 08 '22 21:09