Ansible request for NVidia CUDA toolkit

Open AswathySK opened this issue 1 year ago • 9 comments

The silent installation of the NVIDIA CUDA toolkit does not succeed. The NVIDIA GPU Computing Toolkit folder is not created, which causes builds to fail with an error saying CUDA_HOME was not found.

The issue can be resolved by changing compiler_9.1 in the playbook to nvcc_9.1 (see https://docs.nvidia.com/cuda/archive/9.1/cuda-installation-guide-microsoft-windows/index.html):

C:\temp\cuda_9.1.85_win10_network.exe -s nvcc_9.1 nvml_dev_9.1

Version 9.1 was released back in 2017. Is there any reason why we can't change it to a newer version? 12.0.0 supports most Windows versions: 10, 11, Server 2016, 2019 and 2022.
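
To illustrate what moving to a newer version would involve on the command line, here is a minimal sketch that builds the silent-install arguments from a version number. The function name cuda_silent_args is hypothetical, and the assumption that subpackages for other versions follow the same name_major.minor pattern as nvcc_9.1 and nvml_dev_9.1 should be checked against the installation guide for the chosen version.

```shell
# Hypothetical helper: print the silent-install arguments for a CUDA version.
# Assumption: subpackage names follow the <name>_<major.minor> pattern used
# by the 9.1 command above; verify this in the guide for the target version.
cuda_silent_args() {
  case "$1" in
    ''|*[!0-9.]*|.*|*.|*.*.*)
      # Empty, non-numeric, or not exactly major.minor
      echo "expected a major.minor version, got: '$1'" >&2
      return 1
      ;;
    *.*)
      printf '%s\n' "-s nvcc_$1 nvml_dev_$1"
      ;;
    *)
      # Digits only, no minor component
      echo "expected a major.minor version, got: '$1'" >&2
      return 1
      ;;
  esac
}

# Reproduces the arguments from the 9.1 command above
cuda_silent_args 9.1
```

Keeping the version in one place like this would mean the installer arguments and the install-status check could not drift apart the way 9.0 and 9.1 have.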

  • Bug in ansible playbook
  • Request for new playbook addition

AswathySK avatar Jun 05 '24 07:06 AswathySK

Any feelings on this @AdamBrousseau @pshipton ?

sxa avatar Jun 05 '24 19:06 sxa

@keithc-ca pls take a look.

pshipton avatar Jun 05 '24 19:06 pshipton

CUDA is only for OpenJ9, so I support updating to a newer version (e.g. 12.0).

There are other inconsistencies that should be addressed to successfully install and use whatever version we choose.

  1. the Windows ansible role looks for 9.0, before trying to install 9.1
  2. build-farm/platform-specific-configurations/windows.sh in temurin-build looks for version 9.0

It may make sense to choose the same version for Unix (which currently uses 9.0).
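
As a rough sketch of how the version lookup could be driven by a single setting rather than a hard-coded 9.0, something like the following might work. The names find_cuda_home, CUDA_VERSION and CUDA_BASE are illustrative and are not taken from windows.sh or the Ansible role, and the Cygwin-style default path is an assumption.

```shell
# Hypothetical helper: print the CUDA install directory for a version,
# or return non-zero if that version is not installed under the base path.
find_cuda_home() {
  base="$1"      # directory that contains the vX.Y folders
  version="$2"   # e.g. "9.1" or "12.0"
  candidate="$base/v$version"
  if [ -d "$candidate" ]; then
    printf '%s\n' "$candidate"
  else
    echo "CUDA $version not found under $base" >&2
    return 1
  fi
}

# Illustrative defaults; the Cygwin path below is an assumption.
CUDA_VERSION="${CUDA_VERSION:-9.1}"
CUDA_BASE="${CUDA_BASE:-/cygdrive/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA}"

if CUDA_HOME="$(find_cuda_home "$CUDA_BASE" "$CUDA_VERSION" 2>/dev/null)"; then
  export CUDA_HOME
  echo "Using CUDA_HOME=$CUDA_HOME"
else
  echo "CUDA $CUDA_VERSION is not installed; leaving CUDA_HOME unset"
fi
```

With the version passed in as a parameter, the role and the build script would only need to agree on one value.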

keithc-ca avatar Jun 05 '24 19:06 keithc-ca

@keithc-ca, I was planning to make a change to build-farm/platform-specific-configurations/windows.sh as well. Even when following the current playbook, the CUDA_PATH environment variable is set to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1.

Should I make the path change in the task that checks the installation status of the NVidia CUDA toolkit after the change is made in build-farm/platform-specific-configurations/windows.sh?

AswathySK avatar Jun 05 '24 19:06 AswathySK

Those changes would be in separate repositories, so (at least) two pull requests. Committers will coordinate the timing of merging them (assuming they approve).

keithc-ca avatar Jun 05 '24 20:06 keithc-ca

@steelhead31 @karianna, what are your thoughts on bumping to a newer version of the CUDA toolkit?

AswathySK avatar Jun 10 '24 08:06 AswathySK

@steelhead31 @karianna, what are your thoughts on bumping to a newer version of the CUDA toolkit?

Assuming the openj9 folks approve, as above, and the resultant JDK is run through the relevant test suites, prior to the changes being merged, I don't have any objections.

steelhead31 avatar Jun 10 '24 08:06 steelhead31

Assuming the openj9 folks approve, as above, and the resultant JDK is run through the relevant test suites, prior to the changes being merged, I don't have any objections.

Yeah, +1 from me. Since Temurin itself does not need CUDA, I'm happy to follow the suggestions from the upstream project that requires it in terms of versioning in this situation.

sxa avatar Aug 01 '24 15:08 sxa