sheffield_hpc
H100 Torch support
The Stanage pytorch documentation currently states that nightly builds must be used on the H100 nodes.
From PyPI, pytorch 2.1.0+ (October 4th 2023) is a CUDA 12.1 build which supports sm_90. This is now the default build from PyPI/conda, so someone just generically asking for pytorch will get a working version.
From the pytorch cu118 wheelhouse (https://download.pytorch.org/whl/cu118), pytorch 2.0.0+ supports sm_90. From the cu121 wheelhouse (https://download.pytorch.org/whl/cu121) pytorch 2.1.0+ supports sm_90.
From conda, any build available from pytorch-cuda=11.8 or pytorch-cuda=12.1 works, i.e. 2.0.0+.
The final torch 1.y.z builds used CUDA 11.7 at most, which can't support sm_90.
The note on the Stanage torch page about H100s can probably be changed to reflect this, although it will be worth testing at least one of these builds on an H100 node to ensure it does actually run device code.
Edit: Farhad has run 2.1.0 on the H100s successfully, so it's probably safe to make this change.
Something along the lines of (but I'm not super happy with this):
.. note::
The H100 GPU nodes in Stanage (see :ref:`Stanage specs <stanage-gpu-specs>`) require torch >= 2.0.0 built using CUDA 11.8 or newer.
* Torch >= 2.1.0 can be installed using pip from pypi or https://download.pytorch.org/whl/cu121, or using conda with `pytorch-cuda="12.1"`.
* Torch >= 2.0.0 can be installed using pip from https://download.pytorch.org/whl/cu118, or using conda with `pytorch-cuda="11.8"`.
* Torch < 2.0.0 is not compatible with the H100 GPUs.
For more information on how to install pytorch using CUDA >= 11.8, see the torch documentation.
Compute capability checking
I've only checked for the presence of sm_90 in the compute capability flags, and not checked for actual compatibility.
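A fuller check could look something like the sketch below (the helper name `arch_supported` and the PTX fallback rule are my own, not torch API): a device of compute capability (major, minor) can run native SASS if a matching `sm_XX` entry is present, or fall back to JIT compilation if any embedded `compute_YY` PTX target has YY <= the device's capability.

```python
def arch_supported(arch_list, major, minor):
    """How a GPU of compute capability (major, minor) could run this torch build:
    'sass' if a native cubin target matches, 'ptx' if an embedded PTX target
    can be JIT-compiled forward, None if neither."""
    cc = major * 10 + minor
    if f"sm_{cc}" in arch_list:
        return "sass"
    # PTX is forward compatible: compute_YY can be JITted for any arch >= YY
    if any(int(a.split("_")[1]) <= cc for a in arch_list if a.startswith("compute_")):
        return "ptx"
    return None

# e.g. the 2.1.0+cu121 arch list vs an H100 (sm_90):
print(arch_supported(
    ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90'], 9, 0))  # -> sass
```

The real answer also depends on runtime/driver versions, so treat this as a first-pass filter on `torch.cuda.get_arch_list()` output rather than a guarantee.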
pypi
From PyPI, only 2.1.0+ fetches a CUDA >= 11.8 build.
$ python3 -m pip install torch
$ python3 -c "import torch;print(torch.__version__);print(torch.cuda.get_arch_list())"
2.1.0+cu121
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
$ python3 -m pip install torch==2.0.1 --no-cache-dir
$ python3 -c "import torch;print(torch.__version__);print(torch.cuda.get_arch_list())"
2.0.1+cu117
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
pip, download.pytorch.org/whl/cu*
CUDA 11.8 wheels from https://download.pytorch.org/whl/cu118:
$ python3 -m pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
$ python3 -c "import torch;print(torch.__version__);print(torch.cuda.get_arch_list())"
2.1.0+cu118
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_37', 'sm_90']
$ python3 -m pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
$ python3 -c "import torch;print(torch.__version__);print(torch.cuda.get_arch_list())"
2.0.1+cu118
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
$ python3 -m pip install torch==2.0.0 --index-url https://download.pytorch.org/whl/cu118
$ python3 -c "import torch;print(torch.__version__);print(torch.cuda.get_arch_list())"
2.0.0+cu118
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
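The pip version strings above can also be checked mechanically from the `+cuXXX` local version tag. A hypothetical helper (heuristic only, and the name `h100_ok` is mine; inspecting `torch.cuda.get_arch_list()` directly is the reliable check):

```python
def h100_ok(version_string):
    """Heuristic check of a pip torch version string (e.g. '2.1.0+cu118')
    against the torch >= 2.0.0 / CUDA >= 11.8 requirement for sm_90."""
    ver, _, local = version_string.partition("+")
    if tuple(int(p) for p in ver.split(".")[:3]) < (2, 0, 0):
        return False
    if local.startswith("cu"):
        return int(local[2:]) >= 118  # 'cu118' -> 118, 'cu121' -> 121
    return None  # no local CUDA tag (e.g. conda builds): can't tell from the string

print(h100_ok("2.1.0+cu121"))  # -> True
print(h100_ok("2.0.1+cu117"))  # -> False
```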
conda
Not checking all versions (it takes a while), but importantly the latest, the latest explicit, and the oldest expected ones.
$ conda install pytorch pytorch-cuda -c pytorch -c nvidia
$ conda install pytorch==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
$ python3 -c "import torch;print(torch.__version__);print(torch.cuda.get_arch_list())"
2.1.0
['sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
$ conda install pytorch==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
$ python3 -c "import torch; print(torch.__version__);print(torch.cuda.get_arch_list())"
2.1.0
['sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_37', 'sm_90', 'compute_37']
$ conda install pytorch==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
$ python3 -c "import torch; print(torch.__version__);print(torch.cuda.get_arch_list())"
2.0.0
['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'compute_37']
11.7 builds don't include sm_90 (they can't):
$ conda install pytorch==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
$ python3 -c "import torch; print(torch.__version__);print(torch.cuda.get_arch_list())"
2.0.0
['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_37']
Interestingly, conda builds for CUDA < 12 embed PTX for sm_37, but newer builds do not embed any PTX, so a conda build would not work on a post-Hopper GPU (sm_10x?). Though PTX JITting for such a large library would take a very long time anyway.
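That SASS vs PTX split can be read straight out of the arch list. A small sketch (the helper name `split_arch_list` is mine), using the lists from the conda outputs above:

```python
def split_arch_list(arch_list):
    """Separate native cubin targets (sm_XX) from embedded PTX targets (compute_XX)."""
    sass = [a for a in arch_list if a.startswith("sm_")]
    ptx = [a for a in arch_list if a.startswith("compute_")]
    return sass, ptx

# conda 2.0.0 / pytorch-cuda=11.7 build: compute_37 PTX is embedded
_, ptx = split_arch_list(
    ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_37'])
print(ptx)  # -> ['compute_37']

# pip 2.1.0+cu121 build: no PTX at all, so no JIT fallback for future archs
_, ptx = split_arch_list(
    ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90'])
print(ptx)  # -> []
```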