bug: Torch.dynamo is not working on H100 due to obsolete triton & pytorch

Open Artyom17 opened this issue 1 year ago • 0 comments

Description

Torch.dynamo does not work on H100 because the Triton and PyTorch versions in use are too old to support it.

Steps to reproduce

Easily reproducible on H100 by running 'pytest -k benchmark'

Expected Behavior

The benchmarks run successfully on H100.

Actual Behavior

The benchmarks fail. The main problem is the old Triton (v2.0.0), which has no support for H100 (sm_90). Two errors appear. The first one comes from PyTorch:

  NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
  The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
  If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

This one can be solved by installing a newer Torch, 2.0.1+cu118, from the suggested URL.
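The warning above can be checked ahead of time by comparing the device's compute capability with the architectures the installed PyTorch build was compiled for. The following is a minimal sketch, not PyTorch's own check: the helper names are hypothetical, and it assumes the usual rule that a binary built for sm_XY runs on devices with the same major version and an equal or higher minor version. PTX entries such as compute_80 are ignored.

```python
import re


def parse_sm_archs(arch_list):
    """Extract integer SM versions (e.g. 86) from entries such as 'sm_86'.

    Entries that are not plain 'sm_<digits>' (e.g. 'compute_80') are skipped.
    """
    versions = []
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)", arch)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def has_kernel_image(arch_list, capability):
    """Return True if some compiled architecture can run on the device.

    Assumption: a binary for sm_XY runs on a device whose capability has
    the same major version X and a minor version >= Y.
    """
    major, minor = capability
    return any(
        sm // 10 == major and sm % 10 <= minor
        for sm in parse_sm_archs(arch_list)
    )


def check_current_device():
    """Run the check against the local GPU; returns None without CUDA."""
    import torch  # imported lazily so the helpers above work without torch

    if not torch.cuda.is_available():
        return None
    capability = torch.cuda.get_device_capability(0)
    return has_kernel_image(torch.cuda.get_arch_list(), capability)
```

Feeding it the architecture list from the warning together with the H100 capability (9, 0) reports no usable kernel image, which matches the message PyTorch prints.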

The second one is a Triton issue:

E       RuntimeError: CUDA error: no kernel image is available for execution on the device
E       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Triton v2.0.0 has a limitation: it only supports architectures below sm_90 (sm_90 itself is not included). I could not easily install a newer Triton, since the install complains that it is incompatible. However, I was able to hack Triton: I got it locally, synced to the v2.0.0 tag, and reverted the d54c04ab commit. I am not sure it is using all SMs correctly on H100 after this surgery.
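The incompatibility complaint is presumably caused by the installed torch package declaring a pinned triton requirement. That is an assumption; the report does not show the exact message. A small sketch for inspecting what an installed package declares, using only the standard library (the helper names are hypothetical):

```python
import re
from importlib import metadata


def requirement_name(requirement):
    """Return the lowercased project name from a requirement string.

    Works for forms like 'triton==2.0.0' and
    'triton (==2.0.0) ; platform_system == "Linux"'.
    """
    return re.split(r"[\s;<>=!~(\[]", requirement.strip(), maxsplit=1)[0].lower()


def declared_pins(package, dependency):
    """List the requirement strings `package` declares for `dependency`.

    Returns None if `package` is not installed, and an empty list if it
    is installed but declares nothing for `dependency`.
    """
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return None
    wanted = dependency.lower()
    return [r for r in requirements if requirement_name(r) == wanted]
```

Calling `declared_pins("torch", "triton")` inside the container would show whether torch constrains the triton version, which helps decide whether both packages need to be upgraded together.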

Your environment

Using Docker:

DOCKER_BUILDKIT=1 docker build -t kernl .
docker run --rm -it --gpus all -v $(pwd):/kernl kernl

Also tried the more recent NVIDIA Docker image (12.2.0-devel-ubuntu22.04): same result.

Packages:

Package                   Version       Editable project location
------------------------- ------------- -------------------------
aiohttp                   3.8.5
aiosignal                 1.3.1
anyio                     3.7.1
appdirs                   1.4.4
argon2-cffi               21.3.0
argon2-cffi-bindings      21.2.0
arrow                     1.2.3
asttokens                 2.2.1
async-lru                 2.0.3
async-timeout             4.0.2
attrs                     23.1.0
audioread                 3.0.0
Babel                     2.12.1
backcall                  0.2.0
beautifulsoup4            4.12.2
black                     23.7.0
bleach                    6.0.0
blinker                   1.4
certifi                   2023.7.22
cffi                      1.15.1
charset-normalizer        3.2.0
click                     8.1.6
cmake                     3.27.0
comm                      0.1.3
cryptography              3.4.8
datasets                  2.14.0
dbus-python               1.2.18
debugpy                   1.6.7
decorator                 5.1.1
defusedxml                0.7.1
dill                      0.3.7
distro                    1.7.0
distro-info               1.1build1
exceptiongroup            1.1.2
executing                 1.2.0
fastjsonschema            2.18.0
filelock                  3.12.2
flake8                    6.0.0
fqdn                      1.5.1
frozenlist                1.4.0
fsspec                    2023.6.0
httplib2                  0.20.2
huggingface-hub           0.16.4
idna                      3.4
importlib-metadata        6.8.0
iniconfig                 2.0.0
ipykernel                 6.25.0
ipython                   8.14.0
ipython-genutils          0.2.0
ipywidgets                8.0.7
isoduration               20.11.0
isort                     5.12.0
jedi                      0.18.2
jeepney                   0.7.1
Jinja2                    3.1.2
joblib                    1.3.1
json5                     0.9.14
jsonpointer               2.4
jsonschema                4.18.4
jsonschema-specifications 2023.7.1
jupyter                   1.0.0
jupyter_client            8.3.0
jupyter-console           6.6.3
jupyter_core              5.3.1
jupyter-events            0.6.3
jupyter-lsp               2.2.0
jupyter_server            2.7.0
jupyter_server_terminals  0.4.4
jupyterlab                4.0.3
jupyterlab-pygments       0.2.2
jupyterlab_server         2.24.0
jupyterlab-widgets        3.0.8
kernl                     0.2.2         /kernl/src
keyring                   23.5.0
launchpadlib              1.10.16
lazr.restfulclient        0.14.4
lazr.uri                  1.0.6
lazy_loader               0.3
librosa                   0.10.0.post2
lit                       16.0.6
llvmlite                  0.40.1
MarkupSafe                2.1.3
matplotlib-inline         0.1.6
mccabe                    0.7.0
mistune                   3.0.1
more-itertools            8.10.0
mpmath                    1.3.0
msgpack                   1.0.5
multidict                 6.0.4
multiprocess              0.70.15
mypy-extensions           1.0.0
nbclient                  0.8.0
nbconvert                 7.7.3
nbformat                  5.9.1
nest-asyncio              1.5.6
networkx                  3.1
notebook                  7.0.0
notebook_shim             0.2.3
numba                     0.57.1
numpy                     1.24.4
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
oauthlib                  3.2.0
overrides                 7.3.1
packaging                 23.1
pandas                    2.0.3
pandocfilters             1.5.0
parso                     0.8.3
pathspec                  0.11.1
pexpect                   4.8.0
pickleshare               0.7.5
pip                       23.2.1
platformdirs              3.9.1
pluggy                    1.2.0
pooch                     1.6.0
prometheus-client         0.17.1
prompt-toolkit            3.0.39
psutil                    5.9.5
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   12.0.1
pycodestyle               2.10.0
pycparser                 2.21
pyflakes                  3.0.1
Pygments                  2.15.1
PyGObject                 3.42.1
PyJWT                     2.3.0
pyparsing                 2.4.7
pytest                    7.4.0
python-apt                2.4.0+ubuntu1
python-dateutil           2.8.2
python-json-logger        2.0.7
pytz                      2023.3
PyYAML                    6.0.1
pyzmq                     25.1.0
qtconsole                 5.4.3
QtPy                      2.3.1
referencing               0.30.0
regex                     2023.6.3
requests                  2.31.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.9.2
safetensors               0.3.1
scikit-learn              1.3.0
scipy                     1.11.1
SecretStorage             3.3.1
Send2Trash                1.8.2
setuptools                58.1.0
six                       1.16.0
sniffio                   1.3.0
soundfile                 0.12.1
soupsieve                 2.4.1
soxr                      0.3.5
stack-data                0.6.2
sympy                     1.12
tabulate                  0.9.0
termcolor                 2.3.0
terminado                 0.17.1
threadpoolctl             3.2.0
tinycss2                  1.2.1
tokenize-rt               5.1.0
tokenizers                0.13.3
tomli                     2.0.1
torch                     2.0.0
tornado                   6.3.2
tqdm                      4.65.0
traitlets                 5.9.0
transformers              4.31.0
triton                    2.0.0
typing_extensions         4.7.1
tzdata                    2023.3
unattended-upgrades       0.1
uri-template              1.3.0
urllib3                   2.0.4
wadllib                   1.3.6
wcwidth                   0.2.6
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.6.1
wheel                     0.41.0
widgetsnbextension        4.0.8
xxhash                    3.2.0
yarl                      1.9.2
zipp                      1.0.0

Self-service

  • [ ] I would be willing to help fix this bug myself.

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

Artyom17 avatar Jul 25 '23 22:07 Artyom17