kernl
bug: Torch.dynamo is not working on H100 due to obsolete triton & pytorch
Description
Torch.dynamo is not working on H100 due to obsolete triton & pytorch
Steps to reproduce
Easily reproducible on H100 by running 'pytest -k benchmark'
Expected Behavior
Works.
Actual Behavior
Doesn't work. The root cause is the old Triton (v2.0.0), which has no support for H100 (sm_90). The first error comes from PyTorch:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
This one can be solved by installing a newer Torch (2.0.1+cu118) from the suggested URL.
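The mismatch in the warning can be reproduced with a small check. This is only a sketch: the helper name `is_capability_supported` is hypothetical, and the rule it applies (a compiled `sm_XY` target is treated as usable on any device with the same major version) is a simplifying assumption, not necessarily the exact logic PyTorch uses. The architecture list is copied from the warning above.

```python
from typing import Iterable, Tuple


def is_capability_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if any compiled sm_XY target shares the device's major version.

    Simplifying assumption: a binary built for sm_XY is treated as usable on
    any device whose major compute capability is X.
    """
    major, _minor = capability
    # Keep only real-architecture entries such as "sm_80"; skip e.g. "compute_80".
    compiled = [int(arch.split("_")[1]) for arch in arch_list if arch.startswith("sm_")]
    return any(sm // 10 == major for sm in compiled)


# Architecture list copied from the warning in this report.
REPORTED_ARCHS = "sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86".split()

if __name__ == "__main__":
    # H100 reports compute capability 9.0 (sm_90).
    print(is_capability_supported((9, 0), REPORTED_ARCHS))

    # Optional live check against the local install, only if torch is present.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        print(is_capability_supported(torch.cuda.get_device_capability(0),
                                      torch.cuda.get_arch_list()))
```

With the list from the warning, no entry has major version 9, which matches the incompatibility reported for sm_90.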
The second one is a triton issue:
E RuntimeError: CUDA error: no kernel image is available for execution on the device
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
Triton v2.0.0 has a limitation: it only supports compute capabilities below sm_90 (sm_90 itself is excluded). I could not easily install a newer Triton, since the installation complains that it is incompatible. However, I was able to hack Triton: I checked it out locally, synced to the v2.0.0 tag, and reverted commit d54c04ab. I am not sure it uses all SMs correctly on H100 after this surgery.
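Because the failure depends on the exact torch and triton builds, it may help to print the versions that are actually installed before and after any upgrade or local patch. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical:

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "triton"]).items():
        print(f"{name}: {version or 'not installed'}")
```

This reads package metadata only, so it does not import torch or triton and works even when the GPU stack itself fails to load.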
Your environment
Using Docker:
DOCKER_BUILDKIT=1 docker build -t kernl .
docker run --rm -it --gpus all -v $(pwd):/kernl kernl
Also tried the more recent NVIDIA Docker image (12.2.0-devel-ubuntu22.04): same result.
Packages:
Package Version Editable project location
------------------------- ------------- -------------------------
aiohttp 3.8.5
aiosignal 1.3.1
anyio 3.7.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
async-lru 2.0.3
async-timeout 4.0.2
attrs 23.1.0
audioread 3.0.0
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
black 23.7.0
bleach 6.0.0
blinker 1.4
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.6
cmake 3.27.0
comm 0.1.3
cryptography 3.4.8
datasets 2.14.0
dbus-python 1.2.18
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
distro 1.7.0
distro-info 1.1build1
exceptiongroup 1.1.2
executing 1.2.0
fastjsonschema 2.18.0
filelock 3.12.2
flake8 6.0.0
fqdn 1.5.1
frozenlist 1.4.0
fsspec 2023.6.0
httplib2 0.20.2
huggingface-hub 0.16.4
idna 3.4
importlib-metadata 6.8.0
iniconfig 2.0.0
ipykernel 6.25.0
ipython 8.14.0
ipython-genutils 0.2.0
ipywidgets 8.0.7
isoduration 20.11.0
isort 5.12.0
jedi 0.18.2
jeepney 0.7.1
Jinja2 3.1.2
joblib 1.3.1
json5 0.9.14
jsonpointer 2.4
jsonschema 4.18.4
jsonschema-specifications 2023.7.1
jupyter 1.0.0
jupyter_client 8.3.0
jupyter-console 6.6.3
jupyter_core 5.3.1
jupyter-events 0.6.3
jupyter-lsp 2.2.0
jupyter_server 2.7.0
jupyter_server_terminals 0.4.4
jupyterlab 4.0.3
jupyterlab-pygments 0.2.2
jupyterlab_server 2.24.0
jupyterlab-widgets 3.0.8
kernl 0.2.2 /kernl/src
keyring 23.5.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lazy_loader 0.3
librosa 0.10.0.post2
lit 16.0.6
llvmlite 0.40.1
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mccabe 0.7.0
mistune 3.0.1
more-itertools 8.10.0
mpmath 1.3.0
msgpack 1.0.5
multidict 6.0.4
multiprocess 0.70.15
mypy-extensions 1.0.0
nbclient 0.8.0
nbconvert 7.7.3
nbformat 5.9.1
nest-asyncio 1.5.6
networkx 3.1
notebook 7.0.0
notebook_shim 0.2.3
numba 0.57.1
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.0
overrides 7.3.1
packaging 23.1
pandas 2.0.3
pandocfilters 1.5.0
parso 0.8.3
pathspec 0.11.1
pexpect 4.8.0
pickleshare 0.7.5
pip 23.2.1
platformdirs 3.9.1
pluggy 1.2.0
pooch 1.6.0
prometheus-client 0.17.1
prompt-toolkit 3.0.39
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pycodestyle 2.10.0
pycparser 2.21
pyflakes 3.0.1
Pygments 2.15.1
PyGObject 3.42.1
PyJWT 2.3.0
pyparsing 2.4.7
pytest 7.4.0
python-apt 2.4.0+ubuntu1
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3
PyYAML 6.0.1
pyzmq 25.1.0
qtconsole 5.4.3
QtPy 2.3.1
referencing 0.30.0
regex 2023.6.3
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.9.2
safetensors 0.3.1
scikit-learn 1.3.0
scipy 1.11.1
SecretStorage 3.3.1
Send2Trash 1.8.2
setuptools 58.1.0
six 1.16.0
sniffio 1.3.0
soundfile 0.12.1
soupsieve 2.4.1
soxr 0.3.5
stack-data 0.6.2
sympy 1.12
tabulate 0.9.0
termcolor 2.3.0
terminado 0.17.1
threadpoolctl 3.2.0
tinycss2 1.2.1
tokenize-rt 5.1.0
tokenizers 0.13.3
tomli 2.0.1
torch 2.0.0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.31.0
triton 2.0.0
typing_extensions 4.7.1
tzdata 2023.3
unattended-upgrades 0.1
uri-template 1.3.0
urllib3 2.0.4
wadllib 1.3.6
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.1
wheel 0.41.0
widgetsnbextension 4.0.8
xxhash 3.2.0
yarl 1.9.2
zipp 1.0.0
Self-service
- [ ] I would be willing to help fix this bug myself.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct