RuntimeError: Triton Error [CUDA]: invalid device context
🐛 Describe the bug
h100-196-003:0 err: wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
h100-196-003:0 err: Traceback (most recent call last):
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 347, in
Versions
Python 3.10.14 -e git+https://github.com/allenai/OLMo.git@4332c3224030a321c5894df18f97049b10a56582#egg=ai2_olmo ai2-olmo-core==0.1.0 aiohappyeyeballs==2.3.5 aiohttp==3.10.2 aiosignal==1.3.1 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 async-timeout==4.0.3 attrs==24.2.0 backports.tarfile==1.2.0 beaker-gantry==1.8.3 beaker-py==1.31.2 black==23.12.1 boltons==24.0.0 boto3==1.34.158 botocore==1.34.158 build==1.2.1 cached_path==1.6.3 cachetools==5.4.0 certifi==2024.7.4 cffi==1.17.0 charset-normalizer==3.3.2 click==8.1.7 click-help-colors==0.9.4 cryptography==43.0.0 datasets==2.20.0 dill==0.3.8 docker==7.1.0 docker-pycreds==0.4.0 docutils==0.21.2 exceptiongroup==1.2.2 face==20.1.1 filelock==3.13.4 frozenlist==1.4.1 fsspec==2024.5.0 ftfy==6.2.3 gitdb==4.0.11 GitPython==3.1.43 glom==23.5.0 google-api-core==2.19.1 google-auth==2.33.0 google-cloud-core==2.4.1 google-cloud-storage==2.18.2 google-crc32c==1.5.0 google-resumable-media==2.7.2 googleapis-common-protos==1.63.2 huggingface-hub==0.23.5 idna==3.7 importlib_metadata==8.2.0 importlib_resources==6.4.0 iniconfig==2.0.0 isort==5.12.0 jaraco.classes==3.4.0 jaraco.context==5.3.0 jaraco.functools==4.0.2 jeepney==0.8.0 Jinja2==3.1.4 jmespath==1.0.1 joblib==1.4.2 keyring==25.3.0 lightning-utilities==0.11.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 mdurl==0.1.2 more-itertools==10.4.0 mpmath==1.3.0 msgspec==0.18.6 multidict==6.0.5 multiprocess==0.70.16 mypy==1.3.0 mypy-extensions==1.0.0 necessary==0.4.3 networkx==3.3 nh3==0.2.18 numpy==2.0.1 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.6.20 nvidia-nvtx-cu12==12.1.105 omegaconf==2.3.0 packaging==24.1 pandas==2.2.2 pathspec==0.12.1 petname==2.6 pkginfo==1.10.0 platformdirs==4.2.2 pluggy==1.5.0 proto-plus==1.24.0 protobuf==5.27.3 psutil==6.0.0 pyarrow==17.0.0 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.8.2 pydantic_core==2.20.1 Pygments==2.18.0 pyproject_hooks==1.1.0 pytest==8.3.2 pytest-sphinx==0.6.3 python-dateutil==2.9.0.post0 pytz==2024.1 PyYAML==6.0.2 readme_renderer==44.0 regex==2024.7.24 requests==2.32.3 requests-toolbelt==1.0.0 requirements-parser==0.10.2 rfc3986==2.0.0 rich==13.7.1 rsa==4.9 ruff==0.5.7 s3transfer==0.10.2 safetensors==0.4.4 scikit-learn==1.5.1 scipy==1.14.0 SecretStorage==3.3.3 sentry-sdk==2.12.0 setproctitle==1.3.3 six==1.16.0 smart-open==7.0.4 smashed==0.21.5 smmap==5.0.1 sympy==1.13.1 threadpoolctl==3.5.0 tokenizers==0.19.1 tomli==2.0.1 torch==2.3.1 torchmetrics==1.4.1 tqdm==4.66.5 transformers==4.44.0 triton==2.3.1 trouting==0.3.3 twine==5.1.1 types-setuptools==71.1.0.20240806 typing_extensions==4.12.2 tzdata==2024.1 urllib3==2.2.2 wandb==0.17.6 wcwidth==0.2.13 wrapt==1.16.0 xxhash==3.4.1 yarl==1.9.4 zipp==3.19.2
Can you share the config file? My guess is that setting compile: null in your config could get rid of this issue.
@2015aroras, I use the exact configs/official/OLMo-7B.yaml and only modify the training data path. So are you suggesting adding compile: null to that official OLMo-7B.yaml?
You should try replacing
compile:
  fullgraph: false
with compile: null. I don't think the compile option affects the training loss; it only affects the throughput.
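For anyone else hitting this, a minimal sketch of the suggested edit (assuming the rest of configs/official/OLMo-7B.yaml stays as shipped):

```yaml
# Before: the official config enables compilation via the compile block.
# compile:
#   fullgraph: false

# After: disable compilation entirely to work around the Triton
# "invalid device context" error.
compile: null
```

As noted above, this should not change the training loss; it only gives up the throughput gain from compilation.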
Ran into the same issue. I want to use compile to improve throughput. Is there any fix for this? Is compile not supported for OLMo models?
Hi, thanks again for the inquiry! We’re currently working on closing out old tickets, so we’re closing this out for now, but if you require a follow-up response, please re-open and we will get back to you!