
RuntimeError: Triton Error [CUDA]: invalid device context

andymvp2018 opened this issue on Aug 13 '24 · 4 comments

🐛 Describe the bug

wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
Traceback (most recent call last):
  File "/Users/H100/OLMo/scripts/train.py", line 347, in <module>
    main(cfg)
  File "/Users/H100/OLMo/scripts/train.py", line 319, in main
    trainer.fit()
  File "/Users/H100/OLMo/olmo/train.py", line 1152, in fit
    metrics = self.train_step(batch, reduce_global_loss=should_log_this_step)
  File "/Users/H100/OLMo/olmo/train.py", line 781, in train_step
    ce_batch_loss, z_batch_loss = self.train_batch(batch)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/Users/H100/OLMo/olmo/train.py", line 758, in train_batch
    loss.backward()
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
    return user_fn(self, *args)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 882, in backward
    out = call_compiled_backward()
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 831, in call_compiled_backward
    out = call_func_at_runtime_with_args(
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 906, in __call__
    return self.get_current_callable()(inputs)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 784, in run
    return model(new_inputs)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 934, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)
  File "/tmp/torchinductor_dejasu/yz/cyzj56loyzqxbsmpxbkpn2snn62qzjk6zvqc7nhgbi262jwngmlr.py", line 80, in call
    triton_poi_fused_div_0.run(tangents_1, buf0, 1, grid=grid(1), stream=stream0)
  File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 670, in run
    return launcher(
  File "<string>", line 7, in launcher
RuntimeError: Triton Error [CUDA]: invalid device context

Versions

Python 3.10.14

-e git+https://github.com/allenai/OLMo.git@4332c3224030a321c5894df18f97049b10a56582#egg=ai2_olmo
ai2-olmo-core==0.1.0
aiohappyeyeballs==2.3.5
aiohttp==3.10.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
black==23.12.1
boltons==24.0.0
boto3==1.34.158
botocore==1.34.158
build==1.2.1
cached_path==1.6.3
cachetools==5.4.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cryptography==43.0.0
datasets==2.20.0
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
exceptiongroup==1.2.2
face==20.1.1
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.5.0
ftfy==6.2.3
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.33.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.2.0
importlib_resources==6.4.0
iniconfig==2.0.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.2
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==2.0.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
pathspec==0.12.1
petname==2.6
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyproject_hooks==1.1.0
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.10.2
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.5.7
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
SecretStorage==3.3.3
sentry-sdk==2.12.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.3.1
torchmetrics==1.4.1
tqdm==4.66.5
transformers==4.44.0
triton==2.3.1
trouting==0.3.3
twine==5.1.1
types-setuptools==71.1.0.20240806
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.6
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.2

andymvp2018 · Aug 13 '24

Can you share the config file? My guess is that setting compile: null in your config could get rid of this issue.

2015aroras · Aug 13 '24

@2015aroras, I use the exact configs/official/OLMo-7B.yaml and only modify the training data path. So are you suggesting setting compile: null in that official OLMo-7B.yaml?

andymvp2018 · Aug 13 '24

You should try replacing

compile:
  fullgraph: false

with compile: null. I don't think the compile option affects the training loss; it only affects throughput.
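
Concretely, the change looks like this (a minimal sketch; only the top-level compile block in configs/official/OLMo-7B.yaml changes, everything else stays as-is):

# before: torch.compile enabled, without full-graph compilation
compile:
  fullgraph: false

# after: torch.compile disabled entirely
compile: null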

2015aroras · Aug 14 '24

Ran into the same issue. I want to use compile to improve throughput. Is there a fix for the above? Is torch.compile not supported for OLMo models?
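
One way to narrow this down is a standalone check of torch.compile outside OLMo (a minimal sketch, not from this thread; the function f and the tensor size are illustrative). If this also raises the invalid device context error, the problem is in the torch/triton environment rather than in OLMo:

import torch

def f(x):
    # a scalar loss involving a division, loosely mirroring the
    # fused-div kernel (triton_poi_fused_div_0) in the trace above
    return (x * 2.0).sum() / x.numel()

torch.cuda.set_device(0)  # ensure this thread has an active CUDA context
x = torch.randn(1024, device="cuda", requires_grad=True)
loss = torch.compile(f)(x)
loss.backward()  # the original failure happened while launching a compiled backward kernel
print(x.grad.abs().sum().item())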

RithvikKolla · Aug 20 '24

Hi, thanks again for the inquiry! We’re currently working on closing out old tickets, so we’re closing this out for now, but if you require a follow-up response, please re-open and we will get back to you!

baileykuehl · Jul 01 '25