[QUESTION] RuntimeError: CUDA error: no kernel image is available for execution on the device
I'm trying to run a GPT pre-training script using examples/pretrain_gpt.sh, and the following is the error message I get:
Traceback (most recent call last):
File "/workspace/megatron/pretrain_gpt.py", line 230, in <module>
pretrain(train_valid_test_datasets_provider,
File "/workspace/megatron/megatron/training.py", line 180, in pretrain
iteration = train(forward_step_func,
File "/workspace/megatron/megatron/training.py", line 784, in train
train_step(forward_step_func,
File "/workspace/megatron/megatron/training.py", line 447, in train_step
losses_reduced = forward_backward_func(
File "/workspace/megatron/megatron/core/pipeline_parallel/schedules.py", line 327, in forward_backward_no_pipelining
output_tensor = forward_step(
File "/workspace/megatron/megatron/core/pipeline_parallel/schedules.py", line 183, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/workspace/megatron/pretrain_gpt.py", line 181, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/core/distributed/distributed_data_parallel.py", line 136, in forward
return self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/model/module.py", line 181, in forward
outputs = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/model/gpt_model.py", line 82, in forward
lm_output = self.language_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/model/language_model.py", line 493, in forward
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/model/transformer.py", line 1761, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/model/transformer.py", line 1150, in forward
self.self_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/model/transformer.py", line 666, in forward
mixed_x_layer, _ = self.query_key_value(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/megatron/megatron/core/tensor_parallel/layers.py", line 747, in forward
output_parallel = self._forward_impl(
File "/workspace/megatron/megatron/core/tensor_parallel/layers.py", line 528, in linear_with_grad_accumulation_and_async_allreduce
return LinearWithGradAccumulationAndAsyncCommunication.apply(*args)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/autocast_mode.py", line 113, in decorate_fwd
return fwd(*args, **kwargs)
File "/workspace/megatron/megatron/core/tensor_parallel/layers.py", line 336, in forward
output = torch.matmul(total_input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2023-12-07 20:27:40,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 400) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I am using an NVIDIA Docker container (nvcr.io/nvidia/pytorch:23.10-py3) and my GPU is an NVIDIA Titan Xp. I also tried multiple different NVIDIA container versions, but none of them work. With container versions >= pytorch:23.03 I get the error message above, and with older container versions I get the error message below:
Traceback (most recent call last):
File "pretrain_gpt.py", line 9, in <module>
from megatron import get_args
File "/workspace/megatron/megatron/__init__.py", line 15, in <module>
from .initialize import initialize_megatron
File "/workspace/megatron/megatron/initialize.py", line 18, in <module>
from megatron.arguments import parse_args, validate_args
File "/workspace/megatron/megatron/arguments.py", line 16, in <module>
from megatron.core.models.retro import RetroConfig
File "/workspace/megatron/megatron/core/models/retro/__init__.py", line 4, in <module>
from .decoder_spec import get_retro_decoder_block_spec
File "/workspace/megatron/megatron/core/models/retro/decoder_spec.py", line 5, in <module>
from megatron.core.models.gpt.gpt_layer_specs import (
File "/workspace/megatron/megatron/core/models/gpt/__init__.py", line 1, in <module>
from .gpt_model import GPTModel
File "/workspace/megatron/megatron/core/models/gpt/gpt_model.py", line 15, in <module>
from megatron.core.transformer.transformer_block import TransformerBlock
File "/workspace/megatron/megatron/core/transformer/transformer_block.py", line 13, in <module>
from megatron.core.transformer.custom_layers.transformer_engine import TENorm
File "/workspace/megatron/megatron/core/transformer/custom_layers/transformer_engine.py", line 337, in <module>
class TEDotProductAttention(te.pytorch.DotProductAttention):
AttributeError: module 'transformer_engine.pytorch' has no attribute 'DotProductAttention'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 386) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
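For reference, the AttributeError above indicates that the Transformer Engine build shipped in the older containers predates DotProductAttention. A minimal probe like the following (just a diagnostic sketch, assuming transformer_engine is importable inside the container) shows what the installed build actually provides:

import transformer_engine
import transformer_engine.pytorch as te

# Version string of the installed build, if it exposes one.
print("transformer_engine:", getattr(transformer_engine, "__version__", "unknown"))
# Older builds do not define DotProductAttention, which is exactly what the
# AttributeError in the traceback above is reporting.
print("has DotProductAttention:", hasattr(te, "DotProductAttention"))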
I tried to reinstall transformer_engine from source manually, but the install did not succeed. Here is my system environment when I run with the NVIDIA Docker container (nvcr.io/nvidia/pytorch:23.10-py3):
PyTorch version: 2.1.0a0+32f93b1
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-89-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA TITAN Xp
GPU 1: NVIDIA TITAN Xp
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.1.0a0+32f93b1
[pip3] torch-tensorrt==0.0.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.16.0a0
[pip3] triton==2.1.0+e621604
Any help would be very much appreciated. Thank you.
Have you solved this problem? I have encountered the same issue.
No, I have not solved it yet.
Marking as stale. No activity in 60 days.
+1
The environment of the Docker container causes this problem; please use the newest container from NGC.
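One thing worth verifying before switching containers: the "no kernel image is available" error typically means the PyTorch build inside the container does not include compiled kernels for the GPU's compute capability (the Titan Xp reports sm_61). A minimal diagnostic sketch, assuming it is run inside the container in question:

import torch

# Compute capability of the first visible GPU; a Titan Xp reports (6, 1), i.e. sm_61.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm_{major}{minor}")

# Architectures this PyTorch build ships kernels for, e.g. ['sm_70', 'sm_80', ...].
print("PyTorch built for:", torch.cuda.get_arch_list())

If sm_61 (or a matching compute_60/compute_61 PTX entry) is missing from the arch list, that build cannot launch kernels on the Titan Xp, which matches the error in the first traceback.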
Marking as stale. No activity in 60 days.
Maybe there is some mismatch between your GPU and PyTorch. Try building the environment locally with torch==2.1, like below:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install 'numpy<2'
pip install psutil
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
Then try to run the training script.
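After setting up the environment, a quick sanity check like the sketch below (not part of the Megatron repo; just a hedged example run inside the new environment) confirms that the installed torch wheel actually ships kernels for the GPU before launching the full training script:

import torch

assert torch.cuda.is_available(), "CUDA is not visible to this PyTorch build"
print("torch:", torch.__version__, "| built for:", torch.cuda.get_arch_list())
print("device:", torch.cuda.get_device_name(0),
      "capability:", torch.cuda.get_device_capability(0))

# Run one tiny matmul on the GPU; this is the same kind of op that failed in the
# original traceback (torch.matmul inside the column-parallel linear layer).
x = torch.randn(16, 16, device="cuda")
y = torch.matmul(x, x.t())
torch.cuda.synchronize()
print("matmul OK:", tuple(y.shape))

If this runs cleanly but examples/pretrain_gpt.sh still fails, the problem is more likely in the Megatron or Transformer Engine setup than in the torch build itself.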