RuntimeError: CUDA error: no kernel image is available for execution on the device
How can I fix this error? I ran the command: torchrun --nproc_per_node=1 perf.py --msa-length 128 --res-length 256, and the error below appeared. I am using PyTorch 1.10, Python 3.8, and CUDA 11.3.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 1.
initialize tensor model parallel with size 1
initialize data parallel with size 1
Traceback (most recent call last):
  File "perf.py", line 191, in <module>
    main()
  File "perf.py", line 156, in main
    layer_inputs = attn_layers[lyr_idx].forward(*layer_inputs, node_mask, pair_mask)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/evoformer.py", line 17, in forward
    node = self.msa_stack(node, pair, node_mask)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/msa.py", line 99, in forward
    node = self.MSARowAttentionWithPairBias(node, pair, node_mask_row)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/msa.py", line 43, in forward
    Z = self.layernormZ(Z)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/kernel/cuda_native/layer_norm.py", line 69, in forward
    return FusedLayerNormAffineFunction.apply(input, self.weight, self.bias,
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/kernel/cuda_native/layer_norm.py", line 22, in forward
    output, mean, invvar = fastfold_layer_norm_cuda.forward_affine(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 132) of binary: /root/miniconda3/envs/myconda/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/myconda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.0+cu113', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
perf.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2022-03-15_10:18:15
  host      : 69f885408067
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 132)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This is most likely because we specified only compute_70,code=sm_70 and compute_80,code=sm_80 when we compiled the CUDA module, so the extension contains kernels built only for those two targets. Could you please provide the hardware information for your machine (GPU model)?
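To collect that information, something like the following may help. It is only a rough sketch, not part of FastFold: the `COMPILED_ARCHS` list is an assumption taken from the two targets mentioned above, and `can_run` is a hypothetical helper that applies the usual CUDA rule that a binary built for sm_X.y runs only on devices with the same major version X and a minor version of at least y (no PTX is embedded when only `code=sm_XX` is given). The PyTorch calls used are `torch.cuda.get_device_name` and `torch.cuda.get_device_capability`.

```python
from typing import Iterable, Tuple

Capability = Tuple[int, int]

# Assumption: the extension was built only for sm_70 and sm_80,
# as stated in the reply above. Adjust if the build flags differ.
COMPILED_ARCHS = [(7, 0), (8, 0)]


def can_run(device: Capability,
            compiled: Iterable[Capability] = COMPILED_ARCHS) -> bool:
    """Return True if a device with this compute capability can execute
    at least one of the compiled binary targets.

    A binary for sm_X.y runs on devices with the same major version X
    and a minor version >= y.
    """
    major, minor = device
    return any(major == c_major and minor >= c_minor
               for c_major, c_minor in compiled)


def main() -> None:
    # Import lazily so the helper above can be used without PyTorch.
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return

    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch")
        return

    for idx in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(idx)
        cap = torch.cuda.get_device_capability(idx)  # e.g. (7, 5)
        status = "covered" if can_run(cap) else "NOT covered"
        print(f"GPU {idx}: {name}, compute capability "
              f"{cap[0]}.{cap[1]}, {status} by the compiled targets")


if __name__ == "__main__":
    main()
```

If a GPU is reported as not covered, the extension would presumably need to be rebuilt with a matching target added to the compile flags.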