
RuntimeError: CUDA error: unspecified launch failure

Open prabhatkumar95 opened this issue 4 years ago • 38 comments

Issue description

RuntimeError: CUDA error: unspecified launch failure occurs in any training script I run. The occurrence is not deterministic; it can happen at any time during the course of training. All of the same code runs fine on an RTX 3090.

/lib/python3.8/site-packages/torch/autograd/__init__.py
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: unspecified launch failure
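For intermittent failures like this, CUDA's asynchronous error reporting can be made synchronous so the Python stack trace points at the actual failing kernel rather than a later API call. A minimal sketch; the variable must be set before the CUDA context is created, which in practice means before the first CUDA call (safest: before importing torch):

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the reported
# stack trace points at the op that actually failed. It must be set before
# the CUDA context is created -- in practice, before `import torch` runs.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Setting it in the shell (CUDA_LAUNCH_BLOCKING=1 python train.py) has the same effect and avoids any ordering concerns.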

System Info

PyTorch version: 1.10.2+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.15.15-76051515-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 470.86
cuDNN version: Probably one of the following:
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.10.2+cu113
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.11.3+cu113
[conda] Could not collect


- PyTorch or Caffe2: PyTorch
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source):
- OS: Pop OS 20.04
- PyTorch version: '1.10.2+cu113'
- Python version: 3.8
- CUDA/cuDNN version: 11.3
- GPU models and configuration: A6000
- GCC version (if compiling from source):
- CMake version:
- Versions of any other relevant libraries:


cc @csarofeen @ptrblck @xwang233 @ngimel

prabhatkumar95 avatar Mar 15 '22 09:03 prabhatkumar95

Hi, do you have a code snippet which we can use to reproduce the issue?

H-Huang avatar Mar 15 '22 20:03 H-Huang

@anidh

prabhatkumar95 avatar Mar 16 '22 05:03 prabhatkumar95

Hi @H-Huang

I'm using the Ultralytics YOLOv5 repo to train the model. The command I use is:

python train.py --img 640 --batch 32 --epochs 400 --data idd.yaml --weights yolov5x.pt --rect --image-weights --evolve --device 0,1 --multi-scale --name demo-img --patience 30 --save-period 1 --worker 22 --quad

The error is very random: it can happen at the very 1st epoch or at the 10th, and there is no way to know in advance when it will happen. The CPU I'm using is an AMD Ryzen Threadripper PRO 3975WX.

anidh avatar Mar 16 '22 11:03 anidh

Error encountered after replacing the A6000 with an RTX 3090:

RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from record at /home/avlabs_blue/pytorch/aten/src/ATen/cuda/CUDAEvent.h:119 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f495477d0ac in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xed2b62 (0x7f4881e3bb62 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xed67d6 (0x7f4881e3f7d6 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x43c3fc (0x7f494d8493fc in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f4954765f35 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x339449 (0x7f494d746449 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x645dc2 (0x7f494da52dc2 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2f5 (0x7f494da53145 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x5eccdb]
frame #9: python() [0x5aee8a]
frame #10: python() [0x613925]
frame #11: python() [0x5d1e78]
frame #12: python() [0x5a958d]
frame #13: python() [0x5ed1a0]
frame #14: python() [0x544188]
frame #15: python() [0x5441da]
frame #16: python() [0x5441da]
frame #22: __libc_start_main + 0xf3 (0x7f495c8be0b3 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Environment details:

Collecting environment information...
PyTorch version: 1.12.0a0+gitd5744f4
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.16.11-76051611-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.54
cuDNN version: Probably one of the following:
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.12.0a0+gitd5744f4
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.13.0a0+00c119c
[conda] magma-cuda110 2.5.2 1 pytorch
[conda] mkl 2022.0.1 h06a4308_117
[conda] mkl-include 2022.0.1 h06a4308_117
[conda] numpy 1.21.2 py38hd8d4704_0
[conda] numpy-base 1.21.2 py38h2b8c604_0
[conda] torch 1.12.0a0+gitd5744f4 dev_0

prabhatkumar95 avatar Mar 17 '22 06:03 prabhatkumar95

Similar to the issue here, but it needs reopening and urgent attention.

prabhatkumar95 avatar Mar 17 '22 09:03 prabhatkumar95

A much simplified piece of code that reproduces this is in this repo.

We run the command CUDA_LAUNCH_BLOCKING=1 python train_cifar10.py --net res101 --bs 256

  File "/home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/_tensor.py", line 399, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([256, 256, 8, 8], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xa5c1afc0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 256, 256, 8, 8,
    strideA = 16384, 64, 8, 1,
output: TensorDescriptor 0xa7ab5f40
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 256, 256, 8, 8,
    strideA = 16384, 64, 8, 1,
weight: FilterDescriptor 0x7f44a46edba0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 256, 256, 3, 3,
Pointer addresses:
    input: 0x7f4523800000
    output: 0x7f4524800000
    weight: 0x7f4613e80000
Additional pointer addresses:
    grad_output: 0x7f4524800000
    grad_weight: 0x7f4613e80000
Backward filter algorithm: 5

prabhatkumar95 avatar Mar 17 '22 10:03 prabhatkumar95

Hi @prabhatkumar95 , thanks for reporting this issue. I saw the pytorch version you have is

PyTorch version: 1.12.0a0+gitd5744f4
Is debug build: False
CUDA used to build PyTorch: 11.2

Did you build PyTorch from source? CUDA 11.2 is not recommended. I'd suggest trying the latest CUDA 11.6 and cuDNN 8.3.3 if you prefer to build from source, or a pre-built pip wheel with CUDA 11.5 from https://pytorch.org/get-started/locally/ (just replace every 113 with 115 in the link).

xwang233 avatar Apr 01 '22 00:04 xwang233

Hi @xwang233, I tried building from source as well as using the docker image from here, but I still get the same error.

prabhatkumar95 avatar May 08 '22 19:05 prabhatkumar95

Hey there. I have had the same issue, happening consistently across different code bases: after a random number of epochs I get a CUDA error, most commonly "unspecified launch failure".

The same code runs completely fine on my friend's PC, although he is also using a 3090 GPU. My first thought was that my GPU might be broken, so we swapped GPUs, but nothing changed: it runs on his machine and not on mine. I installed PyTorch the recommended way (https://pytorch.org/get-started/locally/).

I originally came from this issue: https://github.com/pytorch/pytorch/issues/27837, where it was suggested in the end that the problem might be related to using AMD processors. Indeed I have an AMD processor and my friend's PC has an Intel CPU. Can anyone confirm or refute this theory? Any help is very welcome. I have tried debugging this error for the past weeks without success. My only hope now is that it really is AMD-CPU related; however, I would like to be really sure before exchanging my CPU and motherboard for only that reason.

municola avatar Jul 23 '22 16:07 municola

Hi @municola, I have an Intel® Core™ i7-10700K CPU @ 3.80GHz × 16 and an RTX 3090, and last night I encountered this error. I hadn't hit it with the same code repo before, and I still don't know whether it comes from my code changes or some random factor.

chyelang avatar Jul 27 '22 02:07 chyelang

I also experienced these issues across various projects (all using torch), using a machine with 4x 3090s and an AMD Ryzen Threadripper 3960X 24-Core Processor, NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6.

The only pattern I have noticed repeatedly is that it seems to happen more often when I use more than 2 GPUs; whenever I used only 2, it never happened. Beyond that, it occurs at random times across different repos (e.g. different torch versions) and projects (both computer vision and NLP pipelines).

JeanKaddour avatar Sep 07 '22 15:09 JeanKaddour

Also experiencing this issue on an RTX 4090!

Same exact scenario as people described here #27837, where the error always occurs randomly anywhere from 20 minutes to 20 hours into training.

Sometimes it is unspecified launch failure, and sometimes:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

I have tried each of the following and haven't been able to eliminate this error yet:

  • OS: Both native Ubuntu and also WSL
  • PyTorch: nightly (2.0.0.dev20230226+cu118), and manually built from source with CUDA 12
  • Driver: 520.56 and 525.89

Even though this is brand new hardware I've tried to rule out any potential hardware issue, so I have also tested:

  • GPU power limit reduced to 50%
  • Application memory usage reduced 50%
  • Running gpu-burner with no errors

The dmesg logs show Xid error 109:

NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014

Xid 109 does not even appear in NVIDIA's documentation??
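For anyone else chasing these crashes, the kernel ring buffer is where the NVIDIA driver records Xid events; a quick way to pull them out (assumes you have permission to read the kernel log, which often means root):

```shell
# List NVIDIA Xid events recorded by the driver in the kernel ring buffer.
# Exits quietly if the log is unreadable or no Xid events are present.
dmesg 2>/dev/null | grep -i "NVRM: Xid" || true
```

The Xid number in the matching lines is the key to NVIDIA's Xid error catalog.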

Any insight on this issue would be greatly appreciated, thanks!

PWhiddy avatar Feb 27 '23 05:02 PWhiddy

In my case, the issue may have had to do with pinned memory: after disabling it I have not yet encountered the error again. If anyone tries this, please let me know whether it fixes it for you.
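For reference, pinned (page-locked) memory in PyTorch is usually enabled through the DataLoader. A minimal sketch of toggling it off to rule it out as a cause; the toy dataset and sizes here are placeholders, not from any of the pipelines above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for real training data.
dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))

# pin_memory=True stages host tensors in page-locked memory for faster
# host-to-device copies; set it to False to take pinned memory out of
# the picture while debugging.
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=False)

for xb, yb in loader:
    pass  # training step would go here
```

The cost of pin_memory=False is usually just slightly slower host-to-device transfers.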

PWhiddy avatar Mar 01 '23 03:03 PWhiddy

@PWhiddy I am facing a similar issue while training YOLOv8 for segmentation on a custom dataset. I have been trying to solve it for the past week but am still unable to resolve it. Could you point out how you disabled pinned memory? Was this a change to CUDA pinned memory or to the PyTorch setting? Thanks & regards

rajpalaakash avatar Apr 17 '23 13:04 rajpalaakash

@rajpalaakash try running with CUDA_LAUNCH_BLOCKING=1 to see if there's a more specific error.

PWhiddy avatar Apr 17 '23 16:04 PWhiddy

I'm encountering the same issue when running my code on an RTX 4090 GPU. Interestingly, the code executes without any errors on both RTX 2080 Ti and A6000 GPUs. The error appears after a random number of epochs, interrupting the training process.

tae-jun avatar Apr 26 '23 12:04 tae-jun

Hi there, we solved this issue by changing our RAM sticks. One of the sticks was corrupted and had to be replaced. We performed an experiment where we kept the GPU the same and changed the RAM stick, and the issue stopped happening. I'd advise you to try removing the RAM sticks one by one and check whether the issue stops.

anidh avatar May 03 '23 06:05 anidh

Another possible cause, fixed here: on Windows I had optimized the GPU profile via MSI Afterburner, meaning lower voltages for specific MHz. I was getting the error, but after resetting the profile to standard, running PyTorch was no problem anymore.

I would never have thought it was because of that, as the system had been stable for nearly 2 years with the optimized profile. But maybe someone else has the same error and cause.

eikaramba avatar May 10 '23 14:05 eikaramba

An easy replication is to run any model repeatedly (a small model nested in a for loop, using torch to clear the cache every time). First, the small version works with no problem (so there shouldn't be a dimension mismatch etc.). Then, run it repeatedly in a loop, each time deleting the model reference, emptying the CUDA cache, and forcing garbage collection; even when memory usage is under control it will fail (not at the start, but usually after 5-20 minutes). The same model and loop run fine on Colab (albeit slowly). The error is one of "unspecified launch failure" or "an illegal memory access was encountered".

If you are in a Jupyter notebook, you cannot continue to run any CUDA code after getting the error: without restarting the notebook kernel, every run of something .cuda() or .to(device) immediately returns the same error. Outside Jupyter you likewise have to restart the process. Torch was installed with

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge

I have CUDA 11.4; I am trying to change that to 11.6 to see what happens (just in case the new CUDA works even if torch didn't say it should). Ubuntu 22.04, driver 470 proprietary, 3090 + i9-13900K + 64 GB DDR5 on an MSI Z790, 1 kW PSU, default BIOS. GPU temp stays below 77 °C; cooling is 4 Noctua fans on a Corsair 4000 Airflow.
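The replication recipe above can be sketched roughly as follows; the tiny Linear model, sizes, and iteration count are placeholders of mine, and the loop falls back to CPU when CUDA is unavailable (on a healthy machine it simply runs to completion):

```python
import gc

import torch


def stress_loop(n_iters=100, device=None):
    """Build, run, and free a small model repeatedly, emptying the CUDA
    cache each iteration, to probe for intermittent launch failures."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    for _ in range(n_iters):
        model = torch.nn.Linear(64, 64).to(device)
        x = torch.randn(32, 64, device=device)
        model(x).sum().backward()
        # Drop references and release cached blocks before the next round.
        del model, x
        gc.collect()
        if device == "cuda":
            torch.cuda.empty_cache()
    return device


stress_loop(n_iters=5)
```

On the affected machines described in this thread, such a loop reportedly fails after minutes even though each individual iteration is trivially small.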

HaoLi111 avatar May 18 '23 23:05 HaoLi111

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


torch.backends.cudnn.enabled = False


# Step 1: Load the data
data = np.random.rand(5000,20)

nVar = data.shape[1]


# Step 2: Create training and validation sets
train_data = data[:3000, :]
val_data = data[3000:, :]

# Step 3: Normalize the data
train_mean = train_data.mean(axis=0)
train_std = train_data.std(axis=0)
train_data = (train_data - train_mean) / train_std
val_data = (val_data - train_mean) / train_std

# Step 4: Create sequences of input data and target values
def create_sequences(data, seq_len):
    X = []
    y = []
    for i in range(seq_len, len(data)):
        X.append(data[i-seq_len:i,:])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

seq_len = 20
train_X, train_y = create_sequences(train_data, seq_len)

val_X, val_y = create_sequences(val_data, seq_len)

# Step 5: Define the Transformer model
class TransformerModel(nn.Module):

    def __init__(self, input_dim, output_dim, n_heads, n_layers, dropout):
        super(TransformerModel, self).__init__()

        self.transformer_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=input_dim,
                nhead=n_heads,
                dropout=dropout
            ),
            num_layers=n_layers
        )
        self.decoder = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        x = self.transformer_encoder(x)
        x = self.decoder(x[:, -1, :])
        return x

# Step 6: Train the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_dataset = torch.utils.data.TensorDataset(
    torch.tensor(train_X, dtype=torch.float32),
    torch.tensor(train_y, dtype=torch.float32)
)
val_dataset = torch.utils.data.TensorDataset(
    torch.tensor(val_X, dtype=torch.float32),
    torch.tensor(val_y, dtype=torch.float32)
)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, pin_memory=False)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

input_dim = train_X.shape[2]
output_dim = 1
n_heads = 2
n_layers = 2
dropout = 0.1

model = TransformerModel(input_dim, output_dim, n_heads, n_layers, dropout).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

n_epochs = 10000
train_losses = []
val_losses = []
for epoch in range(n_epochs):
    # Train the model
    print(epoch)
    model.train()
    train_loss = 0.0
    for i, (inputs, targets) in enumerate(train_loader):
        inputs = inputs.cuda()
        targets = targets.cuda()
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1), targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * inputs.size(0)
    train_loss /= len(train_loader.dataset)
    train_losses.append(train_loss)

    # Evaluate the model on the validation set


RuntimeError                              Traceback (most recent call last)
in
    102         outputs = model(inputs)
    103         loss = criterion(outputs.view(-1), targets)
--> 104         loss.backward()
    105         optimizer.step()
    106         train_loss += loss.item() * inputs.size(0)

~/anaconda3/envs/torch/lib/python3.7/site-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    394             create_graph=create_graph,
    395             inputs=inputs)
--> 396         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    397
    398     def register_hook(self, hook):

~/anaconda3/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    173     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174         tensors, grad_tensors, retain_graph, create_graph, inputs,
--> 175         allow_unreachable=True, accumulate_grad=True)
    176
    177 def grad(

RuntimeError: CUDA error: unspecified launch failure

Maybe I have an error in the code? But the first few iterations run with no problem, and it runs with no problem on Colab or on other PCs.

Same error whether tensor cores are enabled or not, and with memory pinning true or false.

HaoLi111 avatar May 21 '23 18:05 HaoLi111


@eikaramba I notice there is sometimes a segmentation fault (though other programs run fine). I will try to write a large loop in another program that uses CUDA and report back.

HaoLi111 avatar May 21 '23 18:05 HaoLi111

Update: moving the graphics card to another machine solved the problem. On the same machine, updating the BIOS of the MSI 790 A wifi (DDR5 version) from .1.0 to 0.4.0 solved the problem.

Maybe my version of the problem is somewhat unrelated to torch, because simply adding and subtracting can fail given enough iterations.

Trying

using CUDA
N = 2^20
x = CUDA.fill(1.0f0, N)
y = CUDA.fill(2.0f0, N)


N3 = 50000
for k in 1:N3
N2 = 50000

for i in 1:N2
	CUDA.@sync y.+= x
	print("i=$i + k= $k")
end

for i in 1:N2
	CUDA.@sync y.-=x
	
	print("i=$i - k= $k")
end
end

print(Array(y))	

leads to the same error (note this code is in Julia, not Python, but I suppose the real problem is some segmentation fault, since it is essentially C underneath). So my guess: 1. maybe unstable memory; 2. maybe nvcc treats so many loops as an infinite loop? The last post in https://stackoverflow.com/questions/9901803/cuda-error-message-unspecified-launch-failure suggests the latter; I suppose maybe the memory grows beyond some limit after a certain number of iterations and the program gets killed?

@eikaramba how did you change the msi bios settings? is it possible to do it on linux?

HaoLi111 avatar May 21 '23 23:05 HaoLi111

I also encountered this error with an RTX 3090 (CUDA 12.1; the Docker image was based on nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04). I was training a face recognition model (MobileFaceNet on the Glint360k dataset) from insightface: https://github.com/deepinsight/insightface/tree/master/recognition/arcface_torch. Here are the logs; I hope they can be useful when fixing this bug:

Traceback (most recent call last):
  File "/scratch/train_v2.py", line 257, in <module>
    main(parser.parse_args())
  File "/scratch/train_v2.py", line 189, in main
    torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/clip_grad.py", line 76, in clip_grad_norm_
    torch._foreach_mul_(grads, clip_coef_clamped.to(device))  # type: ignore[call-overload]
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8589f9e4d7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8589f6836b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f858fc7ffa8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x13a0e (0x7f858fc50a0e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7f858fc5fd80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccec6 (0x7f8558416ec6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f8589f83e77 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f8589f7c69e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8589f7c7b9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10d::Reducer::~Reducer() + 0x2a4 (0x7f8544489014 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f8558a8c3c2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f85582f4f98 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0xb449c1 (0x7f8558a8e9c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x3b4bab (0x7f85582febab in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x3b5b1f (0x7f85582ffb1f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x29d90 (0x7f85e146ad90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #27: __libc_start_main + 0x80 (0x7f85e146ae40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

IzhanVarsky avatar Jun 18 '23 20:06 IzhanVarsky


"@eikaramba how did you change the msi bios settings? is it possible to do it on linux?"

You can update the BIOS just by following the procedure from your motherboard vendor's official website (MSI in this case).

Aphlios avatar Jun 20 '23 13:06 Aphlios

I have updated my bios version. And now my machine is running well.

Aphlios avatar Jun 20 '23 13:06 Aphlios

I met the same error today. My model seemed trained enough, so I tried to save it anyway, but any further computation requiring PyTorch led to the same error. I had to kill the process and restart model training to be able to use PyTorch again.

Sleipnir164 avatar Jun 23 '23 01:06 Sleipnir164

I downgraded my drivers from CUDA 12.1 to 11.4, which is the version on another computer with an RTX 3090 where I have no problems, but it didn't help. I also set this:

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
torch.backends.cudnn.enabled = False

The running speed decreased dramatically, but in the end it crashed with the same error:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: unspecified launch failure
Exception raised from record at ../aten/src/ATen/cuda/CUDAEvent.h:115 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7ff5a7f3020e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf3a88 (0x7ff5ea7f3a88 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0xf6ffe (0x7ff5ea7f6ffe in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: <unknown function> + 0x4635b8 (0x7ff5f9b3a5b8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7ff5a7f177a5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f485 (0x7ff5f9a36485 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6795c8 (0x7ff5f9d505c8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7ff5f9d50995 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: python3() [0x5ed6cb]
frame #9: python3() [0x5edb90]
frame #10: python3() [0x5446a8]
frame #11: python3() [0x61569c]
<omitting python frames>
frame #13: python3() [0x5011a6]
frame #15: python3() [0x50b07e]
frame #17: python3() [0x50b07e]
frame #20: python3() [0x50b1f0]
frame #26: python3() [0x67dbf1]
frame #27: python3() [0x67dc6f]
frame #28: python3() [0x67dd11]
frame #32: __libc_start_main + 0xf3 (0x7ff614d0b083 in /usr/lib/x86_64-linux-gnu/libc.so.6)

./entry.sh: line 3:     7 Aborted                 (core dumped) python3 ./run_resnext_training.py

And after this, another error from the next script I ran:

Traceback (most recent call last):
  File "./get_fail_images.py", line 64, in <module>
    run_test('.', 'train')
  File "./get_fail_images.py", line 53, in run_test
    value, pred = classifier.predict_label(img_path)
  File "/scratch/classifier.py", line 231, in predict_label
    return self.predict_label_for_pil_img(Image.open(img_path))
  File "/scratch/classifier.py", line 226, in predict_label_for_pil_img
    outputs = self.model(img_transformed[None, :]).softmax(1)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 275, in _forward_impl
    x = self.layer3(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 146, in forward
    out = self.conv1(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: basic_string::_S_construct null not valid
Exception ignored in tp_clear of: <class 'cell'>
TypeError: object.__init__() takes exactly one argument (the instance to initialize)
./entry.sh: line 4:  1092 Segmentation fault      (core dumped) python3 ./get_fail_images.py

The dmesg command shows this:

[84075.897779] NVRM: GPU at PCI:0000:01:00: GPU-9d8769e9-21ca-19df-b13c-82dd86299a8f
[84075.897783] NVRM: Xid (PCI:0000:01:00): 69, pid=2489, Class Error: ChId 0010, Class 0000c7c0, Offset 000001b0, Data 00000041, ErrorCode 00000053
[84673.018140] python3[6621]: segfault at 0 ip 0000000000000000 sp 00007ffd994b31c8 error 14 in python3.8[400000+23000]
[84673.018147] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[84673.089158] docker0: port 1(veth94bb7bc) entered disabled state
[84673.089185] veth4ddace2: renamed from eth0
[84673.204877] docker0: port 1(veth94bb7bc) entered disabled state
[84673.205099] device veth94bb7bc left promiscuous mode
[84673.205101] docker0: port 1(veth94bb7bc) entered disabled state

My environment:
CPU: AMD Ryzen 9 7950X
GPU: RTX 3090
CUDA driver: 470.182.03 (11.4)
Motherboard: X670 AORUS ELITE AX (BIOS F5)
Docker version 24.0.2, build cb74dfc
Base image: nvcr.io/nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04
Memory: 64G DDR5

I'm going to update my BIOS soon

IzhanVarsky avatar Jun 28 '23 20:06 IzhanVarsky

Updating BIOS really helped! I've trained a model continuously for 2 days and had no crash with this error.

IzhanVarsky avatar Jul 01 '23 23:07 IzhanVarsky

I have updated my bios version. And now my machine is running well.

In my case, so far so good after updating BIOS!

JANGSOONMYUN avatar Sep 01 '23 05:09 JANGSOONMYUN

The BIOS update helped me as well! When I'm training multiple models at the same time and pushing my desktop to its limit, I can still get the error, but it's much better than before!

norton-chris avatar Sep 12 '23 01:09 norton-chris