RuntimeError: CUDA error: unspecified launch failure
Issue description
RuntimeError: CUDA error: unspecified launch failure. The error occurs on any training script. Occurrence is not deterministic; it can occur at any time during the course of training. All the code works fine on an RTX 3090.
/lib/python3.8/site-packages/torch/autograd/__init__.py
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: unspecified launch failure
System Info
PyTorch version: 1.10.2+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.15.15-76051515-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 470.86
cuDNN version: Probably one of the following:
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.10.2+cu113
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.11.3+cu113
[conda] Could not collect
- PyTorch or Caffe2: PyTorch
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source):
- OS: Pop OS 20.04
- PyTorch version: '1.10.2+cu113'
- Python version: 3.8
- CUDA/cuDNN version: 11.3
- GPU models and configuration: A6000
- GCC version (if compiling from source):
- CMake version:
- Versions of any other relevant libraries:
cc @csarofeen @ptrblck @xwang233 @ngimel
Hi, do you have a code snippet which we can use to reproduce the issue?
@anidh
Hi @H-Huang
I'm using the ultralytics yolov5 repo to train the model. The command which I'm using to train the model is
python train.py --img 640 --batch 32 --epochs 400 --data idd.yaml --weights yolov5x.pt --rect --image-weights --evolve --device 0,1 --multi-scale --name demo-img --patience 30 --save-period 1 --worker 22 --quad

The error is very random; it can happen at the very 1st epoch or at the 10th epoch, and there is no certain way to know when it'll happen.
The CPU which I'm using is an AMD Ryzen Threadripper PRO 3975WX.
Error encountered after replacing the A6000 with an RTX 3090:
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from record at /home/avlabs_blue/pytorch/aten/src/ATen/cuda/CUDAEvent.h:119 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Aborted (core dumped)
Environment Details

Collecting environment information...
PyTorch version: 1.12.0a0+gitd5744f4
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.16.11-76051611-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.54
cuDNN version: Probably one of the following:
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.12.0a0+gitd5744f4
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.13.0a0+00c119c
[conda] magma-cuda110 2.5.2 1 pytorch
[conda] mkl 2022.0.1 h06a4308_117
[conda] mkl-include 2022.0.1 h06a4308_117
[conda] numpy 1.21.2 py38hd8d4704_0
[conda] numpy-base 1.21.2 py38h2b8c604_0
[conda] torch 1.12.0a0+gitd5744f4 dev_0
A much simplified code that can be run is in this repo.
We run the command CUDA_LAUNCH_BLOCKING=1 python train_cifar10.py --net res101 --bs 256
File "/home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/_tensor.py", line 399, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([256, 256, 8, 8], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xa5c1afc0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 256, 256, 8, 8,
    strideA = 16384, 64, 8, 1,
output: TensorDescriptor 0xa7ab5f40
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 256, 256, 8, 8,
    strideA = 16384, 64, 8, 1,
weight: FilterDescriptor 0x7f44a46edba0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 256, 256, 3, 3,
Pointer addresses:
    input: 0x7f4523800000
    output: 0x7f4524800000
    weight: 0x7f4613e80000
Additional pointer addresses:
    grad_output: 0x7f4524800000
    grad_weight: 0x7f4613e80000
Backward filter algorithm: 5
Hi @prabhatkumar95 , thanks for reporting this issue. I saw the pytorch version you have is
PyTorch version: 1.12.0a0+gitd5744f4
Is debug build: False
CUDA used to build PyTorch: 11.2
Did you build pytorch from source? CUDA 11.2 is not recommended. I'd suggest you try with the latest CUDA 11.6 and CUDNN 8.3.3 if you prefer to build from source, or a pre-built pip wheel with CUDA 11.5 on https://pytorch.org/get-started/locally/ (just replace every 113 with 115 in the link).
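Before rebuilding, it can also help to confirm which CUDA/cuDNN versions the installed binary was actually built against, since they can differ from the system-wide toolkit reported by nvcc or nvidia-smi. A minimal check (standard torch APIs, not specific to this issue):

```python
import torch

# The versions PyTorch itself was compiled against, which may differ
# from the system-wide CUDA toolkit or driver version.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)       # None for CPU-only builds
print("cuDNN:", torch.backends.cudnn.version())     # e.g. 8302 for cuDNN 8.3.2
print("CUDA available:", torch.cuda.is_available())
```

Comparing this output against the driver version from nvidia-smi is a quick way to spot a mismatched install.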
Hi @xwang233, I tried building from source as well as using the docker image from here, but it's still the same error.
Hey there. I have had the same issue. This has been happening consistently across different code bases (after a random number of epochs I get a CUDA error, most commonly "unspecified launch failure"). The same code, however, runs completely fine on my friend's PC, although he is also using a 3090 GPU. My first thought was that my GPU might be broken, so we switched GPUs. But nothing changed: it runs on his but not on mine. I installed PyTorch the recommended way via https://pytorch.org/get-started/locally/. I originally come from this issue: https://github.com/pytorch/pytorch/issues/27837, where it was suggested in the end that it might be related to using AMD processors. Indeed I have an AMD processor and my friend's PC uses an Intel CPU. Can anyone deny or support this theory? Any help is very welcome. I have tried debugging this error for the past few weeks without success. My only hope now is that it's actually AMD-CPU related. However, I would like to be really sure before exchanging my CPU and motherboard for only that reason.
Hi @municola, I have an Intel® Core™ i7-10700K CPU @ 3.80GHz × 16 and an RTX 3090, and last night I encountered this error. I didn't have it for the same code repo before, and I still don't know if it comes from my code change or some random factor.
I also experienced these issues across various projects (all using torch), using a machine with 4x 3090s and an AMD Ryzen Threadripper 3960X 24-Core Processor, NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6.
The only pattern I noticed repeatedly is that it seems to happen more often when I use more than 2 GPUs. Whenever I only used 2x GPUs, it hasn't happened yet. Besides that, it occurs at random times and across different repos (e.g. using different torch versions) and projects (e.g. in both computer vision and NLP pipelines).
Also experiencing this issue on an RTX 4090!
Same exact scenario as people described here #27837, where the error always occurs randomly anywhere from 20 minutes to 20 hours into training.
Sometimes
unspecified launch failure,
and sometimes
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP).
I have tried each of the following and haven't been able to eliminate this error yet:
- OS: Both native Ubuntu and also WSL
- Pytorch: Nightly (2.0.0.dev20230226+cu118), and manually building from source with cuda 12
- Driver: 520.56 and 525.89
Even though this is brand new hardware I've tried to rule out any potential hardware issue, so I have also tested:
- GPU power limit reduced to 50%
- Application memory usage reduced 50%
- Running gpu-burner with no errors
The logs running dmesg shows Xid error 109:
NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014
Xid 109 does not even exist in NVIDIA's documentation??
Any insight on this issue would be greatly appreciated, thanks!
In my case, it seems the issue may have had to do with pinned memory. After disabling it I have not yet encountered the error. If anyone tries this please let me know if it fixed it for you.
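For anyone wanting to try the same workaround: pinned (page-locked) host memory is usually enabled through the DataLoader's `pin_memory` flag. A minimal sketch of turning it off (the toy dataset here is just a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one.
dataset = TensorDataset(torch.randn(128, 8), torch.randn(128, 1))

# pin_memory=False disables page-locked staging buffers on the host.
# Host-to-GPU copies become pageable (slightly slower) but avoid the
# pinned-memory path suspected above.
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=False)

for inputs, targets in loader:
    pass  # training step would go here
```

`pin_memory` defaults to False, so it only needs changing if the code sets it to True explicitly (YOLO-style training scripts often do).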
@PWhiddy I am facing a similar issue while training yolov8 for segmentation on a custom dataset. I have been trying to solve this issue for the past week but am still unable to resolve it. Could you point out how you resolved it by disabling pinned memory? Was this a change in CUDA pinned memory or the PyTorch one?
Thanks & Regards

@rajpalaakash try running with cuda launch blocking to see if there’s a more specific error.
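A minimal way to do that from inside the script; note the variable must be set before CUDA is initialized, i.e. before the first CUDA call:

```python
import os

# Must be set before torch initializes CUDA, so do it before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (deliberately imported after setting the env var)

# With launch blocking on, every kernel launch is synchronous, so the
# stack trace points at the op that actually failed rather than a later
# unrelated API call. Expect training to run noticeably slower.
```

Setting it on the command line (`CUDA_LAUNCH_BLOCKING=1 python train.py ...`) is equivalent.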
I'm encountering the same issue when running my code on an RTX 4090 GPU. Interestingly, the code executes without any errors on both RTX 2080 Ti and A6000 GPUs. The error appears after a random number of epochs, interrupting the training process.
Hi There, We solved this issue by changing our RAM stick. One of the RAM sticks was corrupted and had to be changed. We performed an experiment where we kept the GPU same and changed the RAM stick and this issue stopped happening. I'll advise you to try and remove RAM stick one by one and check if this issue stops coming.
another possible cause fixed here: on windows i optimized the gpu profile via msi afterburner. Meaning lower voltages for specific mhz. getting the error, but after resetting the profile to standard running pytorch was no problem anymore.
i would have never thought it is because of that as the system is stable for nearly 2 years with the optimized profile. but yeahh... maybe someone else also has the same error and cause
An easy replication is to run any model repeatedly (a small model nested in a for loop, using torch to clear the cache every time). First, the small version works with no problem (so there shouldn't be a dimension mismatch etc.). Then run it repeatedly in a loop, each time deleting the model reference, using torch to empty the cache, and forcing gc. Even when memory is under control it will fail (not at the start, but usually after 5-20 minutes). The same model and loop run fine on Colab (albeit slowly). The error is one of "unspecified launch failure" or "an illegal memory access was encountered". If you are in a Jupyter notebook, you cannot continue to run any CUDA code after getting the error: without restarting the notebook kernel, every run of something .cuda() or .to(device) will immediately return the same error. If not using a Jupyter notebook, you have to restart the process. Torch is installed with
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
I have CUDA 11.4; I am trying to change that to 11.6 to see what happens (just in case the new CUDA works even if torch doesn't say it should). Ubuntu 22.04, driver 470 proprietary, 3090 + i9-13900K + 64G DDR5 on an MSI Z790, 1 kW PSU, default BIOS. GPU temp stays below 77 °C. Cooling is 4 Noctua fans on a Corsair 4000 Airflow.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

torch.backends.cudnn.enabled = False

# Step 1: Load the data
data = np.random.rand(5000, 20)
nVar = data.shape[1]

# Step 2: Create training and validation sets
train_data = data[:3000, :]
val_data = data[3000:, :]

# Step 3: Normalize the data
train_mean = train_data.mean(axis=0)
train_std = train_data.std(axis=0)
train_data = (train_data - train_mean) / train_std
val_data = (val_data - train_mean) / train_std

# Step 4: Create sequences of input data and target values
def create_sequences(data, seq_len):
    X = []
    y = []
    for i in range(seq_len, len(data)):
        X.append(data[i-seq_len:i, :])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

seq_len = 20
train_X, train_y = create_sequences(train_data, seq_len)
val_X, val_y = create_sequences(val_data, seq_len)

# Step 5: Define the Transformer model
class TransformerModel(nn.Module):
    def __init__(self, input_dim, output_dim, n_heads, n_layers, dropout):
        super(TransformerModel, self).__init__()
        self.transformer_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=input_dim,
                nhead=n_heads,
                dropout=dropout
            ),
            num_layers=n_layers
        )
        self.decoder = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        x = self.transformer_encoder(x)
        x = self.decoder(x[:, -1, :])
        return x

# Step 6: Train the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_dataset = torch.utils.data.TensorDataset(
    torch.tensor(train_X, dtype=torch.float32),
    torch.tensor(train_y, dtype=torch.float32)
)
val_dataset = torch.utils.data.TensorDataset(
    torch.tensor(val_X, dtype=torch.float32),
    torch.tensor(val_y, dtype=torch.float32)
)
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, pin_memory=False)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

input_dim = train_X.shape[2]
output_dim = 1
n_heads = 2
n_layers = 2
dropout = 0.1
model = TransformerModel(input_dim, output_dim, n_heads, n_layers, dropout).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
n_epochs = 10000
train_losses = []
val_losses = []
for epoch in range(n_epochs):
    # Train the model
    print(epoch)
    model.train()
    train_loss = 0.0
    for i, (inputs, targets) in enumerate(train_loader):
        inputs = inputs.cuda()
        targets = targets.cuda()
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1), targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * inputs.size(0)
    train_loss /= len(train_loader.dataset)
    train_losses.append(train_loss)
    # Evaluate the model on the validation set
RuntimeError Traceback (most recent call last)
~/anaconda3/envs/torch/lib/python3.7/site-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    394         create_graph=create_graph,
    395         inputs=inputs)
--> 396     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    397
    398     def register_hook(self, hook):

~/anaconda3/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    173     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174         tensors, grad_tensors, retain_graph, create_graph, inputs,
--> 175         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
    176
    177 def grad(
RuntimeError: CUDA error: unspecified launch failure
Maybe I have an error in the code? But the first few iterations run with no problem, and it runs fine on Colab or on other PCs.
The same error occurs whether tensor cores are enabled or not, and with memory pinning set to true or false.
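The replication recipe described above (small model in a loop, deleting the reference and clearing the cache each iteration) could be sketched roughly like this; the tiny model and iteration count are placeholders, and it falls back to CPU when CUDA is absent:

```python
import gc
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

for step in range(3):  # raise the count to reproduce; kept small here
    # Small stand-in model; per the report, any model eventually fails.
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
    out = model(torch.randn(32, 64, device=device))
    out.sum().backward()

    # Tear everything down, force gc, and release cached GPU blocks,
    # as described in the replication steps above.
    del model, out
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

On an affected machine this loop reportedly fails after minutes, despite memory being fully released each iteration.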
I notice there is sometimes a segmentation fault (but other programs run fine). I will try to write a large loop in another program that uses CUDA and get back.
Update: moving the graphics card to another machine solves the problem. On the same machine, updating the BIOS of the MSI 790 A WiFi (DDR5 version) from .1.0 to 0.4.0 solves the problem.
Maybe my version of the problem is somewhat unrelated to torch, because just simple adding and subtracting can fail once we add enough iterations.
Trying
using CUDA
N = 2^20
x = CUDA.fill(1.0f0, N)
y = CUDA.fill(2.0f0, N)
N3 = 50000
for k in 1:N3
    N2 = 50000
    for i in 1:N2
        CUDA.@sync y .+= x
        print("i=$i + k= $k")
    end
    for i in 1:N2
        CUDA.@sync y .-= x
        print("i=$i - k= $k")
    end
end
print(Array(y))
leads to the same error (note this code is in Julia, not Python, but I suppose the real problem is some segmentation fault, since it's essentially in C). So my guess: 1. maybe unstable memory; 2. maybe nvcc is handling so many loops that it thinks it is an infinite loop? The last post in https://stackoverflow.com/questions/9901803/cuda-error-message-unspecified-launch-failure suggests this; I suppose maybe after a certain number of iterations memory goes beyond some limit and it kills the program?
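For anyone without Julia, a rough PyTorch equivalent of the loop above (iteration counts reduced here; it falls back to CPU when CUDA is absent, and on a healthy setup y ends back at all 2.0s since the adds and subtracts cancel exactly in float32):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
N = 2 ** 20
x = torch.full((N,), 1.0, device=device)
y = torch.full((N,), 2.0, device=device)

# Repeated elementwise add/subtract, synchronizing each step the way
# CUDA.@sync does in the Julia snippet.
for _ in range(100):   # bump these counts up to stress the GPU for longer
    y += x
    if device == "cuda":
        torch.cuda.synchronize()
for _ in range(100):
    y -= x
    if device == "cuda":
        torch.cuda.synchronize()

print(y[:4])  # all elements should be back to exactly 2.0
```

If this trivial kernel also dies with "unspecified launch failure" after enough iterations, that points at the driver, BIOS, or hardware rather than any one framework.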
@eikaramba how did you change the msi bios settings? is it possible to do it on linux?
I also encountered this error with an RTX 3090 (CUDA 12.1; the docker image was based on nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04). I was training a face recognition model (MobileFaceNet on the Glint360k dataset) from insightface: https://github.com/deepinsight/insightface/tree/master/recognition/arcface_torch. Here are the logs; I hope they can be useful when fixing this bug:
Traceback (most recent call last):
File "/scratch/train_v2.py", line 257, in <module>
main(parser.parse_args())
File "/scratch/train_v2.py", line 189, in main
torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/clip_grad.py", line 76, in clip_grad_norm_
torch._foreach_mul_(grads, clip_coef_clamped.to(device)) # type: ignore[call-overload]
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
[W CUDAGuardImpl.h:124] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8589f9e4d7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8589f6836b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f858fc7ffa8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x13a0e (0x7f858fc50a0e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7f858fc5fd80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccec6 (0x7f8558416ec6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7f8589f83e77 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f8589f7c69e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8589f7c7b9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: c10d::Reducer::~Reducer() + 0x2a4 (0x7f8544489014 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f8558a8c3c2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f85582f4f98 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0xb449c1 (0x7f8558a8e9c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x3b4bab (0x7f85582febab in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x3b5b1f (0x7f85582ffb1f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x29d90 (0x7f85e146ad90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #27: __libc_start_main + 0x80 (0x7f85e146ae40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
@eikaramba how did you change the msi bios settings? is it possible to do it on linux?

You can update the BIOS by following the procedure from the official website (MSI's, or whichever the board manufacturer is).
I have updated my bios version. And now my machine is running well.
I met the same error today. My model seemed to be trained enough, so I tried to save it anyway, but any further computation requiring PyTorch led to the same error. I had to kill the process and restart model training to be able to use PyTorch again.
I downgraded my drivers from CUDA 12.1 to 11.4, which is the version on another computer with an RTX 3090 where I have no problems. But it didn't help. I also set this:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
torch.backends.cudnn.enabled = False
The running speed decreased dramatically, but in the end it failed with the same error:
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: unspecified launch failure
Exception raised from record at ../aten/src/ATen/cuda/CUDAEvent.h:115 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7ff5a7f3020e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf3a88 (0x7ff5ea7f3a88 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: <unknown function> + 0xf6ffe (0x7ff5ea7f6ffe in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: <unknown function> + 0x4635b8 (0x7ff5f9b3a5b8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7ff5a7f177a5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f485 (0x7ff5f9a36485 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6795c8 (0x7ff5f9d505c8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7ff5f9d50995 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: python3() [0x5ed6cb]
frame #9: python3() [0x5edb90]
frame #10: python3() [0x5446a8]
frame #11: python3() [0x61569c]
<omitting python frames>
frame #13: python3() [0x5011a6]
frame #15: python3() [0x50b07e]
frame #17: python3() [0x50b07e]
frame #20: python3() [0x50b1f0]
frame #26: python3() [0x67dbf1]
frame #27: python3() [0x67dc6f]
frame #28: python3() [0x67dd11]
frame #32: __libc_start_main + 0xf3 (0x7ff614d0b083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
./entry.sh: line 3: 7 Aborted (core dumped) python3 ./run_resnext_training.py
And after this another error on the following running script:
Traceback (most recent call last):
File "./get_fail_images.py", line 64, in <module>
run_test('.', 'train')
File "./get_fail_images.py", line 53, in run_test
value, pred = classifier.predict_label(img_path)
File "/scratch/classifier.py", line 231, in predict_label
return self.predict_label_for_pil_img(Image.open(img_path))
File "/scratch/classifier.py", line 226, in predict_label_for_pil_img
outputs = self.model(img_transformed[None, :]).softmax(1)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 275, in _forward_impl
x = self.layer3(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 146, in forward
out = self.conv1(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: basic_string::_S_construct null not valid
Exception ignored in tp_clear of: <class 'cell'>
TypeError: object.__init__() takes exactly one argument (the instance to initialize)
./entry.sh: line 4: 1092 Segmentation fault (core dumped) python3 ./get_fail_images.py
The dmesg command shows this:
[84075.897779] NVRM: GPU at PCI:0000:01:00: GPU-9d8769e9-21ca-19df-b13c-82dd86299a8f
[84075.897783] NVRM: Xid (PCI:0000:01:00): 69, pid=2489, Class Error: ChId 0010, Class 0000c7c0, Offset 000001b0, Data 00000041, ErrorCode 00000053
[84673.018140] python3[6621]: segfault at 0 ip 0000000000000000 sp 00007ffd994b31c8 error 14 in python3.8[400000+23000]
[84673.018147] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[84673.089158] docker0: port 1(veth94bb7bc) entered disabled state
[84673.089185] veth4ddace2: renamed from eth0
[84673.204877] docker0: port 1(veth94bb7bc) entered disabled state
[84673.205099] device veth94bb7bc left promiscuous mode
[84673.205101] docker0: port 1(veth94bb7bc) entered disabled state
My environment:
CPU: AMD Ryzen 9 7950X
GPU: RTX 3090
CUDA driver: 470.182.03 (11.4)
Motherboard: X670 AORUS ELITE AX (BIOS F5)
Docker version 24.0.2, build cb74dfc
Base image: nvcr.io/nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04
Memory: 64G DDR5
I'm going to update my BIOS soon
Updating BIOS really helped! I've trained a model continuously for 2 days and had no crash with this error.
I have updated my bios version. And now my machine is running well.
In my case, so far so good after updating BIOS!
The BIOS update helped me as well! But when I'm training multiple models at the same time and pushing my desktop to its limit, I can still get the error. Still, it's much better than before!
