
CUDA: Illegal memory access in `torch.linalg.solve()`

Open · GuillaumeRochette opened this issue 2 years ago · 11 comments

🐛 Describe the bug

Hi,

My program randomly terminates during both training and validation, and I strongly suspect that torch.linalg.solve() is the cause, although I have not been able to reproduce the bug in a simple script with a for loop. Here's a snippet for convenience (ignore the edge case where the rows of x would be linearly dependent, as this does not happen in the actual function):

import torch

device = torch.device("cuda")

x = torch.randn(64, 81, 9, 5, device=device)
y = torch.randn(64, 81, 9, 1, device=device)

A = x.transpose(-1, -2) @ x
B = x.transpose(-1, -2) @ y
b = torch.linalg.solve(A, B)
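
For reference, the kind of standalone loop I tried looks roughly like this (a sketch; the iteration count is arbitrary, and it did not reproduce the crash):

import torch

device = torch.device("cuda")

# Repeatedly build and solve the normal equations with fresh random data,
# in the hope of triggering the illegal memory access outside the full program.
for _ in range(10000):
    x = torch.randn(64, 81, 9, 5, device=device)
    y = torch.randn(64, 81, 9, 1, device=device)
    A = x.transpose(-1, -2) @ x
    B = x.transpose(-1, -2) @ y
    b = torch.linalg.solve(A, B)
torch.cuda.synchronize()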

In the actual training run, the following error is consistently raised after enough epochs:

CUDA runtime error: an illegal memory access was encountered (700) in apply_lu_factor_batched_magma at /opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/BatchLinearAlgebra.cpp:1910                         
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:944                                         
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:945                                         
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda113_1619629459349/work/interface_cuda/interface.cpp:946                                         
[...]
File "x.py", line 164, in g
    b = torch.linalg.solve(A, B)
RuntimeError: CUDA error: an illegal memory access was encountered                                                 
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.                                                                                                     
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
[...]
File "x.py", line 164, in g
    b = torch.linalg.solve(A, B)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1646756402876/work/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f3e10d9d1bd in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1f037 (0x7f3e433ea037 in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x23a (0x7f3e433ee3ea in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2ecdd8 (0x7f3e93de3dd8 in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f3e10d83fb5 in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x1db769 (0x7f3e93cd2769 in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4c6c8c (0x7f3e93fbdc8c in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f3e93fbdf92 in /workspace/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x129bcb (0x5555ecf2cbcb in /workspace/miniconda3/bin/python)
frame #9: <unknown function> + 0x2429aa (0x5555ed0459aa in /workspace/miniconda3/bin/python)
frame #10: <unknown function> + 0x129d5b (0x5555ecf2cd5b in /workspace/miniconda3/bin/python)
frame #11: <unknown function> + 0x194655 (0x5555ecf97655 in /workspace/miniconda3/bin/python)
frame #12: <unknown function> + 0x129bcb (0x5555ecf2cbcb in /workspace/miniconda3/bin/python)
frame #13: <unknown function> + 0x2429aa (0x5555ed0459aa in /workspace/miniconda3/bin/python)
frame #14: <unknown function> + 0x129d5b (0x5555ecf2cd5b in /workspace/miniconda3/bin/python)
frame #15: <unknown function> + 0x194655 (0x5555ecf97655 in /workspace/miniconda3/bin/python)
frame #16: <unknown function> + 0x129bcb (0x5555ecf2cbcb in /workspace/miniconda3/bin/python)
frame #17: <unknown function> + 0x2429aa (0x5555ed0459aa in /workspace/miniconda3/bin/python)
frame #18: <unknown function> + 0x129d5b (0x5555ecf2cd5b in /workspace/miniconda3/bin/python)
frame #19: <unknown function> + 0x12a950 (0x5555ecf2d950 in /workspace/miniconda3/bin/python)
frame #20: <unknown function> + 0x13a9dd (0x5555ecf3d9dd in /workspace/miniconda3/bin/python)
frame #21: _PyGC_CollectNoFail + 0x35 (0x5555ed05d705 in /workspace/miniconda3/bin/python)
frame #22: <unknown function> + 0x2744ba (0x5555ed0774ba in /workspace/miniconda3/bin/python)
frame #23: Py_FinalizeEx + 0x186 (0x5555ed0777a6 in /workspace/miniconda3/bin/python)
frame #24: Py_RunMain + 0x10c (0x5555ed07ce8c in /workspace/miniconda3/bin/python)
frame #25: Py_BytesMain + 0x39 (0x5555ed07d309 in /workspace/miniconda3/bin/python)
frame #26: __libc_start_main + 0xf3 (0x7f3ec6e2a0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x2010a0 (0x5555ed0040a0 in /workspace/miniconda3/bin/python)

This happened in both single-GPU and multi-GPU (DDP) settings. Am I better off using something like this, even though it is not recommended?

b = A.inverse() @ B

If you have any pointers on what I should be doing instead, I'd be glad to hear them :)

Best regards,

Versions

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.13.0-40-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] pytorch-lightning==1.6.3
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchmetrics==0.8.2
[pip3] torchvision==0.12.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h2bc3f7f_2
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.5 py39he7a7128_2
[conda] numpy-base 1.21.5 py39hf524024_2
[conda] pytorch 1.11.0 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-lightning 1.6.3 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.11.0 py39_cu113 pytorch
[conda] torchmetrics 0.8.2 pypi_0 pypi
[conda] torchvision 0.12.0 py39_cu113 pytorch

cc @ezyang @gchanan @zou3519 @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano

GuillaumeRochette · May 16 '22 08:05

I got the same problem.

ZuoJiaxing · May 17 '22 19:05

I had the same problem with PyTorch 1.12.0 and CUDA 11.6.

quancs · Aug 02 '22 07:08

I'm also encountering this problem with PyTorch 1.12.0 and CUDA 11.6: "CUDA runtime error: an illegal memory access was encountered (700) in apply_lu_factor_batched_magma..."

The problem does not occur every time the "apply_lu_factor_batched_magma" function is called, but this function is called for every minibatch by some Kornia data augmentation functions, so eventually most training runs crash with this error. This is with toy-sized models on MNIST data that use <25% of GPU memory.

Philip-Bachman · Aug 18 '22 16:08

cc @IvanYashchuk

ngimel · Aug 18 '22 19:08

A temporary workaround is to set torch.backends.cuda.preferred_linalg_library("cusolver") in the script.
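
For example, applied to the snippet from the issue (a sketch, assuming a PyTorch build that exposes this setting, i.e. 1.12+):

import torch

# Dispatch linalg routines to cuSOLVER instead of the MAGMA path
# shown in the traceback above.
torch.backends.cuda.preferred_linalg_library("cusolver")

device = torch.device("cuda")
x = torch.randn(64, 81, 9, 5, device=device)
y = torch.randn(64, 81, 9, 1, device=device)

A = x.transpose(-1, -2) @ x
B = x.transpose(-1, -2) @ y
b = torch.linalg.solve(A, B)  # now routed through cuSOLVER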

IvanYashchuk · Aug 18 '22 20:08

Does this still happen in master? This function was improved quite a bit after the 1.12 release.

lezcano · Aug 18 '22 21:08

Actually, yes, this path is also used in master

lezcano · Aug 18 '22 21:08

As another workaround, setting CUDA_LAUNCH_BLOCKING=1 makes the error disappear in my environment.
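
The variable can be exported in the shell before launching, or set at the very top of the entry-point script, e.g. (a sketch; it must be set before the first CUDA call):

import os

# Must be in the environment before CUDA is initialized,
# i.e. before any tensor is created on or moved to the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch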

soskek · Aug 23 '22 22:08

Just hit this bug as well on 1.12.1 with CUDA 11.6.2. CUDA_LAUNCH_BLOCKING=1 is not really a solution, as it typically bottlenecks training massively. Trying the preferred backend proposed by @IvanYashchuk. Is there a fix planned for master?

jramapuram · Sep 01 '22 18:09

I also encountered the problem during training, and now I can't reproduce it. I'm using torch 1.12.0+cu102. It happened in https://kornia.readthedocs.io/en/latest/geometry.transform.html, which uses torch.linalg.solve internally.

ddanevskyi · Sep 21 '22 05:09

@IvanYashchuk, might this one be related to https://github.com/pytorch/pytorch/issues/82894 at all?

lezcano · Sep 21 '22 08:09