
CUDA illegal memory access was encountered with Dorado v0.7.1

Open · fehofman opened this issue on Jun 04 '24 · 5 comments

Issue Report

Please describe the issue:

Hi,

we just saw that Dorado v0.7.1 is out and that it includes fixes for the auto batch size calculation with modbase models when calling on multiple GPUs. Unfortunately, we still receive a CUDA error when we run SUP + modification calling (m6A, pseU) on more than one GPU.

The workaround of reducing the batch size to 320, as proposed by @charlotte-ht in issue #842, is still needed.

Any ideas why this error persists? Is there a timeline for a fix?

Thank you for looking into this, and for your continued efforts to optimize Dorado and to provide basecalling for the RNA004 kit!

Steps to reproduce the issue:

dorado basecaller sup,pseU,m6A ./UHRR_pod5_test/ --device "cuda:0,cuda:1" > ./test_UHRR_output/test.bam
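
For reference, the batch-size workaround referred to above amounts to adding `-b 320` to the same command. This is a sketch based on the commands reported later in this thread:

```bash
# Same multi-GPU SUP + modbase run with the batch size capped at 320
# (the workaround from issue #842); this completes, whereas the run with
# an auto-selected batch size crashes.
dorado basecaller sup,pseU,m6A ./UHRR_pod5_test/ -b 320 --device "cuda:0,cuda:1" > ./test_UHRR_output/test.bam
```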

Run environment:

  • Dorado version: v0.7.1
  • Operating system: Ubuntu 20.04.6 LTS
  • Hardware (CPUs, Memory, GPUs): NVIDIA A100-SXM4-80GB GPUs; nvidia-smi: NVIDIA-SMI 450.248.02, Driver Version 450.248.02, CUDA Version 11.0
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5; UHRR
  • Source data location (on device or networked drive - NFS, etc.): Local SSD
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): RNA004; ~25k reads; size around 1GB

Logs

[2024-06-04 13:22:15.851] [info] Normalised: overlap 500 -> 492
[2024-06-04 13:22:15.851] [info] > Creating basecall pipeline
[2024-06-04 13:22:15.852] [info] - BAM format does not support U, so RNA output files will include T instead of U for all file types.
[2024-06-04 13:22:27.997] [info] cuda:0 using chunk size 18432, batch size 384
[2024-06-04 13:22:28.279] [info] cuda:0 using chunk size 9216, batch size 384
[2024-06-04 13:22:28.727] [error] CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8ce188b9b7 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8cdae10115 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8ce1855958 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0xa9f1683 (0x7f8ce1844683 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: + 0xa9f6f64 (0x7f8ce1849f64 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: + 0xa9f893e (0x7f8ce184b93e in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: + 0xa9f8cce (0x7f8ce184bcce in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x89b54b5 (0x7f8cdf8084b5 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: + 0x89befa9 (0x7f8cdf811fa9 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: + 0x89bf354 (0x7f8cdf812354 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: + 0x89c00db (0x7f8cdf8130db in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: + 0x89a54ca (0x7f8cdf7f84ca in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x96 (0x7f8cdf7f8b16 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: + 0xa632127 (0x7f8ce1485127 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: + 0xa6321e0 (0x7f8ce14851e0 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #15: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x23d (0x7f8cdc37ae9d in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #16: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0x1505 (0x7f8cdb739cf5 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: + 0x58a7496 (0x7f8cdc6fa496 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #18: + 0x58a7517 (0x7f8cdc6fa517 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #19: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef<c10::SymInt>, c10::ArrayRef, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) + 0x29b (0x7f8cdbf1d0fb in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #20: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0x21d (0x7f8cdb72dd3d in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #21: + 0x58a6f55 (0x7f8cdc6f9f55 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #22: + 0x58a6fbf (0x7f8cdc6f9fbf in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #23: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef<c10::SymInt>, c10::ArrayRef, bool, c10::ArrayRef<c10::SymInt>, long) + 0x223 (0x7f8cdbf1c443 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #24: at::native::conv1d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0x1c5 (0x7f8cdb730f35 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #25: + 0x5a57b31 (0x7f8cdc8aab31 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #26: at::_ops::conv1d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0x20c (0x7f8cdc37898c in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #27: torch::nn::Conv1dImpl::forward(at::Tensor const&) + 0x3a0 (0x7f8cdee84a40 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #28: dorado() [0xa1949e]
frame #29: dorado() [0xa30ce6]
frame #30: dorado() [0xa0a732]
frame #31: dorado() [0xa41a99]
frame #32: dorado() [0xa41bc8]
frame #33: dorado() [0xa40b4b]
frame #34: dorado() [0x96be01]
frame #35: dorado() [0x875868]
frame #36: dorado() [0x8310ab]
frame #37: + 0x114df (0x7f8cd69f54df in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #38: dorado() [0x875c7f]
frame #39: dorado() [0x837e00]
frame #40: + 0x1196e380 (0x7f8ce87c1380 in /opt/apps/dorado/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #41: + 0x8609 (0x7f8cd69ec609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #42: clone + 0x43 (0x7f8cd5e46353 in /lib/x86_64-linux-gnu/libc.so.6)

fehofman · Jun 04 '24

@fehofman do you think you could update your drivers? 450 is from 2020 https://endoflife.date/nvidia

iiSeymour · Jun 04 '24

> @fehofman do you think you could update your drivers? 450 is from 2020 https://endoflife.date/nvidia

Hi there, thanks for the suggestion. The system was upgraded to NVIDIA-SMI 535.161.08, Driver Version 535.161.08, CUDA Version 12.2. Unfortunately, basecalling still fails unless the batch size is set manually:

  • `dorado basecaller sup,m6A,pseU ./UHRR_pod5_test/ -b 320 --device "cuda:0,cuda:1" > ./test_UHRR_output/test.bam` -> finishes basecalling
  • `dorado basecaller sup,m6A,pseU ./UHRR_pod5_test/ --device "cuda:0,cuda:1" > ./test_UHRR_output/test.bam` -> crashes with `[2024-06-04 19:16:41.199] [error] Cuda error: an illegal memory access was encountered`

May I also suggest updating the minimum supported driver versions on the GitHub front page? We usually hesitate to update drivers unless required, and initially dismissed updating because the front page still lists 450 for Ubuntu. Thanks!

charlotte-ht · Jun 04 '24

Hi, I am having the same error with Dorado 0.7.1 on linux-x64. I tried `-b 320 --chunksize 9996` and it is still failing. I am using a system with a T1000 and an A800 (NVIDIA-SMI 545.23.08, Driver Version 545.23.08, CUDA Version 12.3).

Any suggestions?
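
One thing that might be worth checking on a mixed T1000/A800 system is how differently sized the two cards are; this is only a diagnostic sketch, not a confirmed fix, and the device index, input directory and output path below are placeholders:

```bash
# Show per-GPU memory; the T1000 has far less memory than the A800, so a batch
# size that fits one card may not fit the other.
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv

# Test run pinned to a single GPU with a fixed batch size
# (adjust the device index and paths to your system).
dorado basecaller sup,m6A,pseU ./pod5_dir/ -b 320 --device "cuda:0" > calls.bam
```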

caballero · Jun 17 '24

I had the same issue for a single dataset; using the `-b 320` option seemed to fix it.

brambloemen · Jun 28 '24

Same issue here, and likewise `-b 320` sorted it.

JadeFor · Jul 01 '24

Dorado 0.8.0 contains improvements to the automatic batch size calculation and to how memory is allocated when using modbase models. Further improvements will follow as we continue to work on the stability of the software.

As many contributors have shared, the general solution for GPU memory issues is to reduce the `--batchsize` to limit the amount of memory allocated on the GPU. However, if these issues persist, please re-open this issue or create a new one.
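
For completeness, a minimal example of that advice (a sketch only; the input and output paths are placeholders, and 320 is simply the value reported to work earlier in this thread):

```bash
# Cap the batch size explicitly instead of relying on automatic selection;
# lower the value further if the illegal memory access persists.
dorado basecaller sup,m6A,pseU ./pod5_dir/ --batchsize 320 --device "cuda:0,cuda:1" > calls.bam
```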

Kind regards, Rich

HalfPhoton · Sep 17 '24