dorado 0.7.0 RNA004 modbase calling results in CUDA error
Issue Report
Please describe the issue:
Hi, I've been using Dorado for some time and recently tried the updated v5 models with Dorado v0.7.0 for RNA004 basecalling.
When I basecall reads with the new RNA004 SUP model and combined m6A and pseU modified-base calling, Dorado aborts with a CUDA error ("an illegal memory access was encountered").
The issue does not occur if I run SUP basecalling without also requesting the m6A and pseU modification calls.
Steps to reproduce the issue:
I use the command:
./dorado basecaller -v sup,m6A,pseU $data_dir --estimate-poly-a > ${wd}/${name}_dorado_polyA_m6A_unmapped.bam
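For completeness, the wrapper around that call is essentially the sketch below; the pod5 directory matches the path shown in the logs, while the output directory and sample name are placeholders for my local setup. As noted above, the same run completes if I drop the m6A,pseU part of the model spec.

#!/bin/bash
# Placeholder inputs -- adjust to local paths when reproducing
data_dir=/data/20240328_2254_P2S-00909-B_PAU42445_0a472f31/pod5/   # POD5 directory from the logs
wd=/scratch/rna004_out                                             # output directory (placeholder)
name=PAU42445_0a472f31                                             # sample name (placeholder)

# Failing invocation: SUP basecalling with combined m6A and pseU modbase
# calling plus poly(A) tail estimation
./dorado basecaller -v sup,m6A,pseU "$data_dir" --estimate-poly-a \
    > "${wd}/${name}_dorado_polyA_m6A_unmapped.bam"

# Dropping the modification models avoids the crash (simplex SUP only)
./dorado basecaller -v sup "$data_dir" --estimate-poly-a \
    > "${wd}/${name}_dorado_polyA_unmapped.bam"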
Run environment:
- Dorado version: v0.7.0
- Dorado command:
./dorado basecaller -v sup,m6A,pseU $data_dir --estimate-poly-a > ${wd}/${name}_dorado_polyA_m6A_unmapped.bam
- Operating system: RHEL 8.10 Linux x86_64
- Hardware (CPUs, Memory, GPUs): 4 x Tesla V100 32GB GPUs
- Source data type: POD5
- Source data location: Networked Lustre filesystem
- Details about data: ~3 million RNA004 reads, generated on a PromethION
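If GPU driver or CUDA runtime details would help with triage, I can attach them; this is the query I would run to capture them (standard nvidia-smi fields, nothing Dorado-specific):

# Capture per-GPU model, driver version, and total memory for the report
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv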
Logs
(base) [user@gadi-gpu-v100-0100 bin]$ ./dorado basecaller -v sup,m6A,pseU $data_dir --estimate-poly-a > ${wd}/${name}_dorado_polyA_m6A_unmapped.bam
[2024-05-25 13:39:37.825] [info] Running: "basecaller" "-v" "sup,m6A,pseU" "/data/20240328_2254_P2S-00909-B_PAU42445_0a472f31/pod5/" "--estimate-poly-a"
[2024-05-25 13:39:38.839] [debug] Found existing simplex model: [email protected]
[2024-05-25 13:39:38.839] [debug] Found existing modification model: [email protected]_m6A@v1
[2024-05-25 13:39:38.839] [debug] Found existing modification model: [email protected]_pseU@v1
[2024-05-25 13:39:38.855] [info] Normalised: overlap 500 -> 492
[2024-05-25 13:39:38.855] [info] > Creating basecall pipeline
[2024-05-25 13:39:38.855] [debug] CRFModelConfig { qscale:1.200000 qbias:2.000000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:4000 mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:80.875900 stdev:17.269760}} BasecallerParams { chunk_size:18432 overlap:492 batch_size:0} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-05-25 13:39:39.694] [info] - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2024-05-25 13:39:44.394] [debug] cuda:2 memory available: 33.33GB
[2024-05-25 13:39:44.394] [debug] cuda:2 memory limit 32.33GB
... trimmed for brevity ...
[2024-05-25 13:40:03.016] [debug] Load reads from file /data/20240328_2254_P2S-00909-B_PAU42445_0a472f31/pod5/PAU42445_0a472f31_ca3b749f_268.pod5
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14aa8c3039b7 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14aa85888115 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14aa8c2cd958 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897b516 (0x14aa8a246516 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x14aa8c2e0de2 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: ./dorado() [0xa59018]
frame #6: <unknown function> + 0x1196e380 (0x14aa93239380 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x81ca (0x14aa810691ca in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x14aa7fa2de73 in /lib64/libc.so.6)
Aborted
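As a next step I can follow the suggestion in the error message and rerun with synchronous kernel launches, restricted to a single GPU, to try to get a more accurate stack trace. A sketch of that debugging invocation (I'm assuming Dorado's --device flag with cuda:0 is the right way to pin the run to one V100; please correct me if there is a better approach):

# Synchronous launches so the failing CUDA call is reported at the correct frame,
# and a single GPU to keep the trace simple
CUDA_LAUNCH_BLOCKING=1 ./dorado basecaller -v sup,m6A,pseU "$data_dir" \
    --estimate-poly-a --device cuda:0 \
    > "${wd}/${name}_dorado_polyA_m6A_debug.bam"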