dorado 0.7.0 RNA004 modbase calling results in CUDA error
Issue Report
Please describe the issue:
Hi, I've been using Dorado for some time and recently tried the updated v5 models with Dorado v0.7.0 for RNA004 basecalling.
When I basecall reads with the new RNA004 SUP model and combined m6A and pseU modified-base calling, Dorado aborts with a CUDA error ("an illegal memory access was encountered").
The issue does not occur if I run SUP basecalling without also requesting the m6A and pseU modification calls.
Steps to reproduce the issue:
I use the command:
./dorado basecaller -v sup,m6A,pseU $data_dir --estimate-poly-a > ${wd}/${name}_dorado_polyA_m6A_unmapped.bam
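For completeness, the wrapper around that call is essentially the sketch below; the pod5 directory matches the path shown in the logs, while the output directory and sample name are placeholders for my local setup. As noted above, the same run completes if I drop the m6A,pseU part of the model spec.

#!/bin/bash
# Placeholder inputs -- adjust to local paths when reproducing
data_dir=/data/20240328_2254_P2S-00909-B_PAU42445_0a472f31/pod5/   # POD5 directory from the logs
wd=/scratch/rna004_out                                             # output directory (placeholder)
name=PAU42445_0a472f31                                             # sample name (placeholder)

# Failing invocation: SUP basecalling with combined m6A and pseU modbase
# calling plus poly(A) tail estimation
./dorado basecaller -v sup,m6A,pseU "$data_dir" --estimate-poly-a \
    > "${wd}/${name}_dorado_polyA_m6A_unmapped.bam"

# Dropping the modification models avoids the crash (simplex SUP only)
./dorado basecaller -v sup "$data_dir" --estimate-poly-a \
    > "${wd}/${name}_dorado_polyA_unmapped.bam"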
Run environment:
- Dorado version: v0.7.0
- Dorado command:
./dorado basecaller -v sup,m6A,pseU $data_dir --estimate-poly-a > ${wd}/${name}_dorado_polyA_m6A_unmapped.bam
- Operating system: RHEL 8.10 Linux x86_64
- Hardware (CPUs, Memory, GPUs): 4 x Tesla V100 32GB GPUs
- Source data type: POD5
- Source data location: Networked Lustre filesystem
- Details about data: ~3 million RNA004 reads, generated on a PromethION
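If GPU driver or CUDA runtime details would help with triage, I can attach them; this is the query I would run to capture them (standard nvidia-smi fields, nothing Dorado-specific):

# Capture per-GPU model, driver version, and total memory for the report
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv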
Logs
(base) [user@gadi-gpu-v100-0100 bin]$ ./dorado basecaller -v sup,m6A,pseU $data_dir --estimate-poly-a > ${wd}/${name}_dorado_polyA_m6A_unmapped.bam
[2024-05-25 13:39:37.825] [info] Running: "basecaller" "-v" "sup,m6A,pseU" "/data/20240328_2254_P2S-00909-B_PAU42445_0a472f31/pod5/" "--estimate-poly-a"
[2024-05-25 13:39:38.839] [debug] Found existing simplex model: [email protected]
[2024-05-25 13:39:38.839] [debug] Found existing modification model: [email protected]_m6A@v1
[2024-05-25 13:39:38.839] [debug] Found existing modification model: [email protected]_pseU@v1
[2024-05-25 13:39:38.855] [info] Normalised: overlap 500 -> 492
[2024-05-25 13:39:38.855] [info] > Creating basecall pipeline
[2024-05-25 13:39:38.855] [debug] CRFModelConfig { qscale:1.200000 qbias:2.000000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:4000 mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:80.875900 stdev:17.269760}} BasecallerParams { chunk_size:18432 overlap:492 batch_size:0} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-05-25 13:39:39.694] [info] - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2024-05-25 13:39:44.394] [debug] cuda:2 memory available: 33.33GB
[2024-05-25 13:39:44.394] [debug] cuda:2 memory limit 32.33GB
... trimmed for brevity ...
[2024-05-25 13:40:03.016] [debug] Load reads from file /data/20240328_2254_P2S-00909-B_PAU42445_0a472f31/pod5/PAU42445_0a472f31_ca3b749f_268.pod5
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14aa8c3039b7 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14aa85888115 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14aa8c2cd958 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897b516 (0x14aa8a246516 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x14aa8c2e0de2 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: ./dorado() [0xa59018]
frame #6: <unknown function> + 0x1196e380 (0x14aa93239380 in /g/data/lf10/as7425/apps/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x81ca (0x14aa810691ca in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x14aa7fa2de73 in /lib64/libc.so.6)
Aborted
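As a next step I can follow the suggestion in the error message and rerun with synchronous kernel launches, restricted to a single GPU, to try to get a more accurate stack trace. A sketch of that debugging invocation (I'm assuming Dorado's --device flag with cuda:0 is the right way to pin the run to one V100; please correct me if there is a better approach):

# Synchronous launches so the failing CUDA call is reported at the correct frame,
# and a single GPU to keep the trace simple
CUDA_LAUNCH_BLOCKING=1 ./dorado basecaller -v sup,m6A,pseU "$data_dir" \
    --estimate-poly-a --device cuda:0 \
    > "${wd}/${name}_dorado_polyA_m6A_debug.bam"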