dorado duplex: CUDA error `c10_cuda_check_implementation`
Issue Report
Please describe the issue:
I am trying to run dorado duplex using the recommended approach of splitting the pod5 into individual channels.
The error occurs while running dorado duplex on a single-channel .pod5. I use a 4 kHz HAC model and also want to call modifications.
All steps are run as part of a Snakemake pipeline. The first job runs fine and produces a BAM file, but it logs an error and throws soft-lockup kernel errors. All subsequent jobs, using the same run command and profiles, stop with a CUDA error. The second CUDA error seems similar to the one reported by other people here (#849, #842). I don't know whether the first and second errors are related.
Steps to reproduce the issue:
Run multiple dorado duplex basecalling jobs on per-channel .pod5 files.
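For context, the per-channel .pod5 files were produced along the lines of the pod5 subsetting workflow recommended for duplex calling (a sketch; input and output paths are placeholders):
pod5 view input.pod5 --include "read_id, channel" --output channel_summary.tsv
pod5 subset input.pod5 --summary channel_summary.tsv --columns channel --output split_by_channel/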
Run environment:
- Dorado version: 0.7.0
- Dorado command:
dorado-0.7.0-linux-x64/bin/dorado duplex \
--device 'cuda:all' --min-qscore 9 --verbose \
--modified-bases-models bin/dorado_models/[email protected]_5mCG_5hmCG@v2 \
bin/dorado_models/[email protected] \
/tmp/dorado_duplex.TEST_TUE_01/channel-1199.pod5 \
> /tmp/dorado_duplex.TEST_TUE_01/channel-1199.bam \
2> logs/dorado_duplex.TEST_TUE_01.channel-1199.log
- Operating system: Ubuntu 20.04.6 LTS
- Hardware (CPUs, Memory, GPUs): 2x H100, 1.5 TB RAM, 192 cores
- Source data type: POD5 split to individual channels
- Source data location (on device or networked drive - NFS, etc.): local /tmp folder
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): if requested
Logs
The first job in the pipeline reports the following error but still produces a BAM file:
[2024-05-29 09:20:36.874] [info] > Starting Stereo Duplex pipeline
[2024-05-29 09:20:36.877] [debug] BasecallerNode chunk size 10000
[2024-05-29 09:20:36.895] [debug] Using dorado duplex channel count of 10
[2024-05-29 09:20:36.910] [debug] BasecallerNode chunk size 10000
[2024-05-29 09:20:36.921] [info] > Reading read channel info
[2024-05-29 09:20:36.922] [info] > Processed read channel info
[2024-05-29 09:20:36.924] [debug] Sort channel 2216
[2024-05-29 09:20:36.924] [debug] Sorted channel 2216
[2024-05-29 09:20:40.262] [error] basic_string::substr: __pos (which is 21065) > this->size() (which is 1604)
[2024-05-29 09:20:40.262] [error] basic_string::substr: __pos (which is 11643) > this->size() (which is 9568)
[2024-05-29 09:20:40.347] [info] > Simplex reads basecalled: 328
[2024-05-29 09:20:40.347] [info] > Simplex reads filtered: 57
[2024-05-29 09:20:40.347] [info] > Duplex reads basecalled: 35
[2024-05-29 09:20:40.347] [info] > Duplex rate: 21.222548%
[2024-05-29 09:20:40.347] [info] > Basecalled @ Bases/s: 1.039055e+05
Along with the following kernel errors:
Message from syslogd@SRV010 at May 29 09:21:21 ...
kernel:[167914.357717] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [dorado:2186819]
Message from syslogd@SRV010 at May 29 09:21:49 ...
kernel:[167942.357288] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [dorado:2186819]
Message from syslogd@SRV010 at May 29 09:22:13 ...
kernel:[167966.352921] watchdog: BUG: soft lockup - CPU#115 stuck for 22s! [dorado:2186819]
All downstream jobs then fail with the following errors:
[2024-05-29 09:25:43.999] [debug] cuda:1 Decode memory 5.35GB
[2024-05-29 09:25:45.392] [error] finalise() not called on a HtsFile.
[2024-05-29 09:25:45.399] [error] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8d113db9b7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d0a960115 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d113a5958 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897dd13 (0x7f8d0f320d13 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x4530bc1 (0x7f8d0aed3bc1 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x14 (0x7f8d0aecd604 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::detail::empty_cpu(c10::ArrayRef<long>, c10::ScalarType, bool, c10::optional<c10::MemoryFormat>) + 0x40 (0x7f8d0aecd650 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: at::detail::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x34 (0x7f8d0aecd6a4 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::native::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1f (0x7f8d0b5c8e0f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: <unknown function> + 0x5703e31 (0x7f8d0c0a6e31 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x5703e8b (0x7f8d0c0a6e8b in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe7 (0x7f8d0bd396e7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c718f (0x7f8d0c06a18f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1b2 (0x7f8d0bd79922 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x8ab83f]
frame #15: bin/dorado-0.7.0-linux-x64/bin/dorado() [0xa2ec3f]
frame #16: bin/dorado-0.7.0-linux-x64/bin/dorado() [0xa2d4a6]
frame #17: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x86bdfd]
frame #18: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x829ee3]
frame #19: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x4ccd9d]
frame #20: __libc_start_main + 0xf3 (0x7f8d0589b083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x7f7b57]
Hi @caspargross,
Your duplex command is malformed:
dorado duplex ... --modified-bases-models <mods_model> <simplex_model> <data>
This arrangement results in the simplex model being consumed by the --modified-bases-models argument, which accepts multiple inputs.
To resolve this, place the simplex model and data before the optional arguments:
dorado duplex <simplex_model> <data> ... --modified-bases-models <mods_model>
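Applied to the original invocation above, that would look roughly like this (same paths and models, just reordered):
dorado duplex \
bin/dorado_models/[email protected] \
/tmp/dorado_duplex.TEST_TUE_01/channel-1199.pod5 \
--device 'cuda:all' --min-qscore 9 --verbose \
--modified-bases-models bin/dorado_models/[email protected]_5mCG_5hmCG@v2 \
> /tmp/dorado_duplex.TEST_TUE_01/channel-1199.bam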
As for the substr error, we'll take a look at this internally as it has come up in a couple of places.
Kind regards, Rich
Hi @HalfPhoton,
Thanks for your help. I changed the command as you suggested, but it did not solve the issue. However, I found that basecalling works fine when using only one GPU. To me it seems there is a problem with CUDA resource allocation or deallocation when running duplex basecalling on multiple devices. For me, the issue can be worked around by parallelizing multiple jobs, each using only a single CUDA device (see the sketch below). Best regards, Caspar
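A minimal sketch of that workaround, assuming one job pinned to each physical GPU (paths and file names are placeholders):
CUDA_VISIBLE_DEVICES=0 dorado duplex <simplex_model> channels_part1.pod5 --device cuda:0 > part1.bam &
CUDA_VISIBLE_DEVICES=1 dorado duplex <simplex_model> channels_part2.pod5 --device cuda:0 > part2.bam &
wait
With CUDA_VISIBLE_DEVICES set, each process sees only one GPU and addresses it as cuda:0.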
I am reopening this issue because the problem is still not solved. I have tried the following:
- Use only one GPU
- Use dorado v0.7.1
- Lower the chunk size: this helped for some files, but approx. 20% still fail with chunk sizes as low as 5000
- Lower the batch size: no change, although observed memory usage goes down
- Disable the additional methylation model in duplex calling
Here is one of the failing pod5 files: https://cloud.imgag.de/s/bgC6CPg7ciTf9df
Logfile:
[2024-06-06 15:49:52.953] [info] Running: "duplex" "bin/dorado_models/[email protected]" "/tmp/dorado_duplex.HG001_TUE_01/channel_mod-65.pod5" "--device" "cuda:1" "--min-qscore" "9" "--verbose" "--batchsize" "5000" "--chunksize" "9216"
[2024-06-06 15:49:52.988] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-06 15:49:52.989] [debug] > Reads to process: 255715
[2024-06-06 15:49:53.246] [info] Normalised: chunksize 9216 -> 9215
[2024-06-06 15:49:53.247] [info] Normalised: chunksize 9216 -> 9215
[2024-06-06 15:49:53.433] [debug] cuda:1 memory available: 84.15GB
[2024-06-06 15:49:53.433] [debug] cuda:1 memory limit 74.74GB
[2024-06-06 15:49:53.433] [debug] cuda:1 maximum safe estimated batch size at chunk size 9215 is 11648
[2024-06-06 15:49:53.433] [info] cuda:1 using chunk size 9215, batch size 5056
[2024-06-06 15:49:53.433] [debug] cuda:1 Model memory 21.54GB
[2024-06-06 15:49:53.433] [debug] cuda:1 Decode memory 10.83GB
[2024-06-06 15:49:54.170] [debug] cuda:1 memory available: 83.82GB
[2024-06-06 15:49:54.170] [debug] cuda:1 memory limit 40.91GB
[2024-06-06 15:49:54.170] [debug] cuda:1 maximum safe estimated batch size at chunk size 9215 is 13248
[2024-06-06 15:49:54.170] [info] cuda:1 using chunk size 9215, batch size 5056
[2024-06-06 15:49:54.170] [debug] cuda:1 Model memory 11.93GB
[2024-06-06 15:49:54.170] [debug] cuda:1 Decode memory 3.67GB
[2024-06-06 15:49:54.560] [error] finalise() not called on a HtsFile.
[2024-06-06 15:49:54.561] [error] CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe0837aa9b7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe07cd2f115 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fe083774958 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897dd13 (0x7fe0816efd13 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x4530bc1 (0x7fe07d2a2bc1 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x14 (0x7fe07d29c604 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::detail::empty_cpu(c10::ArrayRef<long>, c10::ScalarType, bool, c10::optional<c10::MemoryFormat>) + 0x40 (0x7fe07d29c650 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: at::detail::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x34 (0x7fe07d29c6a4 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::native::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1f (0x7fe07d997e0f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: <unknown function> + 0x5703e31 (0x7fe07e475e31 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x5703e8b (0x7fe07e475e8b in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe7 (0x7fe07e1086e7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c718f (0x7fe07e43918f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1b2 (0x7fe07e148922 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x8b2b7f]
frame #15: bin/dorado-0.7.1-linux-x64/bin/dorado() [0xa3c03f]
frame #16: bin/dorado-0.7.1-linux-x64/bin/dorado() [0xa3aac6]
frame #17: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x87492d]
frame #18: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x82739a]
frame #19: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x8310ab]
frame #20: <unknown function> + 0x114df (0x7fe0789144df in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x826569]
frame #22: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x837e00]
frame #23: <unknown function> + 0x1196e380 (0x7fe08a6e0380 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #24: <unknown function> + 0x8609 (0x7fe07890b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #25: clone + 0x43 (0x7fe077d65353 in /lib/x86_64-linux-gnu/libc.so.6)
Execution command:
CUDA_LAUNCH_BLOCKING=1 \
bin/dorado-0.7.1-linux-x64/bin/dorado duplex \
bin/dorado_models/[email protected] \
/tmp/dorado_duplex.HG001_TUE_01/channel_mod-65.pod5 \
--device 'cuda:1' \
--min-qscore 9 \
--verbose \
--batchsize 5000 \
--chunksize 9216 \
> /tmp/dorado_duplex.HG001_TUE_01/channel_mod-65.bam \
2> logs/dorado_duplex.HG001_TUE_01.channel_mod-65.log
Update: it works with --chunksize 5000 and --batchsize 5000. However, I saw it mentioned somewhere that changing the batch size has an impact on the basecalling results.
I needed to reduce the chunk size further and further to basecall the last channel subsets. For the last file I had to use a chunk size of 2000 and a batch size of 5000. GPU memory usage is only at 16% with those settings.
Apologies for not replying sooner. I'm pleased to see your issue is resolved.
For future reference, we generally recommend reducing the batch size rather than the chunk size, as changing the batch size has no effect on the output basecalls.
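For illustration only (a sketch with hypothetical values, keeping the default chunk size and lowering only the batch size):
dorado duplex <simplex_model> <channel>.pod5 --device cuda:0 --min-qscore 9 --batchsize 256 > <channel>.bam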
Kind regards, Rich