dorado crashes when calling sup modifications 4mC_5mC 6mA on 5090
New issue checks
- [x] I have read the Dorado Documentation.
- [x] I did not find an existing issue.
Dorado version
1.1.1
Dorado subcommand
Basecaller
The issue
I have experienced a few crashes now, so here is the report. The basecaller starts up nicely and runs for quite a while before crashing. Nothing else is running on the computer.
[2025-09-26 13:50:19.756] [info] Running: "basecaller" "--recursive" "--device" "cuda:all" "sup" "./PBE71308/" "--modified-bases" "4mC_5mC" "6mA" "--resume-from" "PBE71308.dorado1.1.1.bm5.2.0_sup.sim.mod4mC_5mC_6mA.bam"
[2025-09-26 13:50:20.260] [info] - downloading [email protected] with httplib
[2025-09-26 13:50:22.560] [info] - downloading [email protected]_4mC_5mC@v1 with httplib
[2025-09-26 13:50:23.348] [info] - downloading [email protected]_6mA@v1 with httplib
[2025-09-26 13:50:24.185] [info] > Creating basecall pipeline
[2025-09-26 13:50:25.687] [info] Using CUDA devices:
[2025-09-26 13:50:25.687] [info] cuda:0 - NVIDIA GeForce RTX 5090
[2025-09-26 13:50:25.687] [info] cuda:1 - NVIDIA GeForce RTX 5090
[2025-09-26 13:50:27.018] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 5090" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-09-26 13:50:27.104] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 5090" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-09-26 13:50:28.452] [info] cuda:0 using chunk size 12288, batch size 224
[2025-09-26 13:50:28.452] [info] cuda:1 using chunk size 12288, batch size 288
[2025-09-26 13:50:28.603] [info] cuda:0 using chunk size 6144, batch size 448
[2025-09-26 13:50:28.638] [info] cuda:1 using chunk size 6144, batch size 288
[2025-09-26 13:50:28.848] [info] > Inspecting resume file...
[2025-09-26 13:50:28.854] [info] Resuming from file PBE71308.dorado1.1.1.bm5.2.0_sup.sim.mod4mC_5mC_6mA.bam...
[2025-09-26 14:48:59.831] [info] > 12231628 original read ids found in resume file.
Koi RMSNorm residual: failed to set smem size 184324h:16m:12s] Basecalling
[2025-09-27 00:51:55.231] [error] Koi tiled path failed 7
terminate called after throwing an instance of 'std::runtime_error'
  what():  Koi convolution (host_window_ntwc_f16) failed with in size 16
System specifications
Ubuntu 24, 2x RTX 5090, AMD Ryzen Threadripper 7960X 24-core, 192 GB RAM, NVMe SSD
Hi @Kirk3gaard, We've been unable to reproduce this error locally.
Could you share logs from the other failed attempts you've seen? We'd like to see if the order / content of the error messages are consistent between failures.
Kind regards, Rich
We are also seeing this issue on an A100 - I will ask my colleague Sonal to upload some details here, @sonalhenson.
Following @mattloose's comment, here's the error log I am getting (I've changed the paths). It dies very early in the process for us. This is on dorado v1.1.1.
[2025-10-15 11:28:16.058] [info] Running: "basecaller" "sup,4mC_5mC,6mA" "P1/20251002_1242_3G_PBE26247_042cc8cb" "--recursive" "--kit-name" "SQK-NBD114-24" "--reference" "ref.fasta" "-o" "outdir"
[2025-10-15 11:28:16.120] [info] - downloading [email protected] with httplib
[2025-10-15 11:28:23.712] [info] - downloading [email protected]_4mC_5mC@v1 with httplib
[2025-10-15 11:28:24.998] [info] - downloading [email protected]_6mA@v1 with httplib
[2025-10-15 11:28:26.533] [info] > Creating basecall pipeline
[2025-10-15 11:28:26.723] [info] Using CUDA devices:
[2025-10-15 11:28:26.723] [info] cuda:0 - NVIDIA A100 80GB PCIe MIG 1g.10gb
[2025-10-15 11:28:26.723] [info] cuda:1 - NVIDIA A100 80GB PCIe MIG 1g.10gb
[2025-10-15 11:28:26.804] [error] Koi convolution (host_window_ntwc_f16) failed with in size 16
I've tried varying pretty much every variable, and I've also tried dorado versions 1.1.0 and 1.0.2 as well as wf-basecaller with 4mC_5mC and 6mA (separately), and I get the same error every time.
Happy to share more information for troubleshooting if needed. Thanks
Hi @sonalhenson.
- Are you able to reproduce this error with different data or even a single read?
- Are you consistently seeing the error immediately?
- Does running on a single GPU give the same error?
- Are you able to share with us the very verbose logs using `-vv`?
- If you do consistently see the error immediately, can you create another set of verbose logs with the environment variable `CUDA_LAUNCH_BLOCKING=1` set (a sketch of such an invocation is shown after this list)? This will take a long time to run as it slows GPU performance significantly, and it might also prevent the error from happening. If you don't see anything after 10 minutes when you normally see the error immediately, just terminate dorado.
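For reference, a minimal sketch of such an invocation, reusing the command, paths, and kit name from the log above purely as placeholders (the log filename is hypothetical; substitute your own inputs):

```bash
# Sketch only: adjust the input directory, kit name, reference, and output path to your run.
# -vv enables very verbose logging; CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches
# so the failing CUDA call is reported where it actually occurs. Expect this to run much slower.
CUDA_LAUNCH_BLOCKING=1 dorado basecaller -vv \
    sup,4mC_5mC,6mA \
    P1/20251002_1242_3G_PBE26247_042cc8cb \
    --recursive \
    --kit-name SQK-NBD114-24 \
    --reference ref.fasta \
    -o outdir \
    2> dorado_verbose.log
```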
Could you also share your system specifications e.g. OS / architecture.
Kind regards, Rich
@HalfPhoton Thanks for taking the time to look into this and getting back to me so quickly. I am testing the command on our A100 tower and it seems to be running without issues for now. I was getting the error when running it on our HPC. I'll try a different GPU queue on the HPC in case it's the architecture of the other queue that's causing the issue.
@HalfPhoton just to update you that the error was resolved on a different HPC GPU queue.
Thanks for the update @sonalhenson,
If you're still able to use the problematic GPU queue, could you please generate the verbose logs so that we can gather more details for our investigation? It would be very much appreciated as we've been unable to replicate the issue locally.
Kind regards, Rich
@HalfPhoton I'm getting a different error now on the problematic GPU queue. See attached verbose log as requested.
I got this same new error from the GPU queue where basecalling had succeeded, but that queue managed to get past it.
@sonalhenson, this error relates to not being able to connect to the ONT CDN to download models.
[2025-10-16 14:19:57.123] [info] - downloading [email protected] with httplib
[2025-10-16 14:19:57.199] [error] Failed to download [email protected]: Could not establish connection
This is a separate issue and can be resolved by first downloading models and using the --models-directory argument. More documentation here.
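For example, something along these lines on a node with internet access (a sketch only: the model names are the ones from the log above, the directory path is hypothetical, and the exact flag accepted by `dorado download` may differ between versions, so please check `dorado download --help`):

```bash
# Sketch: pre-download the models where there is network access,
# then point the basecaller at that local directory on the offline queue.
MODELS=/path/to/dorado_models   # hypothetical local models directory
mkdir -p "$MODELS"
dorado download --model [email protected] --models-directory "$MODELS"
dorado download --model [email protected]_4mC_5mC@v1 --models-directory "$MODELS"
dorado download --model [email protected]_6mA@v1 --models-directory "$MODELS"

# On the compute node, resolve the model complex from the local directory instead of the CDN:
dorado basecaller sup,4mC_5mC,6mA <pod5_dir> --models-directory "$MODELS" -o outdir
```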
Any updates on this issue?
Apologies for not being clearer in my previous post; the issue was resolved by changing to a different GPU queue.