
dorado crashes when calling sup modifications 4mC_5mC 6mA on 5090

Open Kirk3gaard opened this issue 3 months ago • 11 comments


Dorado version

1.1.1

Dorado subcommand

Basecaller

The issue

I have experienced a few crashes now, so here is the report. The basecaller starts up nicely and runs for quite a while before crashing. Nothing else is running on the computer.

[2025-09-26 13:50:19.756] [info] Running: "basecaller" "--recursive" "--device" "cuda:all" "sup" "./PBE71308/" "--modified-bases" "4mC_5mC" "6mA" "--resume-from" "PBE71308.dorado1.1.1.bm5.2.0_sup.sim.mod4mC_5mC_6mA.bam"
[2025-09-26 13:50:20.260] [info] - downloading [email protected] with httplib
[2025-09-26 13:50:22.560] [info] - downloading [email protected]_4mC_5mC@v1 with httplib
[2025-09-26 13:50:23.348] [info] - downloading [email protected]_6mA@v1 with httplib
[2025-09-26 13:50:24.185] [info] > Creating basecall pipeline
[2025-09-26 13:50:25.687] [info] Using CUDA devices:
[2025-09-26 13:50:25.687] [info] cuda:0 - NVIDIA GeForce RTX 5090
[2025-09-26 13:50:25.687] [info] cuda:1 - NVIDIA GeForce RTX 5090
[2025-09-26 13:50:27.018] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 5090" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-09-26 13:50:27.104] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 5090" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-09-26 13:50:28.452] [info] cuda:0 using chunk size 12288, batch size 224
[2025-09-26 13:50:28.452] [info] cuda:1 using chunk size 12288, batch size 288
[2025-09-26 13:50:28.603] [info] cuda:0 using chunk size 6144, batch size 448
[2025-09-26 13:50:28.638] [info] cuda:1 using chunk size 6144, batch size 288
[2025-09-26 13:50:28.848] [info] > Inspecting resume file...
[2025-09-26 13:50:28.854] [info] Resuming from file PBE71308.dorado1.1.1.bm5.2.0_sup.sim.mod4mC_5mC_6mA.bam...
[2025-09-26 14:48:59.831] [info] > 12231628 original read ids found in resume file.
Koi RMSNorm residual: failed to set smem size 18432
4h:16m:12s] Basecalling
[2025-09-27 00:51:55.231] [error] Koi tiled path failed 7
terminate called after throwing an instance of 'std::runtime_error'
  what():  Koi convolution (host_window_ntwc_f16) failed with in size 16

System specifications

Ubuntu 24, 2x RTX 5090, AMD Ryzen Threadripper 7960X (24 cores), 192 GB RAM, NVMe SSD

Kirk3gaard · Sep 28 '25

Hi @Kirk3gaard, We've been unable to reproduce this error locally.

Could you share logs from the other failed attempts you've seen? We'd like to see if the order / content of the error messages are consistent between failures.

Kind regards, Rich

HalfPhoton · Oct 02 '25

We are also seeing this issue on an A100 - I will ask my colleague Sonal to upload some details here, @sonalhenson.

mattloose · Oct 15 '25

Following @mattloose's comment, here's the error log I am getting (I've changed the paths). It dies very early in the process for us. This is on dorado v1.1.1.

[2025-10-15 11:28:16.058] [info] Running: "basecaller" "sup,4mC_5mC,6mA" "P1/20251002_1242_3G_PBE26247_042cc8cb" "--recursive" "--kit-name" "SQK-NBD114-24" "--reference" "ref.fasta" "-o" "outdir"
[2025-10-15 11:28:16.120] [info] - downloading [email protected] with httplib
[2025-10-15 11:28:23.712] [info] - downloading [email protected]_4mC_5mC@v1 with httplib
[2025-10-15 11:28:24.998] [info] - downloading [email protected]_6mA@v1 with httplib
[2025-10-15 11:28:26.533] [info] > Creating basecall pipeline
[2025-10-15 11:28:26.723] [info] Using CUDA devices:
[2025-10-15 11:28:26.723] [info] cuda:0 - NVIDIA A100 80GB PCIe MIG 1g.10gb
[2025-10-15 11:28:26.723] [info] cuda:1 - NVIDIA A100 80GB PCIe MIG 1g.10gb
[2025-10-15 11:28:26.804] [error] Koi convolution (host_window_ntwc_f16) failed with in size 16

I've tried varying pretty much all the variables, and I've tried dorado versions 1.1.0 and 1.0.2 as well as wf-basecaller with 4mC_5mC and 6mA (separately), and I get the same error every time.

Happy to share more information for troubleshooting if needed. Thanks

sonalhenson · Oct 15 '25

Hi @sonalhenson.

  • Are you able to reproduce this error with different data or even a single read?
  • Are you consistently seeing the error immediately?
  • Does running on a single GPU give the same error?
  • Are you able to share with us the very verbose logs using -vv?
  • If you do consistently see the error immediately, can you create another set of verbose logs with the environment variable CUDA_LAUNCH_BLOCKING=1 set? This will take a long time to run, as it slows GPU performance significantly, and it might also prevent the error from happening. If you don't see anything after 10 minutes, when you would normally see the error immediately, just terminate dorado. (A sketch of how these logs could be captured is at the end of this comment.)

Could you also share your system specifications, e.g. OS / architecture?
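
For reference, here is a rough sketch of how those logs and details could be gathered. The paths and output filenames below are placeholders, and the basecaller arguments should be replaced with whatever your failing command was (e.g. the one shown in your log above):

    # System details for the report.
    uname -a > system_info.txt
    nvidia-smi >> system_info.txt
    dorado --version >> system_info.txt 2>&1

    # Re-run the failing command with very verbose logging (-vv), keeping stderr in a file.
    # CUDA_LAUNCH_BLOCKING=1 makes CUDA kernel launches synchronous, so the failing call is
    # reported where it actually happens; expect the run to be much slower.
    CUDA_LAUNCH_BLOCKING=1 dorado basecaller -vv sup,4mC_5mC,6mA /path/to/pod5_dir \
        --device cuda:0 \
        > calls.bam 2> dorado_verbose.log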

Kind regards, Rich

HalfPhoton · Oct 15 '25

@HalfPhoton Thanks for taking the time to look into this and getting back to me so quickly. I am testing the command on our A100 tower and it seems to be running without issues for now. I was getting the error when running it on our HPC. I'll try a different GPU queue on the HPC in case it's the architecture of the other queue that's causing the issue.

sonalhenson · Oct 15 '25

@HalfPhoton just to update you that the error was resolved on a different HPC GPU queue.

sonalhenson · Oct 16 '25

Thanks for the update @sonalhenson,

If you're still able to use the problematic GPU queue, could you please generate the verbose logs so that we can gather more details for our investigation? It would be very much appreciated as we've been unable to replicate the issue locally.

Kind regards, Rich

HalfPhoton · Oct 16 '25

@HalfPhoton I'm getting a different error now on the problematic GPU queue. See attached verbose log as requested.

I also got this same new error on the GPU queue where basecalling succeeded, but that queue managed to get past it.

dorado_4620611.err.txt

sonalhenson · Oct 16 '25

@sonalhenson, this error relates to not being able to connect to the ONT CDN to download models.

[2025-10-16 14:19:57.123] [info]  - downloading [email protected] with httplib
[2025-10-16 14:19:57.199] [error] Failed to download [email protected]: Could not establish connection

This is a separate issue and can be resolved by first downloading the models and then pointing dorado at them with the --models-directory argument. See the dorado documentation for more details.
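
As a rough sketch of that workflow (the model cache directory is a placeholder here, and the exact flags for the download subcommand can differ between dorado releases, so please check dorado download --help):

    # Placeholder location for a local model cache.
    MODELS_DIR=/path/to/dorado_models
    mkdir -p "$MODELS_DIR"

    # On a machine with internet access: list the available models, then download
    # the ones named in your basecaller log into the local directory.
    dorado download --list
    dorado download --model <model_name_from_log> --models-directory "$MODELS_DIR"

    # On the compute node: point the basecaller at the local copies so it does not
    # need to reach the ONT CDN.
    dorado basecaller sup,4mC_5mC,6mA /path/to/pod5_dir \
        --models-directory "$MODELS_DIR" \
        > calls.bam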

HalfPhoton · Oct 16 '25

Any updates on this issue?

HalfPhoton · Nov 06 '25

Apologies for not being clearer in my previous post; the issue was resolved by changing to a different GPU queue.

sonalhenson · Nov 06 '25