
basecalling failing with v5 model on some pod5s with 3090 GPU (24GB) but not A6000 GPU (48GB)

Open · mulderdt opened this issue

Issue Report

Please describe the issue:

Basecalling of HG002 pod5s with dorado v0.7.2, as part of wf-basecalling, terminates for some pod5s. This run had previously been basecalled successfully multiple times with dorado and wf-basecalling using v4 models. When I cancelled this run, 188/728 pod5s had been basecalled, with 27 failures.

Individual pod5s were submitted for basecalling on A6000 and 3090 GPUs on a Slurm cluster. Failures were only observed on 3090 GPUs, across multiple nodes, and were not associated with any particular device or node. Failed pod5s were not associated with any particular time in the run (e.g. failures include pod5 numbers 45, 194, 340, 478, 609, 611). Our pod5s contain ~15,000 reads each, judging by the resulting BAMs.

Checking the batch size that dorado was selecting, an example of a passing job:

[info] cuda:5 using chunk size 12288, batch size 480
[info] cuda:5 using chunk size 6144, batch size 480

and an example of a failing job:

[info] cuda:3 using chunk size 12288, batch size 256
[info] cuda:3 using chunk size 6144, batch size 480
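
For reference, a minimal way to collect these lines across all tasks (a sketch only; the work-directory path is illustrative and assumes the Nextflow work directories are still present):

grep -h "batch size" /projects/dmulder_prj/scratch/steve/nextflow/work/*/*/.command.log | sort | uniq -c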

I resolved the issue by manually setting the batch size parameter to 256.

This suggests to me that the automatic selection of an optimal batch size for the v5 model needs a tweak for the 3090 (24 GB) GPU.
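
For completeness, the equivalent override when calling dorado directly (a minimal sketch with placeholder model and data paths, using the same angle-bracket convention as the replies below; how wf-basecalling forwards the override is not shown here):

# Cap the batch size explicitly instead of relying on auto-selection.
# <v5_simplex_model> and <5mCG_5hmCG_model> are placeholders for the model directories.
dorado basecaller \
    /models/<v5_simplex_model> \
    /path/to/pod5_dir \
    --modified-bases-models /models/<5mCG_5hmCG_model> \
    --device cuda:0 \
    --batchsize 256 \
    > calls.bam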

Steps to reproduce the issue:

I ran wf-basecalling v1.3.0, which wraps dorado v0.7.2:

wf-basecalling

SAMPLENAME=HG002-hg38-v5
INPUT=/projects/sbs_primary1/promethion1/data/F134812/PAS57600/20231107_1024_5A_PAS57600_bc8ee63d/pod5
WORK=/projects/dmulder_prj/scratch/steve/nextflow/work
##
[email protected]
[email protected]_5mCG_5hmCG@v1
OUTPUT=/projects/dmulder_prj/scratch/steve/wf-basecalling/$SAMPLENAME
##
GENOME=/projects/alignment_references/Homo_sapiens/hg38_no_alt/genome/fasta/hg38_no_alt.fa
##
CONFIG=/projects/dmulder_prj/scratch/nextflow/production-workflows/configs/wf-basecalling.config
##export to fix libtorch issue with multiple libraries
export KMP_DUPLICATE_LIB_OK=TRUE
##set ulimit otherwise nextflow fails on the snvpileup
ulimit -u 100000
##
mkdir -p $OUTPUT
##
nextflow run epi2me-labs/wf-basecalling \
    -r v1.3.0 \
    -profile slurmgpu \
    -c $CONFIG \
    -w $WORK \
    --input $INPUT \
    --basecaller_cfg $MODEL \
    --remora_cfg $REMORA \
    --dorado_ext pod5 \
    --ref $GENOME \
    --sample_name $SAMPLENAME \
    --out_dir $OUTPUT \
    -resume

with config:

params {
    basecaller_chunk_size = 1
    cuda_device = "cuda:\${SLURM_JOB_GPUS}"
}


profiles {
    slurmgpu {
        singularity {
            enabled = true
            autoMounts = true
            envWhitelist = "SLURM_JOB_GPUS,KMP_DUPLICATE_LIB_OK"
        }
        process.executor = 'slurm'
        shell = ['/bin/bash', '-eo', 'pipefail']
        process."withLabel:gpu".queue = "gpu3090,gpuA6000"
        process."withLabel:gpu".memory = { 48.GB * task.attempt }
        process."withLabel:gpu".errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'ignore' }
        process."withLabel:gpu".maxRetries =3
        process."withLabel:gpu".clusterOptions = "--gres=gpu:1"
        process."withLabel:gpu".maxForks = null
        process."withLabel:gpu".containerOptions = "--nv"
        process."withLabel:wf_basecalling".queue = "cputesting"
        process."withLabel:wf_basecalling".memory = "64 GB"
        process."withLabel:wf_basecalling".cpus = 8
        process."withLabel:wf_common".queue = "cputesting"
        process."withLabel:wf_common".memory = "64 GB"
        process."withLabel:wf_common".cpus = 8
    }
}

Run environment:

  • Dorado version: 0.7.2
  • Dorado command: default from wf-basecalling
  • Operating system: Slurm GPU cluster running Ubuntu 18.04:
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Hardware (CPUs, Memory, GPUs): basecalling node: 8 cpus, 64GB memory, 1x 3090 GPU with 24GB memory or 1x A6000 GPU with 48GB memory.
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): networked
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 5 kHz run yielding around 100 Gb (~35X human genome coverage); N50 ~10 kb; pod5 footprint 2 TB
instrument=PC48B254
position=5A
flow_cell_id=PAS57600
sample_id=PAS57600
protocol_group_id=F134812
protocol=sequencing/sequencing_PRO114_DNA_e8_2_400K:FLO-PRO114M:SQK-LSK114-XL:400
protocol_run_id=bc8ee63d-5b88-45f2-adb5-fa15eb568da6
acquisition_run_id=3f415733bdb768fe6f02c05c359ff9341e981df5
started=2023-11-07T10:25:59.076080-08:00
acquisition_stopped=2023-11-11T10:26:01.328013-08:00
processing_stopped=2023-11-11T10:26:03.362334-08:00
basecalling_enabled=0
sequencing_summary_file=sequencing_summary_PAS57600_bc8ee63d_3f415733.txt
fast5_files_in_final_dest=0
fast5_files_in_fallback=0
pod5_files_in_final_dest=728
pod5_files_in_fallback=0
fastq_files_in_final_dest=0
fastq_files_in_fallback=0
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): we can share some failing pod5s including the following which failed: PAS57600_bc8ee63d_3f415733_45.pod5 PAS57600_bc8ee63d_3f415733_194.pod5 PAS57600_bc8ee63d_3f415733_303.pod5 PAS57600_bc8ee63d_3f415733_330.pod5 PAS57600_bc8ee63d_3f415733_340.pod5 PAS57600_bc8ee63d_3f415733_384.pod5 PAS57600_bc8ee63d_3f415733_392.pod5 PAS57600_bc8ee63d_3f415733_415.pod5 PAS57600_bc8ee63d_3f415733_459.pod5 PAS57600_bc8ee63d_3f415733_474.pod5 PAS57600_bc8ee63d_3f415733_475.pod5 PAS57600_bc8ee63d_3f415733_478.pod5 PAS57600_bc8ee63d_3f415733_514.pod5 PAS57600_bc8ee63d_3f415733_525.pod5 PAS57600_bc8ee63d_3f415733_550.pod5 PAS57600_bc8ee63d_3f415733_573.pod5 PAS57600_bc8ee63d_3f415733_597.pod5 PAS57600_bc8ee63d_3f415733_599.pod5 PAS57600_bc8ee63d_3f415733_608.pod5 PAS57600_bc8ee63d_3f415733_609.pod5 PAS57600_bc8ee63d_3f415733_611.pod5 PAS57600_bc8ee63d_3f415733_643.pod5 PAS57600_bc8ee63d_3f415733_650.pod5 PAS57600_bc8ee63d_3f415733_661.pod5 PAS57600_bc8ee63d_3f415733_669.pod5 PAS57600_bc8ee63d_3f415733_693.pod5 PAS57600_bc8ee63d_3f415733_720.pod5

Logs

from .command.log

INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /etc/localtime required more than 50 (83) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (399) bind mounts
[2024-09-03 17:14:49.773] [info] Running: "basecaller" "/models/[email protected]" "." "--modified-bases-models" "/models/[email protected]_5mCG_5hmCG@v1" "--device" "cuda:3"
[2024-09-03 17:14:49.842] [info] > Creating basecall pipeline
[2024-09-03 17:15:03.211] [info] cuda:3 using chunk size 12288, batch size 256
[2024-09-03 17:15:03.933] [info] cuda:3 using chunk size 6144, batch size 480
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14c2e616c9b7 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14c2df6f1115 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x14c2e603c79b in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0xb65 (0x14c2e603e2b5 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x14c2e6037be4 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x14c2e6041bf8 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x14c2e6049102 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x14c2e5d4bdd4 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x14c2e5d4be6d in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x14c2e0822951 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x14c2e379af33 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #11: dorado() [0xa77cca]
frame #12: dorado() [0xa7c9d8]
frame #13: dorado() [0xa6a2b0]
frame #14: dorado() [0xa6a418]
frame #15: dorado() [0xa66efb]
frame #16: <unknown function> + 0x1196e380 (0x14c2ed0a2380 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x8609 (0x14c2db2cd609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #18: clone + 0x43 (0x14c2da725353 in /lib/x86_64-linux-gnu/libc.so.6)

mulderdt avatar Sep 04 '24 18:09 mulderdt

Hi @mulderdt, we're working on tuning the automatic batch size calculation to deliver both stability and performance, which is challenging across many hardware variants.

You found the appropriate solution though, which is to tune the batch size yourself.

Does this mean your issue is resolved?

Best regards, Rich

HalfPhoton avatar Sep 09 '24 10:09 HalfPhoton

I have the same issue when running on only four POD5 files.

command:

dorado basecaller hac,5mCG_5hmCG \
    /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/ \
    -x auto --verbose -r \
    --reference /project/sli68423_1316/users/yang/reference/chm13v2.0/chm13v2.0.fa \
    --output-dir /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/dorado_calls_PT01

error message:

[2024-10-30 09:54:39.902] [info] Running: "basecaller" "hac,5mCG_5hmCG" "/project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/" "-x" "auto" "--verbose" "-r" "--reference" "/project/sli68423_1316/users/yang/reference/chm13v2.0/chm13v2.0.fa" "--output-dir" "/project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/dorado_calls_PT01"
[2024-10-30 09:54:43.515] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-10-30 09:54:43.517] [info]  - downloading [email protected] with httplib
[2024-10-30 09:54:44.514] [info]  - downloading [email protected]_5mCG_5hmCG@v2 with httplib
[2024-10-30 09:54:45.484] [info] Normalised: chunksize 10000 -> 9996
[2024-10-30 09:54:45.484] [info] Normalised: overlap 500 -> 498
[2024-10-30 09:54:45.484] [info] > Creating basecall pipeline
[2024-10-30 09:54:45.485] [debug] CRFModelConfig { qscale:1.050000 qbias:-0.600000 stride:6 bias:0 clamp:1 out_features:-1 state_len:4 outsize:1024 blank_score:2.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:9996 overlap:498 batch_size:0} convs: { 0: ConvParams { insize:1 size:16 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:16 size:16 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:16 size:384 winlen:19 stride:6 activation:tanh}} model_type: lstm { bias:0 outsize:1024 blank_score:2.000000 scale:1.000000}}
[2024-10-30 09:54:49.136] [debug] cuda:0 memory available: 41.59GB
[2024-10-30 09:54:49.136] [debug] cuda:0 memory limit 39.78GB
[2024-10-30 09:54:49.136] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 5952
[2024-10-30 09:54:49.136] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 11968
[2024-10-30 09:54:49.136] [debug] Auto batchsize cuda:0: testing up to 10240 in steps of 64
[2024-10-30 09:54:49.136] [debug] Largest batch size for cuda:0: 6336, time per chunk 0.041739 ms
[2024-10-30 09:54:49.136] [debug] Final batch size for cuda:0[0]: 5760
[2024-10-30 09:54:49.136] [debug] Final batch size for cuda:0[1]: 6336
[2024-10-30 09:54:49.136] [info] cuda:0 using chunk size 9996, batch size 5760
[2024-10-30 09:54:49.136] [debug] cuda:0 Model memory 27.02GB
[2024-10-30 09:54:49.136] [debug] cuda:0 Decode memory 11.15GB
[2024-10-30 09:54:50.524] [info] cuda:0 using chunk size 4998, batch size 6336
[2024-10-30 09:54:50.524] [debug] cuda:0 Model memory 14.86GB
[2024-10-30 09:54:50.524] [debug] cuda:0 Decode memory 6.13GB
[2024-10-30 09:55:57.031] [debug] Loaded index with 25 target seqs
[2024-10-30 09:56:00.069] [debug] BasecallerNode chunk size 9996
[2024-10-30 09:56:00.069] [debug] BasecallerNode chunk size 4998
[2024-10-30 09:56:00.097] [debug] Load reads from file /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/pod5_five/PAO20051_pass_f9438981_ac3ed4cb_1001.pod5
[2024-10-30 09:56:01.317] [debug] Load reads from file /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/pod5_five/PAO20051_pass_f9438981_ac3ed4cb_1000.pod5
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f468aa3e9b7 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4683fc3115 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x7f468a90e79b in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0xb65 (0x7f468a9102b5 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x7f468a909be4 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x7f468a913bf8 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x7f468a91b102 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x7f468a61ddd4 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x7f468a61de6d in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x7f46850f4951 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x7f468806cf33 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: dorado() [0xad007a]
frame #12: dorado() [0xad4398]
frame #13: dorado() [0xac0370]
frame #14: dorado() [0xac04d8]
frame #15: dorado() [0xabc963]
frame #16: <unknown function> + 0x1196e380 (0x7f4691974380 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x7ea5 (0x7f467f046ea5 in /lib64/libpthread.so.0)
frame #18: clone + 0x6d (0x7f467e2bbb0d in /lib64/libc.so.6)

Aborted

liuyang2006 avatar Oct 30 '24 17:10 liuyang2006

@liuyang2006, reduce the batch size.

The auto batch size selection used 6336; try 5120:

dorado basecaller <model> <reads> --batchsize 5120 ... > calls.bam
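
If it helps, a hedged way to preview the selected batch sizes on a tiny subset before committing to a full run (assumes --max-reads and --verbose behave as in the log above; dorado writes its log to stderr and the calls to stdout):

# Run a single read to trigger pipeline setup, then inspect the selected batch sizes.
dorado basecaller hac,5mCG_5hmCG /path/to/pod5_dir -x cuda:0 --verbose --max-reads 1 > /dev/null 2> dorado_batch_check.log
grep -i "batch size" dorado_batch_check.log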

HalfPhoton avatar Oct 31 '24 10:10 HalfPhoton

May I ask why 5120 specifically? Knowing the reasoning would be helpful in case other batch sizes are chosen automatically and the program crashes again.

Thanks a lot :)

claczny avatar Nov 22 '24 13:11 claczny

@claczny, it's just about 10% less, to reduce GPU memory usage.
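
As a rough illustration of that rule of thumb (a sketch only: the 6336 comes from the auto-batchsize debug log above, the multiple-of-64 rounding mirrors the "steps of 64" in that log, and how much headroom to leave is a judgment call; 5120 is simply a more conservative round figure):

# Derive a manual --batchsize from the auto-selected value.
AUTO=6336                         # auto-selected batch size reported by dorado
REDUCED=$(( AUTO * 90 / 100 ))    # shave roughly 10%; cut more if it still fails
MANUAL=$(( REDUCED / 64 * 64 ))   # round down to a multiple of 64 -> 5696 here
echo "dorado basecaller <model> <reads> --batchsize ${MANUAL} > calls.bam"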

HalfPhoton avatar Nov 23 '24 20:11 HalfPhoton