Dorado 0.8.0 lost lots of reads after rebasecalling

Open SimonChen1997 opened this issue 1 year ago • 2 comments

Issue Report

Please describe the issue:

The output fastq should contain over 500M bases, which it did with Dorado 0.6.0. With Dorado 0.8.0, however, the largest fastq file contained only 2M bases.
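For anyone comparing versions, the base totals can be checked with a one-liner along these lines. This is a minimal sketch that assumes standard four-line FASTQ records; `count_bases` is a hypothetical helper name, not part of Dorado.

```shell
# Sum the lengths of the sequence lines (every 4th line, starting at line 2).
# Assumes uncompressed FASTQ with exactly four lines per record.
count_bases() {
    awk 'NR % 4 == 2 { n += length($0) } END { print n + 0 }' "$@"
}
```

Calling `count_bases` on one demultiplexed fastq file prints its total base count, which makes the 0.6.0 and 0.8.0 outputs directly comparable.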

Steps to reproduce the issue:

$dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam

$dorado demux --output-dir $output_scratch_demultiplex --kit-name SQK-NBD114-24 $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam

Run environment:

  • Dorado version: 0.8.0
  • Dorado command:

$dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam

$dorado demux --output-dir $output_scratch_demultiplex --kit-name SQK-NBD114-24 $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam

  • Operating system: Linux
  • Hardware (CPUs, Memory, GPUs): NVIDIA H100 PCIe
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): SQK-NBD114-24

Logs

[2024-09-28 07:11:16.955] [info] Running: "basecaller" "--recursive" "/scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/[email protected]" "/scratch/project/genoepic_rumen/ecoli_dna_methyl/pod5" "--modified-bases" "6mA" "--kit-name" "SQK-NBD114-24"
[2024-09-28 07:11:17.807] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar SSL_CERT_FILE to specify the location manually.
[2024-09-28 07:11:17.813] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-09-28 07:11:17.877] [error] Failed to download [email protected]_6mA@v2: SSL server verification failed
[2024-09-28 07:11:17.877] [info] - downloading [email protected]_6mA@v2 with curl
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
23 18.4M 23 4375k 0 0 71.6M 0 --:--:-- --:--:-- --:--:-- 71.2M
100 18.4M 100 18.4M 0 0 170M 0 --:--:-- --:--:-- --:--:-- 169M
[2024-09-28 07:11:18.226] [info] > Creating basecall pipeline
[2024-09-28 07:12:07.562] [warning] Unable to find chunk benchmarks for GPU "NVIDIA H100 PCIe", model /scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-09-28 07:12:07.562] [warning] Unable to find chunk benchmarks for GPU "NVIDIA H100 PCIe", model /scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-09-28 07:12:08.922] [info] cuda:0 using chunk size 12288, batch size 96
[2024-09-28 07:12:08.922] [info] cuda:1 using chunk size 12288, batch size 96
[2024-09-28 07:12:09.008] [info] cuda:0 using chunk size 6144, batch size 96
[2024-09-28 07:12:09.013] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'std::runtime_error'
what(): Empty sequence and qstring provided for read id 39d5fcd5-ac11-48f5-acea-169a2736a9f0
/var/spool/slurmd/job10990506/slurm_script: line 33: 2837796 Aborted (core dumped) $dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
[2024-09-28 07:42:06.080] [info] Running: "demux" "--output-dir" "/scratch/project/genoepic_rumen/ecoli_dna_methyl_dorado_0_8/demultiplex_sup" "--kit-name" "SQK-NBD114-24" "/scratch/project/genoepic_rumen/ecoli_dna_methyl_dorado_0_8/bam_sup/ecoli_dna_exp_sta_6mA_sup.bam"
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[2024-09-28 07:42:06.119] [info] num input files: 1
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[2024-09-28 07:42:06.382] [info] > starting barcode demuxing

SimonChen1997 avatar Sep 28 '24 05:09 SimonChen1997

Hi @SimonChen1997, it looks like the original basecalling job crashed, which is why you have very little output.

terminate called after throwing an instance of 'std::runtime_error'
what(): Empty sequence and qstring provided for read id 39d5fcd5-ac11-48f5-acea-169a2736a9f0

It looks like you have a problematic read.

The demux job is also telling you there's something wrong with the basecalling output:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
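A truncated BAM like this can be caught before demultiplexing. The following is a minimal sketch, assuming `samtools` is installed and on the PATH; `check_bam` is a hypothetical helper name, and `samtools quickcheck` is used because it verifies that the file has a valid header and the BGZF end-of-file marker.

```shell
# Assumption: samtools is available; override with samtools=/path/to/samtools.
: "${samtools:=samtools}"

# Report whether a BAM file looks complete (valid header and EOF marker present).
check_bam() {
    if "$samtools" quickcheck "$1"; then
        echo "ok: $1"
    else
        echo "truncated or unreadable: $1" >&2
        return 1
    fi
}
```

Running `check_bam` on the basecaller output and starting `dorado demux` only when it succeeds would have stopped the pipeline at the failed basecalling step.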

Best regards, Rich

HalfPhoton avatar Sep 28 '24 15:09 HalfPhoton

Hi,

Thanks for your reply. However, all the pod5 files can be basecalled successfully with Dorado 0.6.0. Can I ask why 0.8.0 behaves differently?

Cheers, Ziming

SimonChen1997 avatar Sep 28 '24 15:09 SimonChen1997

This is presumably a variant on https://github.com/nanoporetech/dorado/issues/1020.

Also note: you are performing barcoding twice. You only need to specify --kit-name to either dorado basecaller or dorado demux, not both. Your current command will lead to many unclassified reads, as the barcodes will already have been trimmed in the first step. Since you are seeing this error, I suggest dropping --kit-name from the basecaller command (and possibly adding --no-trim), then letting dorado demux handle the barcoding and trimming.
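The suggestion above can be sketched as follows. This is an illustration rather than a command taken from the thread: the variable names mirror those in the original report, the default values are placeholders, and the `RUN=1` guard is a hypothetical convention so the script can be sourced without launching a job.

```shell
#!/bin/sh
# Sketch only: variable names follow the original report; defaults are placeholders.
: "${dorado:=dorado}"
: "${model:=/path/to/model}"
: "${input_scratch:=/path/to/pod5}"
: "${output_scratch_bam:=bam_sup}"
: "${output_scratch_demultiplex:=demultiplex_sup}"
bam="$output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam"

# Step 1: basecall without --kit-name; --no-trim leaves the barcodes in the reads.
basecall() {
    "$dorado" basecaller --recursive "$model" "$input_scratch" \
        --modified-bases 6mA --no-trim
}

# Step 2: barcode classification and trimming happen once, in demux.
demux() {
    "$dorado" demux --output-dir "$output_scratch_demultiplex" \
        --kit-name SQK-NBD114-24 "$bam"
}

# Hypothetical guard: only run when RUN=1 is set.
# demux starts only if basecalling exits cleanly, so a crash is not masked.
if [ "${RUN:-0}" = "1" ]; then
    basecall > "$bam" && demux
fi
```

Chaining the two steps with `&&` means a crashed basecalling run stops the script instead of passing a truncated BAM on to `dorado demux`.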

malton-ont avatar Sep 30 '24 08:09 malton-ont

Hi,

Thanks. I added --no-trim after I posted the issue, and it worked. However, version 0.6.0 worked well without the --no-trim flag. Anyway, thanks for your reply. 😊

SimonChen1997 avatar Oct 01 '24 03:10 SimonChen1997

This issue should be resolved in dorado 0.8.1, which has just been released.

malton-ont avatar Oct 04 '24 12:10 malton-ont