Missing or outdated BAM metadata in dorado demux output

Open danielcav opened this issue 6 months ago • 3 comments

Hello,

I'm currently implementing demultiplexing for duplex reads and need to use dorado demux instead of the built-in demultiplexing provided by dorado basecaller. While comparing the output BAM metadata for the same read—generated using dorado demux --kit-name ... versus dorado basecaller ... --kit-name ... —I noticed that although the final basecalled sequence is identical, some metadata tags differ.

In particular, the quality score (qs) and duration (du) appear not to be updated after trimming when using dorado demux. I understand that dorado demux doesn’t have access to the POD5 file, so it's expected that it can't recompute the read duration (du). However, the quality score (qs) should not differ for the same sequence. I also noticed that the ns:i and ts:i tags are missing in the BAM output from dorado demux.

Is this expected behavior, or could it indicate an inconsistency in how metadata is handled during post-basecalling demultiplexing?

  • Dorado version: 1.0.1+6af0d9a8
  • Operating system: Ubuntu 20.04.6 LTS
  • Source data type: pod5

I used a read.txt file containing a single read ID, a pod5 file containing that read, and the same kit (SQK-NBD114-24) in both cases. Demultiplexing during simplex basecalling:

dorado basecaller sup /data/ont_raw_data/SOL0029/pod5/PAY20342_b2d09f09_74745714_19.pod5 -l read.txt --kit-name SQK-NBD114-24 > temp.bam
samtools view temp.bam

[Image: screenshot of the samtools view output for temp.bam]

Post-basecalling demultiplexing (dorado demux):

dorado basecaller sup /data/ont_raw_data/SOL0028/pod5/PAY20342_b2d09f09_74745714_19.pod5 -l read.txt --no-trim > not_trimmed_read.bam
dorado demux --kit-name SQK-NBD114-24 not_trimmed_read.bam --output-dir demux_read
samtools view demux_read/b2d09f09-3944-4632-9227-dda5e7fb2a50_SQK-NBD114-24_barcode03.bam

[Image: screenshot of the samtools view output for the demultiplexed barcode03 BAM]
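For anyone who wants to reproduce the comparison without screenshots, here is a minimal sketch that diffs the optional tags of two SAM records (for example, one line from each `samtools view` output above). It assumes plain tab-separated SAM text; the helper names `parse_tags` and `diff_tags` are hypothetical and not part of dorado or samtools.

```python
def parse_tags(sam_line):
    """Return the optional tags of one SAM record as {tag: (type, value)}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # columns 12 onward are TAG:TYPE:VALUE
        tag, type_, value = field.split(":", 2)
        tags[tag] = (type_, value)
    return tags


def diff_tags(line_a, line_b):
    """Compare the tags of two SAM records for the same read."""
    a, b = parse_tags(line_a), parse_tags(line_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        # Tags present in both records whose type or value differs.
        "changed": {t: (a[t][1], b[t][1])
                    for t in sorted(set(a) & set(b)) if a[t] != b[t]},
    }
```

Passing the basecaller record as `line_a` and the demux record as `line_b` would list missing tags such as ns and ts under `only_in_a`, and differing tags such as qs and du under `changed`.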

danielcav avatar Jun 26 '25 14:06 danielcav

Hi @danielcav,

du is the duration of the full untrimmed read, so this should not be updated when trimmed. Recalculating ts and ns requires access to the move table to determine how many samples correspond to the bases removed. If you basecall with --emit-moves, this data will be available in the BAM file during demux and these values will be updated.
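To illustrate why the move table is needed, here is a rough sketch of mapping trimmed leading bases to a sample count. It assumes the `mv` tag layout in which the first value is the stride and each following 0/1 entry covers `stride` samples, with 1 marking the start of a new base. The function name `samples_for_leading_bases` is hypothetical, and this is not dorado's actual implementation.

```python
def samples_for_leading_bases(mv, n_bases):
    """Estimate how many signal samples the first n_bases bases span.

    mv is the decoded mv tag: [stride, move_0, move_1, ...].
    """
    stride, moves = mv[0], mv[1:]
    if n_bases <= 0:
        return 0
    seen = 0  # number of base starts encountered so far
    for i, move in enumerate(moves):
        if move == 1:
            if seen == n_bases:
                # This entry starts the first base that is kept.
                return i * stride
            seen += 1
    # Every base is trimmed: the whole move table is covered.
    return len(moves) * stride
```

With `mv = [5, 1, 0, 1, 1, 0, 0, 1]`, the bases start at move entries 0, 2, 3 and 6, so trimming the first two bases corresponds to 3 * 5 = 15 samples. Without the `mv` tag there is no way to derive this number from the sequence alone.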

You are correct that the qs tag does not appear to be updated. This looks like an oversight on our part. I'll raise this internally - thanks for pointing it out! We are currently in the process of standardising our output formats and metadata, so hopefully we'll catch any further inconsistencies as we go.

malton-ont avatar Jun 27 '25 08:06 malton-ont

Thank you for your prompt response!

I just had a quick follow-up: if the du tag isn’t meant to be updated, could you help me understand why we’re seeing two different values in the screenshots, for the same original read?

danielcav avatar Jun 27 '25 11:06 danielcav

Hi @danielcav,

Apologies, I think I've confused things.

du is in fact the duration of the read signal from the start, but it does account for signal that is trimmed from the end (whether that is the intention, I'll need to check). So du should correspond to ns / sample_rate, which you can see nicely in your first example. Since the second example above wasn't trimmed during basecalling, it has a larger du because no signal has been removed from the end. Without the move table, trimming during demux cannot reduce the duration, as it can't determine how bases map to signal. Possibly we should drop this value if we can't recalculate it and it is meant to exclude trimmed signal. Again, I'll raise this discussion with the team.
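The relationship du ≈ ns / sample_rate can be checked with a small sketch like the one below. The sample rate is not stored in the per-read tags discussed here, so it is passed in as a parameter; the function names and the tolerance are assumptions for illustration.

```python
import math


def expected_du(ns, sample_rate_hz):
    """Duration in seconds implied by a sample count at a given sample rate."""
    return ns / sample_rate_hz


def du_consistent(du, ns, sample_rate_hz, rel_tol=1e-3):
    """True if the reported du matches ns / sample_rate within rel_tol."""
    return math.isclose(du, expected_du(ns, sample_rate_hz), rel_tol=rel_tol)
```

For example, a hypothetical read with ns = 25000 at 5000 Hz implies a du of 5.0 seconds; a record where `du_consistent` returns False would suggest that one of the two tags was not updated after trimming.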

malton-ont avatar Jun 27 '25 14:06 malton-ont