Missing or outdated BAM metadata in dorado demux output
Hello,
I'm currently implementing demultiplexing for duplex reads and need to use dorado demux instead of the built-in demultiplexing provided by dorado basecaller. While comparing the output BAM metadata for the same read, generated with dorado demux --kit-name ... versus dorado basecaller ... --kit-name ..., I noticed that although the final basecalled sequence is identical, some metadata tags differ.
In particular, the quality score (qs) and duration (du) appear not to be updated after trimming when using dorado demux. I understand that dorado demux doesn’t have access to the POD5 file, so it's expected that it can't recompute the read duration (du). However, the quality score (qs) should not differ for the same sequence. I also noticed that the ns:i and ts:i tags are missing in the BAM output from dorado demux.
Is this expected behavior, or could it indicate an inconsistency in how metadata is handled during post-basecalling demultiplexing?
- Dorado version: 1.0.1+6af0d9a8
- Operating system: Ubuntu 20.04.6 LTS
- Source data type: pod5
For both runs I used a read.txt file containing a single read ID, a pod5 file that contains that read, and the same kit (SQK-NBD114-24). Demultiplexing during simplex basecalling:
dorado basecaller sup /data/ont_raw_data/SOL0029/pod5/PAY20342_b2d09f09_74745714_19.pod5 -l read.txt --kit-name SQK-NBD114-24 > temp.bam
samtools view temp.bam
Post-basecalling demultiplexing (dorado demux):
dorado basecaller sup /data/ont_raw_data/SOL0028/pod5/PAY20342_b2d09f09_74745714_19.pod5 -l read.txt --no-trim > not_trimmed_read.bam
dorado demux --kit-name SQK-NBD114-24 not_trimmed_read.bam --output-dir demux_read
samtools view demux_read/b2d09f09-3944-4632-9227-dda5e7fb2a50_SQK-NBD114-24_barcode03.bam
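To make the comparison between the two records reproducible without screenshots, the optional tags of both samtools view lines can be diffed. This is a minimal sketch in plain Python; the helper names parse_tags and diff_tags are hypothetical, and the two records below are made up for illustration, not taken from the real output.

```python
def parse_tags(sam_line):
    """Return the optional tags of one SAM record as {tag: (type, value)}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are the mandatory SAM fields
        tag, typ, value = field.split(":", 2)
        tags[tag] = (typ, value)
    return tags


def diff_tags(line_a, line_b):
    """Report tags that are missing from one record or differ between the two."""
    a, b = parse_tags(line_a), parse_tags(line_b)
    report = {}
    for tag in sorted(set(a) | set(b)):
        if a.get(tag) != b.get(tag):
            report[tag] = (a.get(tag), b.get(tag))
    return report


# Made-up records for illustration only; real input would be lines from `samtools view`.
inline = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tqs:f:12.5\tdu:f:1.2\tns:i:6000\tts:i:10"
demux = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tqs:f:11.0\tdu:f:1.5"

for tag, (left, right) in diff_tags(inline, demux).items():
    print(tag, left, right)
```

A value of None on one side means the tag is absent from that record, which is how the missing ns and ts tags would show up.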
Hi @danielcav,
du is the duration of the full untrimmed read, so this should not be updated when trimmed. Recalculating ts and ns requires access to the move table to determine how many samples correspond to the removed bases. If you basecall with --emit-moves, this data will be available in the BAM file during demux and these values will be updated.
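To illustrate why the move table is needed, here is a rough sketch of how trimmed bases can be mapped back to signal samples. It assumes the mv tag layout in which the first element is the stride and each following 0/1 entry covers one stride-sized block of samples, with 1 marking the start of a new base. The function name trimmed_samples and the example values are hypothetical, and this is not Dorado's actual implementation; how Dorado applies the resulting counts to ts and ns is not shown here.

```python
def trimmed_samples(mv, left_bases, right_bases):
    """Return (samples_at_start, samples_at_end) covered by the trimmed bases.

    mv is the move table as a list: [stride, m0, m1, ...] (assumed layout).
    """
    stride, moves = mv[0], mv[1:]
    # Block index at which each base starts.
    starts = [i for i, move in enumerate(moves) if move == 1]
    n_bases = len(starts)
    if left_bases < 0 or right_bases < 0 or left_bases + right_bases > n_bases:
        raise ValueError("cannot trim more bases than the read contains")

    # Blocks before the first kept base belong to the left trim.
    left_blocks = starts[left_bases] if left_bases < n_bases else len(moves)
    # Blocks from the first right-trimmed base to the end belong to the right trim.
    right_blocks = len(moves) - starts[n_bases - right_bases] if right_bases else 0
    return left_blocks * stride, right_blocks * stride


# Made-up move table: stride 6, four bases starting at blocks 0, 2, 3 and 6.
example_mv = [6, 1, 0, 1, 1, 0, 0, 1, 0]
print(trimmed_samples(example_mv, left_bases=1, right_bases=1))
```

Without the mv tag there is no way to recover these block boundaries from the sequence alone, which is why demux cannot adjust the sample-based tags when the moves were not emitted.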
You are correct that the qs tag does not appear to be updated. This looks like an oversight on our part. I'll raise this internally - thanks for pointing it out! We are currently in the process of standardising our output formats and metadata, so hopefully we'll catch any further inconsistencies as we go.
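As a rough illustration of why qs can be recomputed after trimming without the POD5 file, the sketch below derives a mean Q-score from per-base quality values alone. It assumes qs is a mean Q-score obtained by averaging per-base error probabilities; Dorado's exact calculation may differ in details (for example, which bases it includes), and the function name mean_qscore and the quality values are made up for illustration.

```python
import math


def mean_qscore(qualities):
    """Mean Q-score computed by averaging error probabilities (assumed definition)."""
    if not qualities:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in qualities) / len(qualities)
    return -10 * math.log10(mean_error)


# Made-up per-base Phred scores: two low-quality barcode bases, then the insert.
untrimmed = [5, 5, 30, 30, 30, 30]
trimmed = untrimmed[2:]

print(round(mean_qscore(untrimmed), 2))
print(round(mean_qscore(trimmed), 2))
```

Under this assumption the value depends only on the quality string that remains after trimming, so two records with identical sequence and qualities would be expected to carry the same qs.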
Thank you for your prompt response!
I just had a quick follow-up: if the du tag isn’t meant to be updated, could you help me understand why we’re seeing two different values in the screenshots, for the same original read?
Hi @danielcav,
Apologies, I think I've confused things.
du is in fact the duration of the read signal from the start, but it does account for signal that is trimmed from the end (whether that is the intention, I'll need to check). So du should correspond to ns / sample_rate (which you can see nicely in your first example). Since the second example above wasn't trimmed during basecalling, it has a larger du as no signal has been removed from the end. Without the move table, trimming during demux is unable to reduce the duration as it can't determine the correspondence between bases and signal. (Possibly we should drop this value if we can't recalculate it and it is meant to exclude trimmed signal. Again, I'll raise this with the team.)
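The relationship du = ns / sample_rate can be checked per record with a few lines of Python. This is only a sketch: the sample rate of 5000 Hz, the tolerance, and the tag values are assumptions for illustration, and the real sample rate should be taken from the run's own metadata.

```python
def expected_duration(ns, sample_rate_hz):
    """Duration in seconds implied by a sample count."""
    return ns / sample_rate_hz


def du_matches_ns(du, ns, sample_rate_hz, tolerance=1e-3):
    """True if the du tag agrees with ns / sample_rate within the tolerance."""
    return abs(du - expected_duration(ns, sample_rate_hz)) <= tolerance


SAMPLE_RATE_HZ = 5000  # assumed value, not taken from the thread

# Made-up tag values: the first pair is consistent, the second is not.
print(du_matches_ns(du=1.2, ns=6000, sample_rate_hz=SAMPLE_RATE_HZ))
print(du_matches_ns(du=1.5, ns=6000, sample_rate_hz=SAMPLE_RATE_HZ))
```

A record where the check fails would be one where du still reflects signal that ns no longer counts, or where ns is simply absent and nothing can be verified.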