
Differing read count number in simplex and duplex basecalling

Open dpaudel-tb opened this issue 2 years ago • 2 comments

Hello, I ran simplex and duplex basecalling on the same dataset (dorado-0.4.1-linux-x64 with [email protected]). I expected the simplex run to yield the same number of reads as the duplex run filtered to simplex reads (dx:i:0 and dx:i:-1). However, the reported read counts differ. Is this expected, and which simplex reads should be trusted: those from direct simplex basecalling, or the simplex reads filtered from the duplex basecalling output? Thanks

| File | Read count | Tags included |
|---|---|---|
| simplex.bam | 11,130,442 | |
| duplex.bam | 13,130,264 | |
| filtered_duplex_only.bam | 1,966,966 | dx:i:1 |
| filtered_simplex_only.bam | 11,163,298 | dx:i:0; dx:i:-1 |
| filtered_simplex_NoDuplex_i0.bam | 7,980,822 | dx:i:0 |
| filtered_simplex_WithDuplex_i-1.bam | 3,182,476 | dx:i:-1 |
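
As a quick arithmetic check on the counts above, the following sketch confirms that the three `dx` categories add up to the duplex total and measures how far the direct simplex count is from the filtered simplex count. It uses only the numbers reported in this post; the variable names are illustrative.

```python
# Read counts copied from the table in this post.
counts = {
    "simplex.bam": 11_130_442,
    "duplex.bam": 13_130_264,
    "dx:i:1": 1_966_966,    # duplex reads
    "dx:i:0": 7_980_822,    # simplex reads, file name suffix "NoDuplex"
    "dx:i:-1": 3_182_476,   # simplex reads, file name suffix "WithDuplex"
}

# Simplex reads recovered from the duplex run.
simplex_from_duplex = counts["dx:i:0"] + counts["dx:i:-1"]

# All three dx categories together should account for every read in duplex.bam.
total_from_tags = simplex_from_duplex + counts["dx:i:1"]

# Extra simplex reads in the duplex run compared with the direct simplex run.
extra_simplex = simplex_from_duplex - counts["simplex.bam"]
percent_extra = 100 * extra_simplex / counts["simplex.bam"]

print(simplex_from_duplex)                      # 11163298
print(total_from_tags == counts["duplex.bam"])  # True
print(extra_simplex)                            # 32856
print(f"{percent_extra:.2f}%")                  # 0.30%
```

So the tag-filtered files are internally consistent, and the discrepancy is 32,856 reads, roughly 0.3% of the direct simplex count.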

dpaudel-tb, Nov 15 '23 18:11

Hi @dpaudel-tb - we have slightly different read splitting configurations for simplex vs duplex basecalling. This can lead to a different number of reads being split in each case, which is most likely the root cause of this count discrepancy. So I would suggest you go with the dx:i:0 + dx:i:-1 simplex reads from the duplex run.

We'll look at harmonizing the options between the 2 cases.
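
For anyone who wants to reproduce the per-tag counts, here is a minimal sketch (not taken from dorado itself) that tallies records by their `dx:i` tag from SAM-formatted text lines, such as those printed by `samtools view duplex.bam`. The function name `count_dx_tags` and the example records are hypothetical and made up purely for illustration.

```python
from collections import Counter
from typing import Iterable


def count_dx_tags(sam_lines: Iterable[str]) -> Counter:
    """Count SAM records by the value of their dx:i tag.

    Records without a dx tag are counted under the key "missing".
    """
    tally = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        dx = "missing"
        # Optional tags start at the 12th tab-separated column.
        for field in fields[11:]:
            if field.startswith("dx:i:"):
                dx = field[len("dx:i:"):]
                break
        tally[dx] += 1
    return tally


# Synthetic unaligned records (invented for this example, not real data).
example = [
    "@HD\tVN:1.6\tSO:unknown",
    "readA\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:0",
    "readB\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:-1",
    "readC\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:0",
    "readB;readD\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:1",
]
example_tally = count_dx_tags(example)
for value in ("1", "0", "-1"):
    print(f"dx:i:{value}\t{example_tally[value]}")
```

On real data, the same function could be given `sys.stdin` with the output of `samtools view duplex.bam` piped in; the simplex reads recommended above are then the sum of the `0` and `-1` entries.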

tijyojwad, Nov 16 '23 13:11

Thank you @tijyojwad!

dpaudel-tb, Nov 16 '23 14:11