Differing read counts in simplex and duplex basecalling
Hello, I ran simplex and duplex basecalling on the same dataset (dorado-0.4.1-linux-x64 with [email protected]). I expected the simplex run to give the same number of reads as the duplex run filtered for simplex reads (dx:i:0; dx:i:-1). However, the reported read counts differ. Is this expected, and which simplex reads should be trusted: the ones from direct simplex basecalling, or the simplex reads filtered out of the duplex output? Thanks
| File | ReadCount | Tags included |
|---|---|---|
| simplex.bam | 11,130,442 | |
| duplex.bam | 13,130,264 | |
| filtered_duplex_only.bam | 1,966,966 | dx:i:1 |
| filtered_simplex_only.bam | 11,163,298 | dx:i:0; dx:i:-1 |
| filtered_simplex_NoDuplex_i0.bam | 7,980,822 | dx:i:0 |
| filtered_simplex_WithDuplex_i-1.bam | 3,182,476 | dx:i:-1 |
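
As a quick sanity check on the table, the arithmetic can be sketched in a few lines of Python. The numbers are copied from the table above; the variable names are only illustrative. The check shows that the duplex output is internally consistent and that the two simplex sets differ by a small number of reads.

```python
# Read counts copied from the table above (variable names are illustrative).
simplex_direct = 11_130_442   # simplex.bam
duplex_total = 13_130_264     # duplex.bam
dx_1 = 1_966_966              # dx:i:1  (duplex reads)
dx_0 = 7_980_822              # dx:i:0  (simplex reads with no duplex offspring)
dx_minus_1 = 3_182_476        # dx:i:-1 (simplex reads with duplex offspring)

# Simplex reads recovered from the duplex run.
simplex_from_duplex = dx_0 + dx_minus_1

# The three dx classes should add up to the whole duplex file.
assert simplex_from_duplex + dx_1 == duplex_total

# Gap between the two ways of obtaining simplex reads.
extra = simplex_from_duplex - simplex_direct
print(simplex_from_duplex)               # 11163298
print(extra)                             # 32856
print(f"{extra / simplex_direct:.2%}")   # 0.30%
```

So the duplex run contains 32,856 more simplex reads than the direct simplex run, roughly 0.3% of the total.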
Hi @dpaudel-tb - we have slightly different read splitting configurations for simplex vs duplex basecalling, so a different number of reads can end up being split in each case. That's most likely the root cause of this count discrepancy. I would suggest you go with the dx:i:0 + dx:i:-1 simplex reads from the duplex run.
We'll look at harmonizing the options between the two cases.
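
For anyone who wants to reproduce the per-tag counts, here is a minimal sketch that tallies the `dx:i:` tag over SAM-formatted text lines. It uses only the Python standard library; the helper name `count_dx_tags` and the demo records are made up for illustration and are not part of dorado.

```python
from collections import Counter

def count_dx_tags(sam_lines):
    """Tally reads by their dx:i: tag value (hypothetical helper).

    Reads without a dx tag are counted under the key None.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        dx = None
        # Optional tags start at the 12th column of a SAM record.
        for field in fields[11:]:
            if field.startswith("dx:i:"):
                dx = int(field[len("dx:i:"):])
                break
        counts[dx] += 1
    return counts

# Synthetic unaligned records, for illustration only.
demo = [
    "@HD\tVN:1.6",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:0",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:-1",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:-1",
    "read2;read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tdx:i:1",
]

counts = count_dx_tags(demo)
simplex_reads = counts[0] + counts[-1]   # dx:i:0 + dx:i:-1
duplex_reads = counts[1]                 # dx:i:1
print(simplex_reads, duplex_reads)       # 3 1
```

On real data the lines could come from the text output of `samtools view duplex.bam` piped into a script that passes `sys.stdin` to the function.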
Thank you @tijyojwad!