threshold modification in modkit pileup

Open baibhav-bioinfo opened this issue 11 months ago • 1 comments

Hello, Appreciation for developing such helpful program.

I am working with DRS data and using "Dorado+modkit" for m6A modification analysis. I ran Dorado for modification basecalling, then used dorado aligner for genome mapping.

Now I am using modkit pileup to convert the bam information into bed file using following command.

$modkit pileup --threads 96 a1_DRACH_.bam a1_sup_DRACH_pileup.bed --log-filepath a1_sup_DRACH_pileup.log

calculated chunk size: 144, interval size 100000, processing 14400000 positions concurrently attempting to sample 10042 reads Using filter threshold 0.9238281 for A. Done, processed 1654243 rows. Processed ~8092362 reads and skipped ~41874 reads.

The filter threshold used here automatically as ~0.92, is this threshold same as "--modified-bases-threshold 0.92" in dorado while basecalling?

Jan 29 '25 15:01 baibhav-bioinfo

Hello @baibhav-bioinfo,

No this isn't the same. If you use --modified-bases-threshold 0.92 any canonical probability >=0.92 will be elided (i.e. not explicitly recorded). If a base bas a canonical probability <0.92 it will still be recorded. Whereas the filter threshold calculated by modkit is saying that any base modification probability <0.92 will be filtered out.

If you look at the SAM specification (page 7) it has the details on the "implicitly low probability of modification" calls - it's a bit in the weeds though. In general I would recommend using the default values in dorado.

Jan 30 '25 04:01 ArtRand