Inconsistent results when using call-mods + pileup vs. pileup directly
I noticed a discrepancy in the calls obtained by running modkit pileup on a BAM file produced with modkit call-mods, versus a pileup obtained directly from modkit pileup.
Using call-mods + pileup:
modkit call-mods -p 0.2 --mod-threshold a:0.8 input.bam called.bam
modkit pileup --no-filtering called.bam pileup_A.bed
Using pileup directly:
modkit pileup -p 0.2 --mod-threshold a:0.8 --no-filtering input.bam pileup_B.bed
I would naively expect the two resulting pileups to be identical, or at least close, in terms of reported %mA, and yet the average methylation differs significantly (>10% on my data, ~65% vs ~78%). Perhaps I am missing something in the way modkit call-mods operates?
For completeness: when using call-mods + pileup with the same filtering thresholds in the pileup step, I get results identical to running pileup with --no-filtering on the called BAM, i.e.:
modkit call-mods -p 0.2 --mod-threshold a:0.8 input.bam called.bam
modkit pileup -p 0.2 --mod-threshold a:0.8 called.bam pileup_C.bed
pileup_C is identical to pileup_A (but both are significantly different from pileup_B).
Hello @ppapasaikas,
I'm a little confused about how you're generating pileup_B.bed. All of the most recent versions will not let you pass the --filter-percentile (-p) option together with the --no-filtering flag. Could you tell me what version of modkit you're using ($ modkit --version)? Also, are you using modkit summary or pileup to assess the %mA? (Just so I can reproduce on my side.)
Thank you for the quick reply. You are right, that was a copy-paste error on my side. For pileup_B (the pileup-only bed generation) it should be:
modkit pileup -p 0.2 --mod-threshold a:0.8 input.bam pileup_B.bed
I am using modkit v0.2.3.
In all cases I am assessing %mA directly from the generated modkit pileup outputs.
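For illustration, a coverage-weighted average can be computed along these lines (a minimal sketch, assuming the standard modkit bedMethyl layout where column 4 is the mod code, column 10 is Nvalid_cov, and column 12 is Nmod, and that the file is fully tab-separated as with --only-tabs):
# Coverage-weighted %mA across all 'a' records of a bedMethyl pileup
# (sketch; column positions assume the standard modkit layout).
awk -F'\t' '$4 == "a" { nmod += $12; nvalid += $10 }
            END { printf "%%mA = %.2f\n", 100 * nmod / nvalid }' pileup_A.bed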
Hello @ppapasaikas,
You may get different results from running modkit pileup with specified thresholds compared to first running modkit call-mods with specified thresholds and then running pileup, because the filter thresholds are estimated slightly differently: in pileup only mapped reads and mapped bases are used, whereas modkit call-mods also includes unmapped reads and bases. You can see the difference in the logs:
[src/command_utils.rs::117][2024-02-16 18:08:07][DEBUG] estimated pass threshold 0.8417969 for primary sequence base A
The exact line number might be different if you update to a more recent version of modkit. What I would suggest is running modkit sample-probs with --only-mapped to estimate the filter threshold, then using that number for both of the other procedures (--filter-threshold ${sample_probs_output} --mod-threshold a:0.8).
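Concretely, the suggested procedure would look something like this (a sketch; the 0.84 below is a placeholder for whatever threshold sample-probs reports for base A on your data):
# 1. Estimate the pass threshold from mapped reads/bases only:
modkit sample-probs --only-mapped input.bam
# 2. Supply the reported A threshold (placeholder value here) to both procedures:
modkit call-mods --filter-threshold 0.84 --mod-threshold a:0.8 input.bam called.bam
modkit pileup --no-filtering called.bam pileup_A.bed
modkit pileup --filter-threshold 0.84 --mod-threshold a:0.8 input.bam pileup_B.bed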
I'll add an option to call-mods to make the threshold estimation exactly identical to pileup's. If you confirm this is the case (i.e., that using user-provided thresholds makes the final outputs the same), I'll change the issue to reflect adding the flag to call-mods.
I followed the suggested procedure:
# [src/command_utils.rs::117][2024-02-16 21:05:56][DEBUG] estimated pass threshold 0.7324219 for primary sequence base A
Then I created the two pileups:
modkit call-mods --filter-threshold 0.7324219 --mod-threshold a:0.8 input.bam called.bam
samtools index called.bam
modkit pileup --filter-threshold 0.7324219 --mod-threshold a:0.8 called.bam --only-tabs pileup1.bed
(Again, for the above case, replacing --filter-threshold 0.7324219 --mod-threshold a:0.8 with --no-filtering in the second step yields identical results.)
And
modkit pileup --filter-threshold 0.7324219 --mod-threshold a:0.8 input.bam --only-tabs pileup2.bed
The two resulting pileups are definitely not identical, not only in terms of fraction modified but also in Nvalid_cov and all the other Nxxx columns of the bedMethyl file. The difference is not subtle either: pileup1 (the version going through call-mods) consistently reports higher Nvalid_cov and lower fraction modified.
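To show what I mean, the two files can be compared site by site along these lines (a sketch; it assumes both pileups were produced with --only-tabs, cover the same sites, and follow the standard modkit layout where column 4 is the mod code, column 10 is Nvalid_cov, and column 11 is the percent modified):
# Print Nvalid_cov and %modified from the two pileups side by side, per site
# (sketch; requires bash for the process substitutions).
join -t $'\t' \
  <(awk -F'\t' -v OFS='\t' '$4=="a" {print $1":"$2, $10, $11}' pileup1.bed | sort -k1,1) \
  <(awk -F'\t' -v OFS='\t' '$4=="a" {print $1":"$2, $10, $11}' pileup2.bed | sort -k1,1) \
  | head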
If you cannot replicate this with an in-house dataset, I can share a sample BAM file.
Hello @ppapasaikas,
If you can share a BAM that exposes the issue, that would help a lot and I'd be more than happy to take a look. Thanks!
@ppapasaikas
Any update? I'm keen to investigate the problem if you're willing to share a BAM that exposes it.