modkit Inconsistent results when using call-mods + pileup or pileup directly

I noticed a discrepancy in the obtained calls when using in succession modkit pileup on a bam file obtained with modkit call-mods vs a pileup directly obtained from modkit pileup.

Using call-mods + pileup:

modkit call-mods -p 0.2 --mod-threshold a:0.8  input.bam called.bam
modkit pileup --no-filtering called.bam pileup_A.bed

Using pileup directly:

modkit pileup  -p 0.2 --mod-threshold a:0.8  --no-filtering input.bam pileup_B.bed

I would naively expect the two resulting pileups to be identical or at least close in terms of reported %mA and yet average methylation differs significantly (>10% on my data, ~65% vs 78%). Perhaps I am missing something in the way modkit call-mods operates?

Feb 16 '24 11:02 ppapasaikas

For completeness when using call-mods + pileup using the same filtering thresholds I get identical results as when using pileup with --no-filtering so:

modkit call-mods -p 0.2 --mod-threshold a:0.8  input.bam called.bam
modkit pileup -p 0.2 --mod-threshold a:0.8 called.bam pileup_C.bed

pileup_C is identical to pileup_A (but both are significantly different to pileup_B).

Feb 16 '24 12:02 ppapasaikas

Hello @ppapasaikas,

I'm a little confused how you're generating pileup_B.bed. All of the most recent versions will not let you pass the --filter-percentile (-p) option along with the --no-filtering flag. Could you tell me what version of modkit you're using ($ modkit --version)? Also, are you using modkit summary to assess the %mA or pileup? (just so I can reproduce on my side).

Feb 16 '24 15:02 ArtRand

Thank you for the quick reply. You are right, that was a copy pasting error on my side. For pileup_B (pileup only bed generation) it should be:

modkit pileup -p 0.2 --mod-threshold a:0.8 input.bam pileup_B.bed

I am using modkit v.0.2.3.

I am using directly the generated modkit pileup outputs in all cases to assess %mA.

Feb 16 '24 15:02 ppapasaikas

Hello @ppapasaikas,

You may get different results from running modkit pileup with specified thresholds compared to first running modkit call-mods with specified thresholds then using pileup because the filter thresholds are estimated slightly differently. In pileup only mapped reads and mapped bases are used. Whereas in modkit call-mods unmapped reads and bases are used. You can see the difference in the logs:

src/command_utils.rs::117][2024-02-16 18:08:07][DEBUG] estimated pass threshold 0.8417969 for primary sequence base A

The exact line number might be different if you update to a more recent version of modkit. What I would suggest is running modkit sample-probs with --only-mapped to estimate the filter threshold then use this number for both other procedures (--filter-threshold ${sample_probs_output} --mod-threshold a:0.8).

I'll add an option to call-mods to make the threshold estimation exactly identical to pileup. If you confirm this is the case, (i.e. using user-provided thresholds makes the final outputs the same), I'll change the issue to reflect adding the flag to call-mods.

Feb 16 '24 18:02 ArtRand

I followed the suggested procedure:

# [src/command_utils.rs::117][2024-02-16 21:05:56][DEBUG] estimated pass threshold 0.7324219 for primary sequence base A Then created the two pileups:

modkit call-mods  --filter-threshold 0.7324219 --mod-threshold a:0.8  input.bam called.bam
samtools index called.bam
modkit pileup  --filter-threshold 0.7324219 --mod-threshold a:0.8 called.bam --only-tabs pileup1.bed

(Again for the above case in the second step replacing --filter-threshold 0.7324219 --mod-threshold a:0.8 with --no-filtering yields identical results.)

And

modkit pileup --filter-threshold 0.7324219 --mod-threshold a:0.8 input.bam --only-tabs pileup2.bed

The two resulting pileups are definitely not identical. Not only in terms of fraction modified but also Nvalid_cov and all other Nxxx columns of the bedmethyl file. The difference is not subtle either. pileup1 (the version going through call-mods) reports consistently higher Nvalid_cov and lower fraction modified.

In case you cannot replicate with an in-house dataset I can share a sample bam file.

Feb 16 '24 20:02 ppapasaikas

Hello @ppapasaikas,

If you can share a BAM that exposes the issue that would help a lot and I'd be more than happy to take a look. Thanks!

Feb 17 '24 15:02 ArtRand

@ppapasaikas

Any update, keen to investigate the problem if you're willing to share a BAM that exposes it.

Mar 14 '24 14:03 ArtRand