modkit modkit summary shows unexpected proportion of methylation - is my methylation information being lost

When I run modkit summary on my samples I am getting a strange batch effect. I have two batches of samples:

1st batch re-basecalled with dorado standalone software v0.6.1 model [email protected] and parameter --modified-bases 5mC_5hmC

2nd batch basecalled from dorado server v7.3.9 basecall config dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac_prom.cfg

so the model should be essentially the same between the two batches.

I converted uBAMs to fastq with samtools fastq -T "*" I aligned to GRCh38 with minimap2 -t 32 -ax map-ont -y -Y --MD

When I run modkit summary on samples from the 1st batch I get an example output:

# bases             C 
# total_reads_used  10042 
# count_reads_C     10042 
# pass_threshold_C  0.7636719 
 base  code  pass_count  pass_frac    all_count  all_frac 
 C     -     354650      0.3519645    372751     0.33318585 
 C     h     26080       0.025882516  55114      0.049264 
 C     m     626900      0.622153     690883     0.61755013

When I run modkit summary on samples from the 2nd batch I get an example output:

# bases             C 
# total_reads_used  10042 
# count_reads_C     10042 
# pass_threshold_C  0.73828125 
 base  code  pass_count  pass_frac     all_count  all_frac 
 C     -     11116207    0.95703727    12023879   0.9322321 
 C     h     56736       0.0048846216  203618     0.015786855 
 C     m     442286      0.038078114   670449     0.05198107

I can see that all the 2nd batch of samples have virtually no 5mC being detected at all whereas it is the majority in the 1st batch. I would not expect global methylation patterns to differ on these scale (these are cancer samples from the same cohort and cancer type). When I investigate the fastq (grep ^@) and bam files (samtools view BAM_FILE | less) for the 2nd batch of samples, there are MM and ML tags present.

How can I investigate this further or work out where the issue lies?

Apr 15 '25 14:04 eesiribloom

Hello @eesiribloom,

A couple things to quickly check.

Run modkit modbam check-tags [-n 10000] on both samples to make sure the MM/ML tags made it across OK. Sometimes using minimap for alignment can loose or corrupt them. See if the headers look like this:

+------------+-------+
| tag_header | count |
+------------+-------+
| C+h?       | 30    |
| C+m?       | 30    |
+------------+-------+

or like this

+------------+-------+
| tag_header | count |
+------------+-------+
| C+h.       | 30    |
| C+m.       | 30    |
+------------+-------+

The latter means that you have the "all context" model which will make a C-modification call at all cytosines instead of just at CpGs. Obviously this will change the proportions.

Run modkit summary --region [region] --no-filtering on a section of the genome. It is possible that the sampling algorithm used in summary is introducing a bias.

Let me know what you find.

Apr 15 '25 15:04 ArtRand

Hi @ArtRand that makes a lot of sense, I did wonder that myself. I ran check-tags and I can see one sample has an all context model and the other must have a CpG model. Would this change the proportions so drastically to go from CpG only to global 5mC? So I could just restrict analysis to --CG motifs /dinucloetides for all samples then?

Apr 16 '25 11:04 eesiribloom

Would it be helpful if there was a similar --cpg option with the summary and sample-probs commands as there is with modkit pileup?

Apr 16 '25 11:04 eesiribloom

Hello @eesiribloom,

Would this change the proportions so drastically to go from CpG only to global 5mC?

Yes I think it would. In your first sample you're looking (summarizing) the methylation levels at just CpG dinucleotides, which to the best of my knowledge are the only sequences (in humans) that are eligible to be methylated. Whereas in the second sample you're looking at the superset of all Cytosines, most of which will not be in CpGs and thus we expect them to be unmodified.

So I could just restrict analysis to --CG motifs /dinucloetides for all samples then?

If you're focusing on CpG methylation (and differences) then using the various --cpg or --motif CG 0 flags and options makes sense. Often when I'm performing this kind of analysis I often use the --preset traditional option (docs here) that will ignore the 5hmC probabilities and combine the (+)-strand and (-)-strand calls of the CpGs as well.

Would it be helpful if there was a similar --cpg option with the summary and sample-probs commands as there is with modkit pileup?

That's a good idea. You can get an idea of that right now by running

modkit adjust-mods $modbam - --cpg | modkit summary - --no-filtering

But I agree that since it's in adjust-mods it should be doable inline in summary/sample-probs.

Apr 16 '25 15:04 ArtRand