modkit summary shows unexpected proportion of methylation - is my methylation information being lost
When I run modkit summary on my samples I am getting a strange batch effect. I have two batches of samples:
1st batch re-basecalled with dorado standalone software v0.6.1
model [email protected]
and parameter --modified-bases 5mC_5hmC
2nd batch basecalled from dorado server v7.3.9
basecall config dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac_prom.cfg
so the model should be essentially the same between the two batches.
I converted uBAMs to fastq with samtools fastq -T "*"
I aligned to GRCh38 with minimap2 -t 32 -ax map-ont -y -Y --MD
When I run modkit summary on samples from the 1st batch I get an example output:
# bases C
# total_reads_used 10042
# count_reads_C 10042
# pass_threshold_C 0.7636719
base code pass_count pass_frac all_count all_frac
C - 354650 0.3519645 372751 0.33318585
C h 26080 0.025882516 55114 0.049264
C m 626900 0.622153 690883 0.61755013
When I run modkit summary on samples from the 2nd batch I get an example output:
# bases C
# total_reads_used 10042
# count_reads_C 10042
# pass_threshold_C 0.73828125
base code pass_count pass_frac all_count all_frac
C - 11116207 0.95703727 12023879 0.9322321
C h 56736 0.0048846216 203618 0.015786855
C m 442286 0.038078114 670449 0.05198107
I can see that all the 2nd batch of samples have virtually no 5mC being detected at all whereas it is the majority in the 1st batch. I would not expect global methylation patterns to differ on these scale (these are cancer samples from the same cohort and cancer type). When I investigate the fastq (grep ^@) and bam files (samtools view BAM_FILE | less) for the 2nd batch of samples, there are MM and ML tags present.
How can I investigate this further or work out where the issue lies?
Hello @eesiribloom,
A couple things to quickly check.
- Run
modkit modbam check-tags [-n 10000]on both samples to make sure the MM/ML tags made it across OK. Sometimes usingminimapfor alignment can loose or corrupt them. See if the headers look like this:
+------------+-------+
| tag_header | count |
+------------+-------+
| C+h? | 30 |
| C+m? | 30 |
+------------+-------+
or like this
+------------+-------+
| tag_header | count |
+------------+-------+
| C+h. | 30 |
| C+m. | 30 |
+------------+-------+
The latter means that you have the "all context" model which will make a C-modification call at all cytosines instead of just at CpGs. Obviously this will change the proportions.
- Run
modkit summary --region [region] --no-filteringon a section of the genome. It is possible that the sampling algorithm used insummaryis introducing a bias.
Let me know what you find.
Hi @ArtRand that makes a lot of sense, I did wonder that myself. I ran check-tags and I can see one sample has an all context model and the other must have a CpG model.
Would this change the proportions so drastically to go from CpG only to global 5mC?
So I could just restrict analysis to --CG motifs /dinucloetides for all samples then?
Would it be helpful if there was a similar --cpg option with the summary and sample-probs commands as there is with modkit pileup?
Hello @eesiribloom,
Would this change the proportions so drastically to go from CpG only to global 5mC?
Yes I think it would. In your first sample you're looking (summarizing) the methylation levels at just CpG dinucleotides, which to the best of my knowledge are the only sequences (in humans) that are eligible to be methylated. Whereas in the second sample you're looking at the superset of all Cytosines, most of which will not be in CpGs and thus we expect them to be unmodified.
So I could just restrict analysis to --CG motifs /dinucloetides for all samples then?
If you're focusing on CpG methylation (and differences) then using the various --cpg or --motif CG 0 flags and options makes sense. Often when I'm performing this kind of analysis I often use the --preset traditional option (docs here) that will ignore the 5hmC probabilities and combine the (+)-strand and (-)-strand calls of the CpGs as well.
Would it be helpful if there was a similar --cpg option with the summary and sample-probs commands as there is with modkit pileup?
That's a good idea. You can get an idea of that right now by running
modkit adjust-mods $modbam - --cpg | modkit summary - --no-filtering
But I agree that since it's in adjust-mods it should be doable inline in summary/sample-probs.