modkit Comparing between methylation patterns of different samples

Dear @ArtRand,

I used dorado aligner as you suggested to map each bam file containing information about methylation using nanopore sequencing. I then run modkit pileup on these bam files. My goal is to compare methylation patterns between two bacterial samples and to get statistically significant differences in methylation and their coordinates in the genome.

I have several questions, I would highly appreciate your feedback:

I have several bam files for each sample (between 100 to 200 files). I understand that for each individual raw read a corresponding non-alignment bam file was produced. How do you suggest to handle these multiple bam files per sample? Should I combine them for each sample and then run midkit pileup?
I run modkit pileup with default parameters. In the output bedmethyl file in most of the rows I have zeros in the percent_modified and count_modified columns and also zeros in the count_other_mode,count_delete,count_fail,count_diff,count_nocall columns. What does this mean? How do I get the statistically significant methylated coordinates?
Once I wish to compare between 2 samples, should I normalize the bam files or the output bedmethyl files somehow? How do you suggest to compare between samples systematically? I compared by eye using IGV but since I have that many bam files it is not informative to compare only several of them.

Thank you very much in advance.

All the best, Raya

Oct 09 '24 12:10 Raya-Faigenbaum-Romm

Hello @Raya-Faigenbaum-Romm,

I have several bam files for each sample (between 100 to 200 files). I understand that for each individual raw read a corresponding non-alignment bam file was produced. How do you suggest to handle these multiple bam files per sample? Should I combine them for each sample and then run midkit pileup?

I hope you don't have one BAM file per read. But yes I suggest combining them with samtools merge, sort the BAM, make an index, then run modkit pileup.

I run modkit pileup with default parameters. In the output bedmethyl file in most of the rows I have zeros in the percent_modified and count_modified columns and also zeros in the count_other_mode,count_delete,count_fail,count_diff,count_nocall columns. What does this mean? How do I get the statistically significant methylated coordinates?

The pileup command doesn't compute which positions are "significantly methylated", it only performs a counting of how many reads report each base modification status at each genomic position. If you have very low methylation in your sample, most of the records will be zero as you've observed. If you want to filter to positions that are $\geq X%$ methylated, I suggest using awk or similar.

awk -v X=${X} '($11>X)' pileup.bed > pileup_gt${X}.bed

Once I wish to compare between 2 samples, should I normalize the bam files or the output bedmethyl files somehow? How do you suggest to compare between samples systematically? I compared by eye using IGV but since I have that many bam files it is not informative to compare only several of them.

If you have two samples, you can use modkit dmr pair. You don't need to do any normalization ahead of time. First you need the pileup bedmethyl files from each sample. Then you need to compress them with bgzip and index them with tabix. The HTS-lib website has instructions on how to install these tools. The documentation has example commands and the output schema.

Oct 10 '24 00:10 ArtRand

Hello @ArtRand,

Thank you very much for your quick and detailed answer. It is very helpful for me. Which value do you suggest to filter the 11th column in the awk command that you suggested? I try to understand what is considered as a methylated position.

All the best, Raya

Oct 10 '24 15:10 Raya-Faigenbaum-Romm

Hello @ArtRand ,

I have a follow-up question, I run modkit dmr pair. I do not understand how can I know which positions statistically significantly differ in their methylation between the two samples? I ordered the output file by the 5th column (score). Is there some cutoff that you suggest? The question that I wish to answer is: do the samples differ in their methylation pattern? and if yes, in which positions?

Thank you very much for the great documentation and help, Raya

Oct 14 '24 16:10 Raya-Faigenbaum-Romm

Hello @Raya-Faigenbaum-Romm,

Sorry about the delay.

Which value do you suggest to filter the 11th column in the awk command that you suggested? I try to understand what is considered as a methylated position.

This is really hard to say for sure without knowing more about your experiment, but here's a guess. Since you're looking at bacterial samples, my (non-expert) advice would be $\geq$ 90% modified would be a "modified site". Another option is, you can use modkit motif search and evaluate to find modified sequences and investigate their modification rates in different samples. For example, you could do this on both samples and then compare which motifs are found in either one.

I have a follow-up question, I run modkit dmr pair. I do not understand how can I know which positions statistically significantly differ in their methylation between the two samples? I ordered the output file by the 5th column (score). Is there some cutoff that you suggest?

If you've run "single-site analysis" (send me the command if you're unsure) you can use the MAP-based p-value. I usually use 0.01 as my threshold, it's also the default value in the segmentation HMM.

The question that I wish to answer is: do the samples differ in their methylation pattern? and if yes, in which positions?

If you're working in bacteria, I would try using the motif search functionality as well as the dmr pair function with --segment ${segment_fp} argument.

Oct 15 '24 21:10 ArtRand

Hello @ArtRand,

Thank you very much for your explanation. It is very helpful for me and I enjoy how fast modkit runs.

All the best, Raya

Oct 22 '24 16:10 Raya-Faigenbaum-Romm

You bet, happy to help.

Oct 22 '24 17:10 ArtRand

@ArtRand

Regarding the filtering of positions based on percent_modified, I have set a threshold of >20%. Is there any published or commonly accepted justification for using this threshold?

Secondly, when percent_modified is 100%, there can be two different scenarios:

valid_coverage: 6, count_modified: 6

valid_coverage: 1, count_modified: 1

This suggests that simply filtering by percent_modified > 20% may introduce bias, as it does not account for coverage depth.

May 07 '25 08:05 Tang-pro