Differentially methylated CpG Islands using single-site analysis

Open lance0499 opened this issue 4 months ago • 3 comments

Hi,

I am using modkit DMR to find differentially methylated CpG islands between two samples (treated vs. untreated). I have successfully identified DMRs using CpG islands as the "region" file; however, this output does not include p-values.

That said, I am aware that single-site analysis provides MAP p-values.

Here are my questions:

  • Single-site analysis does not take a region file as an input. Do you have any idea on how to filter out the CpG islands only after running the single-site analysis?
  • As far as I understand, a MAP p-value is provided for each base or single-site. How does one get one single p-value per CpG island - and not per CpG site?
  • When assigning MAP p-values does modkit DMR incorporate any multiple testing correction? Or does this have to be done post-analysis?
  • Are there any published articles using your DMR method that we can use as a reference?

Thank you very much for all the help and your valuable time put into this. Best regards.

lance0499 avatar Aug 27 '25 08:08 lance0499

Hello @lance0499, really sorry about the delay getting back to you. To your questions:

Single-site analysis does not take a region file as an input. Do you have any idea on how to filter out the CpG islands only after running the single-site analysis?

I would use bedtools intersect on the input bedMethyls to select only positions overlapping with CpG islands. Then run single-site analysis.
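For readers without bedtools at hand, the same selection can be sketched in plain Python. This is a stand-in for the bedtools step, not what bedtools does internally, and it assumes the bedMethyl and CpG island files are whitespace-delimited with chromosome, start, and end in the first three columns (0-based, half-open). The function names are hypothetical.

```python
import bisect
from collections import defaultdict


def load_islands(bed_lines):
    """Parse BED lines into per-chromosome sorted, merged (start, end) lists."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_island(islands, chrom, pos):
    """True if the 0-based position falls inside any island on chrom."""
    intervals = islands.get(chrom)
    if not intervals:
        return False
    # Index of the last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def filter_bedmethyl(bedmethyl_lines, islands):
    """Yield only the bedMethyl lines whose start position is in an island."""
    for line in bedmethyl_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if in_island(islands, fields[0], int(fields[1])):
            yield line
```

Each input bedMethyl would be filtered separately, and the filtered files then passed to the single-site analysis.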

As far as I understand, a MAP p-value is provided for each base or single-site. How does one get one single p-value per CpG island - and not per CpG site?

Modkit doesn't output a p-value for regions. I don't think it's impossible to do; @Ge0rges suggested the Kolmogorov-Smirnov test as a potential option here. There are two main reasons it is challenging to design such a test: (1) regions can all be different sizes, and you don't want p-values to be inversely correlated with region size, and (2) some regions are so large that small differences in methylation will be significant because the test is overpowered. The K-S test is the best suggestion I've heard so far. You could use --segment and see what fraction of sites are classified as different, or, with the single-site analysis, look at what fraction are significantly different. I appreciate that neither of these gets you precisely what you asked for.
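The "fraction of significantly different sites" idea could be sketched roughly as follows. It assumes the single-site results have already been parsed into `(chrom, pos, p_value)` tuples and the CpG islands into `(chrom, start, end, name)` tuples; the function name, the tuple layout, and the `alpha` threshold are illustrative choices, not part of modkit.

```python
def fraction_significant(sites, regions, alpha=0.05):
    """Summarize per-site p-values by region.

    sites:   list of (chrom, pos, p_value), positions 0-based
    regions: list of (chrom, start, end, name), half-open intervals
    Returns {name: (n_significant, n_sites, fraction)}; the fraction is NaN
    for regions that contain no tested sites.
    """
    results = {}
    for chrom, start, end, name in regions:
        # Linear scan per region: fine for a sketch, slow at genome scale.
        pvals = [p for c, pos, p in sites if c == chrom and start <= pos < end]
        n_sig = sum(p < alpha for p in pvals)
        fraction = n_sig / len(pvals) if pvals else float("nan")
        results[name] = (n_sig, len(pvals), fraction)
    return results
```

This yields a descriptive summary per island, not a region-level p-value, and the fraction still depends on how many sites in the island were covered.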

When assigning MAP p-values does modkit DMR incorporate any multiple testing correction? Or does this have to be done post-analysis?

There is no multiple testing correction; the output is a raw p-value, so you have to apply the correction post-analysis.
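One common post-analysis choice is the Benjamini-Hochberg procedure. The sketch below assumes the raw per-site p-values have been read into a Python list and that they can be treated like ordinary p-values for this purpose, which the thread does not confirm. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Sites whose adjusted value falls below the chosen false discovery rate (for example 0.05) would then be called significant. The correction should be run over all sites tested, not only those inside a region of interest.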

Are there any published articles using your DMR method that we can use as a reference?

This paper uses it. I'm working on some comparisons to DSS and other popular methods and I'll put them on the docs when they're done. I understand that you need to use a method that the community trusts. Hopefully a comparison does this, but I can't control who publishes with the method and who uses an older tool. The goal of modkit dmr is to be easy to use and explore your data. The statistics are simple enough that I believe they stand up on their own (i.e. you could do the same tests w/o Modkit).

ArtRand avatar Sep 03 '25 18:09 ArtRand

Shameless plug to say that my paper was very recently published and uses modkit DMR (Table 2). When I used the KS test, it was to compare identical regions between treatments (in my case whole contigs or genomes), which alleviates the p-value bias issue. I think comparing two different regions to each other is a challenge.
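As a rough illustration of comparing the same region between treatments, the two-sample KS statistic can be computed from the per-site methylation fractions of each condition. This is a minimal sketch, not the code used in the paper; it assumes each input is a list of per-site fractions from one condition within the same region, and the function name is hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples so that
        # ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d
```

If SciPy is available, `scipy.stats.ks_2samp(a, b)` returns the same statistic along with a p-value.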

Ge0rges avatar Sep 06 '25 02:09 Ge0rges

Also, for the record, I've tried Epps-Singleton, energy distance, and 1-Wasserstein, and they all correlate very well with KS. Epps-Singleton correlates slightly less in my case (~0.7 instead of ~0.9), which I find interesting (i.e., it is picking up on different features of the distribution).

Ge0rges avatar Sep 06 '25 02:09 Ge0rges