
Do you use `modkit dmr`?

[Open] ArtRand opened this issue 10 months ago • 6 comments

Hello everyone.

I'd like to know if you're having issues with modkit dmr, in either the pair or multi variety.

If you're not using it (but you are doing some kind of differential methylation analysis), why not?

Are the outputs hard to interpret, not helpful, or not compatible with other methods? Is it too slow, or (worse) are there bugs?

One thing on my immediate roadmap is to benchmark modkit dmr against a published tool such as DSS on an open dataset. I'm also experimenting with a method to get p-values for regions, so you could find significantly differentially methylated regions.

If you're using it and liking it, throw a :+1: on here for fun. But don't hold back if there are things that could be better. Of course I'm not promising I can get to all of them.

ArtRand avatar Jan 31 '25 22:01 ArtRand

I've been using DMR quite a bit and it has been fast and intuitive! Thanks to the devs for making Modkit a very user-friendly tool!

I do have two very minor questions that I couldn't really find answers to elsewhere. In both cases, I usually perform paired, site-specific analyses, such as:

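# Two replicates per condition (-a/-b passed once per replicate), testing A and T
# as the primary sequence bases, and keeping only sites with >= 10 valid calls.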
modkit dmr pair \
-a sample1_rep1.bed.gz -a sample1_rep2.bed.gz \
-b sample2_rep1.bed.gz -b sample2_rep2.bed.gz \
-o DMR.bed \
--ref reference.fasta \
--base A --base T \
--min-valid-coverage 10
  1. When analyzing the outputs with balanced replicates, would you recommend always analyzing the balanced effect sizes and p-values (rather than the unbalanced/raw values)? The effect sizes seem to agree well between raw and balanced, but the p-values agree less; see the attached scatter plots below. I'm not sure if this is expected behavior or if something about my analysis may be off.

  2. This one is even more minor. I often analyze modification mutants where the effects are quite strong, and a substantial fraction of my p-values (balanced or raw) == 0. I realize the exact p-value past a certain point isn't very interesting/informative, but I was wondering whether the range of reporting could/should be expanded beyond ~1e-50? This would just allow me to avoid a massive clump of points at a very similar -log10(p-value) on volcano plots and similar graphics. Again, extremely minor and not actually a Modkit issue. (A sketch of how I currently work around this is below the plots.)

[Attached: scatter plots comparing raw vs. balanced effect sizes and raw vs. balanced MAP p-values]
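For reference, here is roughly how I make these plots. This is just a sketch; the column names (map_pvalue, balanced_map_pvalue, balanced_effect_size) are my guesses from the docs, so check them against the actual header of your dmr pair output.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dmr pair output; assumes a tab-separated table with a header row.
df = pd.read_csv("DMR.bed", sep="\t")

# Floor zero p-values at the smallest positive float before taking -log10,
# so fully saturated sites don't all collapse onto a single clump of points.
floor = np.nextafter(0, 1)
raw = -np.log10(df["map_pvalue"].clip(lower=floor))
bal = -np.log10(df["balanced_map_pvalue"].clip(lower=floor))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: raw vs. balanced MAP p-values (the scatter plots above).
axes[0].scatter(raw, bal, s=5, alpha=0.5)
axes[0].set_xlabel("-log10 raw MAP p-value")
axes[0].set_ylabel("-log10 balanced MAP p-value")

# Right panel: a volcano plot using the balanced values.
axes[1].scatter(df["balanced_effect_size"], bal, s=5, alpha=0.5)
axes[1].set_xlabel("balanced effect size")
axes[1].set_ylabel("-log10 balanced MAP p-value")

fig.tight_layout()
plt.show()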

Thanks a lot!

kylepalos avatar Feb 06 '25 18:02 kylepalos

@kylepalos Thanks for this!

When analyzing the outputs with balanced replicates, would you recommend always analyzing the balanced effect sizes and p-values (rather than the unbalanced/raw values)? The effect sizes seem to agree well between raw and balanced, but the p-values agree less; see the attached scatter plots below. I'm not sure if this is expected behavior or if something about my analysis may be off.

Let me take a look into this.

ArtRand avatar Feb 08 '25 01:02 ArtRand

I wanted to chime in here in support of this command. I have explored many different methods for quantifying how differentially methylated a nucleotide is between samples. I've typically looked for:

  • Significance metric (p-value)
  • Effect size metric
  • Whether methylation type is taken into account
  • On regions, whether position is taken into account
  • Whether methylation fraction is taken into account
  • Whether the test can be corrected for differences in coverage
  • Whether the test can take advantage of replicates
  • Whether the test can handle different numbers of replicates AND different coverage per replicate

This is a tall order. I started with modkit dmr, went around the block a few (many) times, and finally settled on modkit dmr pair, which satisfies all of these criteria. I think the tool you've developed is excellent. It is on my list of things to do in the future to explore a contribution integrating it into anvi'o (perhaps at a workshop in September).

~For me the only thing lacking is a robust study testing the command's output on a controlled dataset, perhaps including a benchmark against other relevant statistical tests. Perhaps that will one day be conducted by a member of the scientific community.~ That's on your TODO! Great!

The output is perfectly suitable for me; I've built a small software suite that reads that data in, along with other modkit outputs, genetic annotations, etc., to produce plots that allow for a nice analysis.

Thanks for developing it.

Ge0rges avatar Feb 23 '25 23:02 Ge0rges

Hello @ArtRand,

Thanks for developing this tool.

The documentation describes the columns of the modkit dmr pair output very well; what I can't find is which metrics we should use to call a position differentially modified between two samples.

In my case, I used dmr pair, and this is the command line: modkit dmr pair -a $bed1 -b $bed2 -o $out/dmr_ino_m6a.txt --ref $ref --base A --log-filepath $out/ino_m6a_dmr.log

I was wondering which column to look at to decide whether a position is differentially modified. When I use the MAP-based p-value, I get 23 positions, but those have very low scores (<20); and when I use the score column, only the first three positions have a significant MAP-based p-value. This leaves me confused about the results. What criteria do you advise using for the dmr pair output, to be confident that a position is differentially modified between my two samples?
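For context, here is a sketch of the two filters I compared; the column names (score, map_pvalue) are what I assumed from the documentation, so they may need adjusting to the actual output header.

import pandas as pd

# Load the dmr pair output; assumes a tab-separated table with a header row.
df = pd.read_csv("dmr_ino_m6a.txt", sep="\t")

# Filter 1: positions significant by the MAP-based p-value (23 in my case).
by_pvalue = df[df["map_pvalue"] < 0.05]

# Filter 2: the same number of top-scoring positions.
by_score = df.sort_values("score", ascending=False).head(len(by_pvalue))

# The two criteria select largely different positions, hence my confusion.
overlap = pd.merge(by_pvalue, by_score)
print(len(by_pvalue), len(by_score), len(overlap))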

Thanks in advance. Rania

rania-o avatar Mar 17 '25 15:03 rania-o

Hello @Ge0rges,

I'm curious about the way you analyze the outputs of modkit. Is your tool public, and can it be used by everyone? If you're open to it, I'd love to see an example of the visualizations you're generating.

Thanks in advance, Rania

rania-o avatar Mar 17 '25 15:03 rania-o

@kylepalos,

Regarding the comparison of the balanced vs. "raw" MAP-based p-values: I think the off-diagonal points could be due to high apparent within-condition variance. I bet in most cases this is due to sampling bias: one sample has somewhat low coverage and happens to have a different %-modification compared to the others in that condition. The "balanced" calculation will tend to up-weight this sample compared to the "raw" calculation, which is essentially a coverage-weighted calculation. It might be interesting to see whether these sites cluster together on the reference; I can also think about adding some within-condition variance metrics.
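To illustrate the weighting difference, here's a toy example (just an illustration, not the exact calculation modkit performs):

# Per-replicate modified-call and valid-coverage counts for one condition;
# the third replicate has low coverage and an unusual %-modification.
mod_counts = [480, 510, 3]
valid_counts = [1000, 1000, 10]

# "Raw": pool counts across replicates, so each replicate is weighted by its
# coverage and the low-coverage outlier barely moves the estimate.
raw = sum(mod_counts) / sum(valid_counts)

# "Balanced": average the per-replicate fractions, so every replicate gets
# equal weight and the outlier pulls the estimate noticeably.
balanced = sum(m / v for m, v in zip(mod_counts, valid_counts)) / len(mod_counts)

print(f"raw: {raw:.3f}  balanced: {balanced:.3f}")  # raw: 0.494  balanced: 0.430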

ArtRand avatar Mar 31 '25 22:03 ArtRand