modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Filtering out significantly DMR sites from `modkit dmr` output.

Open ArnavBharti opened this issue 3 months ago • 15 comments

@ArtRand

I have two datasets - one for 5mC and one for 5mC 5hmC. I have run modkit dmr pair on both of them separately.

For finding significantly DMR sites in 5mC dataset I am using the following criteria:

  1. pct_a_sample > 50%
  2. pct_b_sample > 50%
  3. effect_size < 0 (I want to look at more methylation in experiment samples)
  4. score > 4 (4 came from looking at distribution of all scores in the output TSV)
  5. MAP-based p-value < 0.05.

For 5mC 5hmC dataset I wanted to ask, effect_size, MAP-based p-value and score are affected by both m and h. Is there a way to compute these values for h only. Does that make sense biologically and make sense according to the DMR algorithm used by modkit.

For effect size I was planning, instead of looking at sample_fraction_modified for a and b, use sample_percents for only h for a and b and calculate the difference. Can score and MAP-based p-value be adjusted similarly?

I have data for m already from 5mC and my goal with 5mC 5hmC dataset was to get DMR sites for h only.

Command used:

modkit dmr pair -a <sampleA_1.bed.gz> -a <sampleA_2.bed.gz> -b <sampleB_1.bed.gz> -b <sampleB_2.bed.gz> -o <output_prefix> --segments <segments_file> --ref <ref>.fasta --base C --threads 4 --log-filepath <log_file> --header

ArnavBharti avatar Sep 13 '25 12:09 ArnavBharti

Hello @ArnavBharti,

For 5mC 5hmC dataset I wanted to ask, effect_size, MAP-based p-value and score are affected by both m and h. Is there a way to compute these values for h only. Does that make sense biologically and make sense according to the DMR algorithm used by modkit.

Correct, the MAP-based p-value and effect size are "modified vs unmodified" metrics, so changes from 5mC to 5hmC will not be reflected in these metrics. I've thought about doing a "one-vs-all" comparison for occasions where there are multiple modifications.

It makes sense biologically to compare 5hmC vs (¬5hmC) but the only metric in dmr that reflects this change is the score. I think it is reasonable to want to use effect_size and MAP-based p-value here, it's just not exposed in the API. Let me give it a think and circle back.

For effect size I was planning, instead of looking at sample_fraction_modified for a and b, use sample_percents for only h for a and b and calculate the difference. Can score and MAP-based p-value be adjusted similarly?

Score already takes into account changes between base modification states, but doesn't have a "significance" interpretation. The command you have looks fine, the segments found by the HMM will account for changes in 5hmC levels.

ArtRand avatar Sep 16 '25 18:09 ArtRand

Hi @ArtRand,

Output of 5mC gives only m modifications. And its result - which nucleotides are differentially methylated on 5mC is what I want.

Similarly, I want to know which nucleotides are differentially methylated on 5hmC. For this I want to have the three columns - effect size, score and MAP-based p-value - only for h modifications not h and m which is currently in the software.

It should be as if instead of 5mC 5hmC it was only 5hmC based. The trouble is Guppy modified basecalling provided config for only 5mC 5hmC.

I think it is reasonable to want to use effect_size and MAP-based p-value here, it's just not exposed in the API

Will this form of analysis be in future release of modkit?

Or should I try to recreate the algorithm used for these columns on my own. That from the first glance at code seemed like a time taking task. It would be easier if there was a way to modify currently written software to support that.

Alternate Idea

What if from my pileup result I remove rows for m modifications? Currently for each nucleotide I have results similar to

Pv_Sal1_chr01	1387	1388	h	1	-	1387	1388	255,0,0	1 0.00 0 0 1 0 0 0 0
Pv_Sal1_chr01	1387	1388	m	1	-	1387	1388	255,0,0	1 100.00 1 0 0 0 0 0 0

If I remove m rows from here, and then run the modkit dmr pair command, would DMR then be based on only h?

ArnavBharti avatar Sep 20 '25 11:09 ArnavBharti

Hi @ArtRand ,

Is there any updates on this? It would be helpful to have your opinion.

ArnavBharti avatar Sep 24 '25 11:09 ArnavBharti

@ArnavBharti sorry about the delay

Will this form of analysis be in future release of modkit?

Or should I try to recreate the algorithm used for these columns on my own. That from the first glance at code seemed like a time taking task. It would be easier if there was a way to modify currently written software to support that.

Yes I'm going add this functionality as well as some other DMR enhancements. I can try to get you a test build with just this feature asap.

What if from my pileup result I remove rows for m modifications? ...

This won't work actually. Modkit checks that all of the bedMethyl records are present by making sure that valid_coverage = N-canonical + sum(N-modified) at each position.

ArtRand avatar Sep 24 '25 22:09 ArtRand

I have a similar problem, where dorado was run with 8 modifications. It would be great to be able to differentiate between effects! I understand that a higher score could mean both changes in overall modification levels and changes in modification type, am I wrong?

Say I'm looking at A, I have:

a - m6A 69426 - Am 17596 - I

I first filter for overall significance using balanced MAP p-value (and eventually balanced effect size), then check the score to try to differentiate between modifications. If 1 - exp(-score) is significant, given a certain threshold e.g. 0.5, then I assume that there is differences in modifications. Heuristically, to know which one, I'm looking at the different mod_percentages (sample A minus sample B), for each modification separately, and pick the largest effect.

I'm ultimately interested in differential modification status between conditions, so even if using the heuristic described above to filter out sites with non-significant differences in modification type, I still have to rely on the overall balanced MAP p-value and balanced effect size to report my results. It's like saying there is a significant difference in modification (of any kind) between two conditions, and that's the overall effect, but those sites are actually e.g. just m6A.

Does this make sense?

eboileau avatar Sep 29 '25 10:09 eboileau

Hello @eboileau!

I understand that a higher score could mean both changes in overall modification levels and changes in modification type, am I wrong?

Correct. The "score" is the log-likelihood ratio method docs here which allows for any number of modifications.

I first filter for overall significance using balanced MAP p-value (and eventually balanced effect size), then check the score to try to differentiate between modifications. If 1 - exp(-score) is significant, given a certain threshold e.g. 0.5, then I assume that there is differences in modifications. Heuristically, to know which one, I'm looking at the different mod_percentages (sample A minus sample B), for each modification separately, and pick the largest effect.

This seems reasonable as a way to explore the data. I have some ideas about how to derive a p-value for a score directly, but I need to try a few more experiments first.

I'm ultimately interested in differential modification status between conditions, so even if using the heuristic described above to filter out sites with non-significant differences in modification type, I still have to rely on the overall balanced MAP p-value and balanced effect size to report my results.

You lost me a little bit here. Are you saying that you use the algorithm described above to find sites, but you can't report on their significance because the MAP-based p-value/effect size doesn't reflect the change in intra-modification state?

I'll try and husstle and get a calculator of the MAP-based p-value out - just already going full speed on some pileup refactors..

ArtRand avatar Oct 01 '25 04:10 ArtRand

Hi @ArtRand Thanks for your quick reply.

Re last point, yeah... I guess that's what I was trying to say. Put it in another way, the score isn't really suitable as a measure of significance to report e.g. treatment vs. control, and similarly for the effect size that I calculate heuristically.

eboileau avatar Oct 01 '25 05:10 eboileau

Hi @ArtRand,

Can you please give as estimate of the build timeline, I will arrange my targets accordingly.

ArnavBharti avatar Oct 03 '25 15:10 ArnavBharti

@ArnavBharti I can get this to you next week.

ArtRand avatar Oct 07 '25 16:10 ArtRand

Hi @ArtRand,

Is there any update on the build?

ArnavBharti avatar Oct 16 '25 13:10 ArnavBharti

Hi @ArtRand,

I wanted to just clarify modkit dmr. If control has no methylation and the experiment sample has methylation at a particular position in sequence that will be considered DMR right?

PS. Any ETA on dmr enhancements in modkit?

ArnavBharti avatar Oct 24 '25 13:10 ArnavBharti

@ArnavBharti I can get this to you next week.

Hello @ArtRand

Is there any update on this build? It would be nice if this build could be available asap.

Thanks

Proy321 avatar Oct 25 '25 10:10 Proy321

Hello all, yes sorry for the delay.

ArtRand avatar Nov 05 '25 17:11 ArtRand

Hello @ArtRand

Is the build available now?

Thanks

Proy321 avatar Nov 06 '25 05:11 Proy321

Would be interested in this too. Thanks

jdcla avatar Nov 12 '25 16:11 jdcla