
how to remove DMR segment output with low percentage of samples used in statistical test

Open eesiribloom opened this issue 2 months ago • 2 comments

When calling DMRs with the --segment option, I see a lot of overlap between the outputs of two different tests. By that I mean I looked at differential methylation for different mutational events in the same cohort. A large proportion of the DMRs from both tests overlap in the segment output files, and on closer inspection I wonder whether this comes from sites where only a small number of samples are included in the test.

For example, from one test (control n=13, test n=7):

chr21 5763718 5764316 different 54.711902912199605 9 m:1 m:56 m:1.89 m:90.32 0.018867925 0.9032258 -0.88435787 -2.233330110165479 1.8666703655187038 2.599989854812254

And from another test, looking at DMRs for a different condition (control n=10, test n=10):

chr21 5763718 5764316 different 54.711902912199605 9 m:56 m:1 m:90.32 m:1.89 0.9032258 0.018867925 0.88435787 2.233330110165479 1.8666703655187038 2.599989854812254
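The first row above can be used for a rough sanity check of how much data sits behind the segment. The sketch below assumes the `m:1` / `m:56` fields are modified-call counts pooled over the whole segment and that the two fractions that follow are modified calls divided by valid calls; that column interpretation is an assumption on my part, not something confirmed in this thread. Under that assumption, dividing count by fraction recovers the implied number of valid calls per group.

```python
# Rough sanity check on the first segment row quoted above.
# ASSUMPTION: fields 7-8 are modified-call counts pooled over the segment,
# and fields 11-12 are modified / valid call fractions for groups a and b.

def implied_valid_coverage(mod_count, fraction_modified):
    """Back out the total number of valid calls from count and fraction."""
    return round(mod_count / fraction_modified)

row = (
    "chr21 5763718 5764316 different 54.711902912199605 9 "
    "m:1 m:56 m:1.89 m:90.32 0.018867925 0.9032258 "
    "-0.88435787 -2.233330110165479 1.8666703655187038 2.599989854812254"
).split()

n_sites = int(row[5])                  # number of sites in the segment
a_mod = int(row[6].split(":")[1])      # modified calls, group a
b_mod = int(row[7].split(":")[1])      # modified calls, group b
a_frac = float(row[10])
b_frac = float(row[11])

a_cov = implied_valid_coverage(a_mod, a_frac)
b_cov = implied_valid_coverage(b_mod, b_frac)

# Group sizes as stated for the first test (control n=13, test n=7).
n_a, n_b = 13, 7
a_calls_per_sample_site = a_cov / (n_sites * n_a)
b_calls_per_sample_site = b_cov / (n_sites * n_b)

print(a_cov, b_cov)
print(round(a_calls_per_sample_site, 2), round(b_calls_per_sample_site, 2))
```

If the interpretation holds, both groups average fewer than one valid call per sample per site across the segment, which would be consistent with only a subset of samples contributing data.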

Is there a way to include only sites that use at least x% of the samples in each condition, similar to pct_a_samples and pct_b_samples in the single-site analysis?

eesiribloom avatar Oct 08 '25 18:10 eesiribloom

Hello @eesiribloom,

Just to make sure I'm understanding you correctly: you're running DMR with --segment and with various numbers of test and control cases. Some of the segments are reported using data from only a subset of the cases, so when you change the input samples, these segments are unchanged since they aren't represented in the cases you're modulating. Correct?

Short answer is that there (currently) isn't a way to require that a segment contains data from >= x% of the samples. I can see why this would be helpful, but it's a feature I'd need to implement. Give me a few days to come up for air and I'll try and get you a test build.

ArtRand avatar Oct 15 '25 01:10 ArtRand

Hi @ArtRand. I'm coming at this from a cancer standpoint, but this could apply generally to any cohort of samples. There might be multiple different A vs B conditions you want to test for DMRs among the same set of samples, such as age, sex, a mutational signature, different molecular subtypes, or exposure to some environmental factor like smoking or asbestos. Anything, really.

If I understand correctly, when running modkit dmr pair with multiple samples, not every sample in each group has to be considered for a site to be output as a DMR?

If that is true, then consider this example. Say you have a cohort of 20: 10 males and 10 females, with 7 from molecular subtype A and 13 from molecular subtype B. Say 3 male and 3 female samples have a strong effect size/difference in methylation at a particular region. This might be identified as a DMR for sex, provided the cumulative coverage over that region is sufficient. But if those same samples also happen to fall into molecular subtypes A and B respectively, the region would also be output as a DMR for subtype. Essentially, without being able to control what % of the samples are used for the test, the DMRs have high sensitivity but low specificity. Hope this makes sense and that I've understood correctly.

I suppose you could increase the minimum coverage required for a DMR to be called, but it would be helpful to at least get some metric, such as pct_a_samples, to filter segments/DMRs downstream.
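Until something like that exists for segments, one possible workaround is to join the segment output back to the single-site output and filter there. The following is a minimal sketch only: it assumes the single-site rows have already been parsed into dicts carrying pct_a_samples and pct_b_samples on a 0-100 scale, and the `chrom`/`start` keys, the thresholds, the function name, and the toy data are all hypothetical placeholders rather than anything modkit provides.

```python
# Sketch of a downstream filter: keep a segment only if enough of its sites
# were tested with a minimum percentage of samples from BOTH conditions.
# ASSUMPTIONS: dict keys, thresholds, and the toy rows are hypothetical;
# pct_a_samples / pct_b_samples are taken to be percentages (0-100).

def well_supported_segments(segments, sites, min_pct=50.0, min_site_frac=0.5):
    """Return segments where at least min_site_frac of the contained sites
    have pct_a_samples and pct_b_samples both >= min_pct."""
    kept = []
    for chrom, start, end in segments:
        inside = [
            s for s in sites
            if s["chrom"] == chrom and start <= s["start"] < end
        ]
        if not inside:
            continue  # no single-site evidence at all, so drop the segment
        supported = sum(
            1 for s in inside
            if s["pct_a_samples"] >= min_pct and s["pct_b_samples"] >= min_pct
        )
        if supported / len(inside) >= min_site_frac:
            kept.append((chrom, start, end))
    return kept


# Toy, made-up data purely to exercise the function.
segments = [("chr21", 100, 200), ("chr21", 300, 400)]
sites = [
    {"chrom": "chr21", "start": 110, "pct_a_samples": 90.0, "pct_b_samples": 80.0},
    {"chrom": "chr21", "start": 150, "pct_a_samples": 70.0, "pct_b_samples": 60.0},
    {"chrom": "chr21", "start": 310, "pct_a_samples": 15.0, "pct_b_samples": 30.0},
    {"chrom": "chr21", "start": 350, "pct_a_samples": 20.0, "pct_b_samples": 90.0},
]

kept = well_supported_segments(segments, sites)
print(kept)
```

The linear scan over sites is fine for a toy example; for genome-wide output an interval index or a sorted sweep would be the more sensible choice.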

Hope that makes sense

eesiribloom avatar Oct 15 '25 21:10 eesiribloom