
Unable to change maximum coverage in dmr pair

ZhihanNUS opened this issue

Hello @ArtRand ,

Since the latest version of Modkit allows increasing the maximum coverage in dmr pair, I added --max-coverages to my analysis and ran it again.

```bash
modkit dmr pair \
  -a S1_m6A_DRACH.bed.gz \
  -a S2_m6A_DRACH.bed.gz \
  -a S3_m6A_DRACH.bed.gz \
  -a S4_m6A_DRACH.bed.gz \
  -b T1_m6A_DRACH.bed.gz \
  -b T2_m6A_DRACH.bed.gz \
  -b T3_m6A_DRACH.bed.gz \
  -b T4_m6A_DRACH.bed.gz \
  -o ${dmr_result} \
  --header \
  --ref ${ref} \
  --base A \
  --max-coverages 250 250 \
  --threads 36 \
  --log-filepath ${logfile}
```

However, in the log I find

```
running single-site analysis using default prior, Beta(α: 0.55, β: 0.55)
using specified max coverage values control 250 and experiment 250
calculated max coverage 1000 is greater than maximum allowed (100), setting to 100
calculated max coverage 1000 is greater than maximum allowed (100), setting to 100
running with replicates and matched samples
```

And the MAP-based p-values from this run were also higher than in the previous run with the old version, which allowed a maximum of 300. So it is basically the opposite of my intention to increase read usage. Is there anything wrong with my command?

Thanks!

ZhihanNUS (Apr 22 '25)

Hello @ZhihanNUS,

In version v0.4.4 I changed the maximum allowed coverage for the MAP-based p-value from 300 to 100. After some additional experiments, I decided that coverage beyond 100 is not useful for the calculation. Essentially, at valid coverage $\gt 100$ even very small effect sizes are significant. To provide an alternative metric, I added Cohen's h to all outputs, as well as the upper and lower bounds of its 95% confidence interval. The idea is that when you have very high coverage on single sites (or many sites, as you would with regions), it is more useful to know how meaningful the difference is. I like to use cohen_h_low as my ranking metric when I have differentially methylated regions or very high-coverage sites.
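For reference, here is a minimal sketch of how Cohen's h and an approximate 95% interval can be computed from two modification fractions. The counts are hypothetical and the confidence interval uses a standard large-sample approximation; modkit's exact computation may differ.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    # Cohen's h: difference between two proportions on the arcsine scale.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def cohens_h_ci(p1: float, n1: int, p2: float, n2: int, z: float = 1.96):
    # Approximate 95% CI using the large-sample variance of the
    # arcsine-transformed proportion, Var(2*asin(sqrt(p_hat))) ~ 1/n.
    h = cohens_h(p1, p2)
    se = math.sqrt(1.0 / n1 + 1.0 / n2)
    return h - z * se, h + z * se

# Hypothetical site: 60% vs 45% modified at ~200x valid coverage per condition.
h = cohens_h(0.60, 0.45)
low, high = cohens_h_ci(0.60, 200, 0.45, 200)
print(f"cohen_h={h:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Ranking by the lower bound is conservative: a site only scores highly when even the pessimistic end of the interval indicates a sizeable difference.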

There is nothing wrong with your command.

If you need the maximum to be reverted to 300 for your analysis, I can give you a build that reverts that change.

ArtRand (Apr 23 '25)

Hi @ArtRand ,

Thank you very much for the answer.

If I understand correctly, for my experimental setup, which is 4 samples vs 4 samples with high maximum coverage, I should focus on cohen_h_low to find differentially methylated sites?

I find that Cohen's h measures the distance between two proportions or probabilities, so it does not consider within-group dispersion, is that right? Then I wonder if the balanced MAP-based p-value can be an extra filter criterion for within-group variance.

> If you need the maximum to be reverted to 300 for your analysis, I can give you a build that reverts that change.

Yes, it would be very helpful to set it to 300 to maintain consistency with my previous analysis. I also wonder whether a manually set maximum coverage could be supported; I would like to try to find a good setting and filter criteria based on the MAP-based p-value and cohen_h.

Thank you very much!

ZhihanNUS (Apr 23 '25)

Hello @ZhihanNUS,

Sorry for being slow to get back, but this is actually an area that I've been thinking a lot about.

> If I understand correctly, for my experimental setup, which is 4 samples vs 4 samples with high maximum coverage, I should focus on cohen_h_low to find differentially methylated sites?

Yes that's correct.

> I find that Cohen's h measures the distance between two proportions or probabilities, so it does not consider within-group dispersion, is that right?

Also correct.

> Then I wonder if the balanced MAP-based p-value can be an extra filter criterion for within-group variance.

The MAP-based p-value doesn't really model within-group variance either. The approach I took in Modkit was to combine the replicates, which essentially makes them into a (coverage-)weighted average. With very high coverage, the posterior beta distribution for each condition almost becomes a point mass. Then you calculate the distribution of the difference of those distributions, which will have essentially zero mass at effect size 0.0.
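As a rough numerical illustration of that behaviour (the counts below are made up, and this is only a sketch of the pooled-Beta idea, not modkit's actual implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical pooled counts over 4 replicates per condition: (modified, valid coverage).
mod_a, cov_a = 720, 4000
mod_b, cov_b = 960, 4000

# Beta(0.55, 0.55) prior, as in the default shown in the log above.
a0 = b0 = 0.55
post_a = stats.beta(a0 + mod_a, b0 + (cov_a - mod_a))
post_b = stats.beta(a0 + mod_b, b0 + (cov_b - mod_b))

# With pooled coverage this high, each posterior is very narrow ("almost a point mass"),
# so the Monte Carlo distribution of the difference has almost no mass near 0.
diff = post_b.rvs(size=100_000, random_state=rng) - post_a.rvs(size=100_000, random_state=rng)
print(f"mean difference: {diff.mean():.3f}")
print(f"P(|difference| <= 0.01): {np.mean(np.abs(diff) <= 0.01):.4f}")
```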

There are fancier techniques you could use. For example, you could make a mixture of Beta distributions where each replicate picks a component and draws a $p$ from that component. Alternatively, you could make a hierarchical model. I've worked a few of these out, and what you often get is something that looks an awful lot like combining the replicates.
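For instance, a toy generative version of that mixture idea might look like the following; all parameters here are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-component Beta mixture over per-replicate modification probabilities.
weights = np.array([0.7, 0.3])            # mixing proportions (assumed)
components = [(5.0, 45.0), (30.0, 20.0)]  # (alpha, beta) for each component (assumed)

def simulate_replicate(coverage: int) -> int:
    # Each replicate picks a component, draws its own p, then emits binomial counts.
    k = rng.choice(len(weights), p=weights)
    alpha, beta = components[k]
    p = rng.beta(alpha, beta)
    return rng.binomial(coverage, p)

counts = [simulate_replicate(250) for _ in range(4)]
print(counts)
```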

What you may want to do (and something I've been looking at as well) is look at the "balanced" MAP-based p-value, which throws away the fact that there are differences in the coverages in the replicates and treats them each as equally measuring the underlying probability of modification.
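To illustrate the distinction with made-up per-replicate counts (this is only a sketch of the idea, not the exact computation):

```python
import numpy as np

# Hypothetical per-replicate (modified, valid coverage) counts at one site,
# with very uneven depth across replicates.
reps = [(30, 300), (4, 20), (25, 250), (3, 15)]

# Coverage-weighted (pooled) estimate: deep replicates dominate.
pooled_p = sum(m for m, n in reps) / sum(n for m, n in reps)

# "Balanced" estimate: each replicate's fraction counts equally, regardless of depth.
balanced_p = float(np.mean([m / n for m, n in reps]))

print(f"pooled: {pooled_p:.3f}, balanced: {balanced_p:.3f}")
```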

Like I say, I'm still mulling over some ideas myself so I'd be happy to hear any suggestions or specific use cases.

ArtRand (Apr 26 '25)

Hi @ArtRand ,

Thank you very much for the reply. It inspired me a lot.

> What you may want to do (and something I've been looking at as well) is look at the "balanced" MAP-based p-value, which throws away the fact that there are differences in the coverages in the replicates and treats them each as equally measuring the underlying probability of modification.

Yes, currently for my analysis I focus on the balanced MAP-based p-value and effect size to find not only differentially regulated sites, but also a general trend of modification between the two conditions (for example, one condition should be hypermethylated based on the literature or previous experimental results). That's why I want to adjust the maximum coverage to find a setting that balances the number of significant sites against their effect sizes. I think even sites with small effect sizes, like 0.02 to 0.05, would contribute to the general modification trend if there are many of them showing this up-regulation, and I don't want to filter them out.

When I want to validate some interesting targets, I will use cohen_h_low as my ranking metric, so there is no conflict in calculating the balanced MAP-based p-value with a relatively high maximum coverage.
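As a sketch of that workflow (the file name and column names below are assumptions and should be checked against the --header line of the actual dmr pair output; the thresholds are arbitrary):

```python
import pandas as pd

# Load the single-site dmr pair output written with --header.
dmr = pd.read_csv("dmr_result.tsv", sep="\t")

# Keep the broad trend: significant by the balanced MAP-based p-value,
# even when the effect size is small (e.g. 0.02-0.05).
trend = dmr[(dmr["balanced_map_pvalue"] < 0.05) & (dmr["balanced_effect_size"] > 0.02)]

# Rank candidate validation targets conservatively by the lower bound of Cohen's h.
targets = dmr.sort_values("cohen_h_low", ascending=False).head(50)
```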

So I wonder whether it is possible to give me a build that allows the maximum coverage to be set manually, or one that reverts it to 300?

Thank you very much!

ZhihanNUS (Apr 28 '25)

Hi @ArtRand ,

Is there any update on this issue? Could I have a build that allows the maximum coverage to be set manually?

ZhihanNUS (May 16 '25)