modkit Differential Methylation Analysis Using MODKIT: CPM Normalization and Handling Missing Data

Hello,

I am currently analyzing differential methylation using MODKIT and comparing results with call file-based analysis. My general strategy is as follows:

For each condition, I calculate:

Number of detected m6A modifications at a site/Number of transcripts for the corresponding gene I then compute differential methylation by taking the ratio of Treatment/CTR. To estimate transcript abundance, I used BAMBU, and I am comparing results based on both transcript counts and CPM normalization.

I have a few questions regarding the appropriateness of this approach:

Is it statistically appropriate to use CPM normalization for differential methylation analysis? Since CPM is commonly used for RNA-seq normalization, I am wondering if applying CPM in the context of m6A modification quantification is valid.
Handling missing modification calls between conditions Due to sequencing depth limitations, some m6A sites may be detected in Treatment but not in CTR, or vice versa. When this happens, the denominator in my calculation becomes zero, leading to NA values.

In RNA-seq differential expression analysis, pseudo counts are often added to handle this issue. However, in MODKIT, the output consists of discrete m6A modification counts rather than continuous expression levels. What would be an appropriate way to introduce pseudo counts in this context?

Should I discard transcripts with missing data? If pseudo counts are not suitable, analyzing differential m6A methylation becomes challenging due to data loss.

Would it be bioinformatically appropriate to filter out transcripts that lack modification calls in one of the conditions and only analyze transcripts with detected m6A modifications in both conditions? Or is there a better approach to retain more data without introducing bias? I am still learning bioinformatics, so I would greatly appreciate any insights or guidance on these issues.

Thank you in advance for your help!

Feb 27 '25 21:02 Seongmin-Jang-1165

Hello @Seongmin-Jang-1165

Sorry for the late response.

Number of detected m6A modifications at a site/Number of transcripts for the corresponding gene

Is this different than the percent methylation reported in the bedMethyl? One concern I would have is that instead of $\frac{\text{N}_{\text{mod}}}{\text{N}_{\text{valid}}}$ your calculation includes "fail" calls which tends to have a negative effect on model performance.

I then compute differential methylation by taking the ratio of Treatment/CTR.

There is a differential methylation module in Modkit, the details of the model can be found in the documentation. I can't really comment on your metric as an alternative, only that the metric in Modkit uses a Bayesian framework so that smaller differences in effect size will require more coverage to be get to the same level of significance as compared to a site with a larger effect size. But importantly, sites with low coverage will usually not be flagged up as significant since they don't have very much evidence.

Is it statistically appropriate to use CPM normalization for differential methylation analysis?

My temptation is to say no. What would happen if you only have one or two reads for each condition and the empirical effect size is 100% (all modified in one sample)? I think that using something like CPM would make this site rank highly when it really shouldn't (and will almost certainly happen by chance).

Handling missing modification calls between conditions Due to sequencing depth limitations, some m6A sites may be detected in Treatment but not in CTR, or vice versa. When this happens, the denominator in my calculation becomes zero, leading to NA values

In RNA-seq differential expression analysis, pseudo counts are often added to handle this issue. However, in MODKIT, the output consists of discrete m6A modification counts rather than continuous expression levels. What would be an appropriate way to introduce pseudo counts in this context?

Should I discard transcripts with missing data? If pseudo counts are not suitable, analyzing differential m6A methylation becomes challenging due to data loss.

I've been working on an idea for this - but I don't have a firm recommendation right now. On the other hand, if you use the HMM segmentation, it will have the effect of using neighboring sites to inform the status of sites that you may be missing data. It still only performs a calculation on the intersection of sites (sites missing in one sample will not be used), but the regions that are emitted may overlap sites with coverage in only one sample.

Would it be bioinformatically appropriate to filter out transcripts that lack modification calls in one of the conditions and only analyze transcripts with detected m6A modifications in both conditions?

It sounds like your workflow requires that all of the m6A sites have sufficient valid coverage at all positions? Maybe this is too strict of a requirement? It's hard for me to give you a recommendation here since filtering out data will probably lead to a bias (as you mentioned) - but maybe that's acceptable depending on your biological experiment. Is it possible to pool a few runs to get more coverage for all of the transcripts?

Mar 05 '25 11:03 ArtRand

@ArtRand Thank you for the advices!! have a good day

Mar 13 '25 21:03 Seongmin-Jang-1165