
More precision on how are base modification probabilities calculated with Modkit

OceaneMion opened this issue 1 month ago · 4 comments

Hi,

I have read the documentation on base modification probability calculation, but some points are still not completely clear to me. I would really appreciate it if you could help clarify a few things.

From my understanding:

The MM tag encodes the positions of modified bases (e.g., methylated cytosines).

The ML tag encodes, in 8 bits (0–255), the likelihood that the modification in the MM tag is present. For example, if a value in ML is 100, does that correspond to 100/256 ≈ 39% probability that the base is actually methylated?

Based on the Modkit pileup, I understand that the MM tag gives the position and ML the modification probability.

A concrete example I am unsure about:

If in the MM tag a value of 1 indicates methylation on the second cytosine, and the ML values for 5mC and 5hmC are both 80/256 ≈ 31.25%, how is the base classified? Would it be classified as canonical, as that is the highest probability? Therefore, does Modkit take the modification with the highest probability for classification (canonical, 5mC, 5hmC)?

Regarding aggregation across reads, I am trying to confirm my understanding:

Suppose I have 3 reads that have a methylated cytosine at the same genomic position:

5mC probabilities: 30%, 60%, 80%

5hmC probabilities: 60%, 10%, 10%

For modkit pileup with a threshold for modified C set at 0.8, is the mean calculated as:

5mC: probability of 5mC in reads that pass the threshold / total reads that pass the threshold = 0.8 / 1 = 0.8

5hmC: 0.1 / 1 = 0.1

→ Base classified as 5mC?

Without filtering (no threshold), would it be:

5mC: (0.3+0.6+0.8)/3 = 0.567

5hmC: (0.6+0.1+0.1)/3 = 0.267

Canonical: 1 - 0.567 - 0.267 = 0.166 → Base also classified as 5mC?

Finally, do you consider it essential to apply a threshold for analyzing methylation data as it will remove information from the reads, and to check the distribution of probabilities using sample-probs, as suggested in the documentation?

Sorry for the long post, but I want to make sure I understand how Modkit calculates methylation probabilities and how to analyze the data correctly.

Thank you for your help!

OceaneMion avatar Nov 06 '25 12:11 OceaneMion

Hello @OceaneMion,

The MM tag encodes the positions of modified bases (e.g., methylated cytosines).

Correct

The ML tag encodes, in 8 bits (0–255), the likelihood that the modification in the MM tag is present. For example, if a value in ML is 100, does that correspond to 100/256 ≈ 39% probability that the base is actually methylated?

Almost. As I recently mentioned in this issue, the ML values represent intervals of probability. In Modkit the conversion is $p = \frac{q + 0.5}{256}$, where $p$ is the probability and $q$ is the ML "qual" value. This probability is the value output by the model for this class, so if the MM tag is C+h then a value of 100 corresponds to a probability of 0.39257813 ($\frac{100 + 0.5}{256}$). We try our best to calibrate the model outputs so that a probability of, say, 0.8 means that the base modification is actually the reported class 80% of the time. Continuing with our example, an ML value of 100 (probability of 0.39) in a classification task of 5mC, 5hmC, and unmodified is pretty close to $\frac{1}{N_{\text{classes}}} = 0.33$, so the model isn't very confident in the base modification status of this base in this read.
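The conversion above is simple enough to sketch in a few lines of Python (the function name is mine, not Modkit's API):

```python
# Sketch of the ML-value-to-probability conversion described above:
# p = (q + 0.5) / 256, the midpoint of the probability interval
# that the 8-bit ML "qual" value q encodes.
def ml_to_prob(q):
    """Convert an 8-bit ML value (0-255) to a probability."""
    if not 0 <= q <= 255:
        raise ValueError("ML values are 8-bit: 0..255")
    return (q + 0.5) / 256

# An ML value of 100 maps to ~0.3926, close to the 1/3 "uninformed"
# level for a three-way 5mC / 5hmC / unmodified classification.
print(ml_to_prob(100))  # 0.392578125
```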

If in the MM tag a value of 1 indicates methylation on the second cytosine, and the ML values for 5mC and 5hmC are both 80/256 ≈ 31.25%, how is the base classified? Would it be classified as canonical as it is the highest probability?

That's correct, although the canonical probability in this case, $1 - 0.3125 - 0.3125 = 0.375$, will likely fall below the filter threshold since it's not substantially greater than $\frac{1}{3}$.

Therefore, does Modkit take the modification with the highest probability for classification (canonical, 5mC, 5hmC)?

Correct, the most likely class where the probability of that class is $\geq$ the threshold value. There are some examples here.
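The per-read call rule just described (argmax class, kept only if its probability clears the threshold) can be sketched as follows; the function and class names here are illustrative, not Modkit's internal API:

```python
# Hedged sketch of the per-read call rule: take the most likely class,
# then filter the call out if its probability is below the threshold.
def call_base(probs, threshold):
    """probs: dict mapping class name -> probability (sums to ~1).
    Returns the winning class, or None if the call fails the filter."""
    cls, p = max(probs.items(), key=lambda kv: kv[1])
    return cls if p >= threshold else None

# The example from the question: 5mC and 5hmC both 0.3125, canonical 0.375.
probs = {"5mC": 0.3125, "5hmC": 0.3125, "unmodified": 0.375}
print(call_base(probs, threshold=0.8))   # None: canonical wins, but 0.375 < 0.8
print(call_base(probs, threshold=0.33))  # unmodified
```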

Regarding aggregation across reads, I am trying to confirm my understanding: Suppose I have 3 reads that have a methylated cytosine at the same genomic position: 5mC probabilities: 30%, 60%, 80%; 5hmC probabilities: 60%, 10%, 10%. For modkit pileup with a threshold for modified C set at 0.8, is the mean calculated as: 5mC: probability of 5mC in reads that pass the threshold / total reads that pass the threshold = 0.8 / 1 = 0.8; 5hmC: 0.1 / 1 = 0.1 → base classified as 5mC? Without filtering (no threshold), would it be: 5mC: (0.3 + 0.6 + 0.8)/3 = 0.567; 5hmC: (0.6 + 0.1 + 0.1)/3 = 0.267; canonical: 1 − 0.567 − 0.267 = 0.166 → base also classified as 5mC?

Not quite, pileup aggregates calls not probabilities. So using your example, the algorithm is this:

| Read | p_5mC | p_5hmC | p_unmodified | Call |
|------|-------|--------|--------------|------|
| 1 | 0.3 | 0.6 | 0.1 | 5hmC → fail, since 0.6 < 0.8 |
| 2 | 0.6 | 0.1 | 0.3 | 5mC → fail, since 0.6 < 0.8 |
| 3 | 0.8 | 0.1 | 0.1 | 5mC → pass, since 0.8 ≥ 0.8 |

So the record at this location would show:

| Field | Count |
|-------|-------|
| N_modified | 1 |
| Mod code | m |
| N_other | 0 |
| N_canonical | 0 |
| N_valid_coverage | 1 |
| N_filtered | 2 |

Without filtering the counts would look like this:

| Field | Count |
|-------|-------|
| N_modified | 2 |
| Mod code | m |
| N_other | 1 (this is the 5hmC count) |
| N_canonical | 0 |
| N_valid_coverage | 3 |
| N_filtered | 0 |
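The whole aggregation (calls, not probabilities) can be sketched end to end. This is an illustrative re-implementation of the logic described above, not Modkit code; the field names mirror the tables and the reads are the three-read example from the question:

```python
# Sketch of pileup aggregation: each read gets an argmax call, calls below
# the threshold are filtered, and the survivors are counted per position.
def pileup_counts(reads, threshold):
    """reads: list of dicts mapping class -> probability at one position."""
    counts = {"n_mod": 0, "n_other": 0, "n_canonical": 0, "n_fail": 0}
    for probs in reads:
        cls, p = max(probs.items(), key=lambda kv: kv[1])
        if p < threshold:
            counts["n_fail"] += 1       # filtered: excluded from valid coverage
        elif cls == "5mC":
            counts["n_mod"] += 1        # the record's primary mod code (m)
        elif cls == "unmodified":
            counts["n_canonical"] += 1
        else:
            counts["n_other"] += 1      # e.g. a 5hmC call on a 5mC record
    counts["n_valid"] = len(reads) - counts["n_fail"]
    return counts

reads = [
    {"5mC": 0.3, "5hmC": 0.6, "unmodified": 0.1},
    {"5mC": 0.6, "5hmC": 0.1, "unmodified": 0.3},
    {"5mC": 0.8, "5hmC": 0.1, "unmodified": 0.1},
]
print(pileup_counts(reads, threshold=0.8))
# {'n_mod': 1, 'n_other': 0, 'n_canonical': 0, 'n_fail': 2, 'n_valid': 1}
print(pileup_counts(reads, threshold=0.0))
# {'n_mod': 2, 'n_other': 1, 'n_canonical': 0, 'n_fail': 0, 'n_valid': 3}
```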

Finally, do you consider it essential to apply a threshold for analyzing methylation data as it will remove information from the reads, and to check the distribution of probabilities using sample-probs, as suggested in the documentation?

For most analyses, especially cytosine methylation, using the defaults in Modkit is recommended. For routine quality control, looking at the output distributions from sample-probs is a good idea.

Sorry for the long post, but I want to make sure I understand how Modkit calculates methylation probabilities and how to analyze the data correctly.

No problem! Sorry for the delay getting back to you.

ArtRand avatar Nov 13 '25 17:11 ArtRand

Thanks a lot, it is much clearer to me now, but I was wondering about cases where I have very low Nvalid_coverage. For example, in my samples I have a lot of positions with Nvalid_cov = 1. In such situations, the methylation calculation becomes highly biased toward false positives.

Do you think it would be useful to incorporate Nfail or Ndiff into the calculation? For example, instead of using Nmod / Nvalid_cov, do Nmod / (Nvalid_cov + Nfail + Ndiff), or include other parameters?

I have seen some papers use this kind of metric instead of Modkit's usual calculation; do you think it is relevant?

OceaneMion avatar Nov 14 '25 10:11 OceaneMion

It seems that when I specify the cpg flag my coverage is a lot better; is it because Modkit tries to predict sites other than CpG when the cpg flag is not set?

OceaneMion avatar Nov 14 '25 11:11 OceaneMion

Hello @OceaneMion,

Do you think it would be useful to incorporate Nfail or Ndiff into the calculation? For example, instead of using Nmod / Nvalid_cov, do Nmod / (Nvalid_cov + Nfail + Ndiff), or include other parameters?

I can't recommend using this metric. If you have low valid coverage, the %-methylation will be dominated by sampling bias. However, depending on your biological question, you may be able to use regional methylation information, since methylation levels can be locally correlated.

It seems that when I specify the cpg flag my coverage is a lot better; is it because Modkit tries to predict sites other than CpG when the cpg flag is not set?

If you've performed "all-context" cytosine methylation calling you will get a base modification call at each cytosine in each read. If your sample only has methylation at CpGs (such as humans, for the most part) then the 5mC/5hmC methylation calls at CH dinucleotides may have more false positives.
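To illustrate why restricting to CpG context changes the number of reported positions, here is a small sketch (my own illustration, not Modkit code): all-context calling reports every cytosine, while a CpG restriction keeps only cytosines followed by G.

```python
# Illustrative sketch: all-context calling reports every C in the sequence;
# a CpG restriction keeps only the Cs in CG dinucleotide context.
def cytosine_positions(seq, cpg_only):
    """Return 0-based positions of cytosines on the forward strand."""
    seq = seq.upper()
    return [i for i, b in enumerate(seq)
            if b == "C" and (not cpg_only or seq[i:i + 2] == "CG")]

seq = "ACGTCCACGT"
print(cytosine_positions(seq, cpg_only=False))  # [1, 4, 5, 7] -- every C
print(cytosine_positions(seq, cpg_only=True))   # [1, 7] -- only CpG context
```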

ArtRand avatar Nov 20 '25 04:11 ArtRand