
Clarification on Interpreting 5mC_5hmC Output from Modkit

Open Proy321 opened this issue 3 months ago • 5 comments

Hello @ArtRand

Image

In this attached example, at position 813563–813564, I notice that in column 11 the value is reported as h:20.00, m:40.00. Does this indicate that the proportion of h (5hmC) is lower compared to m (5mC), and therefore this site should be interpreted as having a higher likelihood of 5mC methylation rather than 5hmC modification?

Alternatively, is it the case that there are about 5 reads overall covering this position, where I can see 2 reads supporting 5mC and 1 read supporting 5hmC—and within the 5mC calls, one might be interpreted as converted to 5hmC while the other remains as 5mC only?

Could you please clarify the correct way to interpret these values?

Thanks

Proy321 avatar Sep 07 '25 18:09 Proy321

Hello @ArtRand

It would be nice to have your insights on the above query posted.

Thanks & Regards Priyanka Roy

Proy321 avatar Sep 10 '25 11:09 Proy321

Hello @Proy321 sorry about the delay.

> Does this indicate that the proportion of h (5hmC) is lower compared to m (5mC), and therefore this site should be interpreted as having a higher likelihood of 5mC methylation rather than 5hmC modification?

Yes, however with only 5 reads it is difficult to say confidently how large this difference is.

> Alternatively, is it the case that there are about 5 reads overall covering this position

Not quite, there are exactly 5 reads with valid calls overlapping this position.

> where I can see 2 reads supporting 5mC and 1 read supporting 5hmC

Yes.

> and within the 5mC calls, one might be interpreted as converted to 5hmC while the other remains as 5mC only

No, there are 5 calls: 2 are 5mC, 1 is 5hmC, and 2 are unmodified.
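To make the arithmetic concrete, here is a minimal sketch of how the percentages in this example follow from the call counts. It assumes each percentage is the number of calls for that category divided by the number of valid calls, which is consistent with the numbers in this thread (2 of 5 gives 40%, 1 of 5 gives 20%). The Wilson score interval is not something reported in the output discussed here; it is added only to illustrate why 5 reads give a very wide range. The function names are hypothetical.

```python
from math import sqrt


def percent_modified(counts):
    """Return each category's share of the valid calls, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Counts from the example site: 5 valid calls in total.
calls = {"m": 2, "h": 1, "unmodified": 2}
pct = percent_modified(calls)
print(pct)

# Uncertainty on the 5mC fraction with only 5 valid calls.
low, high = wilson_interval(calls["m"], sum(calls.values()))
print(f"5mC fraction: {low:.2f} to {high:.2f}")
```

With these counts the interval for the 5mC fraction spans roughly 0.12 to 0.77, so the 40% versus 20% difference should be read with caution at this coverage.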

ArtRand avatar Sep 10 '25 22:09 ArtRand

In addition to the clarification above, @ArtRand, I have a few more queries. Let’s say at a given site I have 16 total calls, with 2 reads supporting 5mC and 8 reads supporting 5hmC. So the percent modified is 12.5% for 5mC and 50% for 5hmC. In this case, should I consider the site 5hmC-modified rather than 5mC-modified, because the percent modified for 5hmC is higher than for 5mC at that position?

As a second scenario, suppose there are 16 total calls at a site, with 8 reads supporting 5mC and another 8 supporting 5hmC. How should this be interpreted? Since both modifications appear at 50%, it seems as if the same site carries both 5mC and 5hmC, but biologically, how can a single site carry both modifications at the same time? How should I interpret such a case, where the calls are split 50/50 between 5mC and 5hmC?

Proy321 avatar Sep 12 '25 06:09 Proy321

Hello @ArtRand

It would be nice to have your insights on the above query posted.

Thanks & Regards Priyanka Roy

Proy321 avatar Sep 13 '25 11:09 Proy321

Hello @Proy321 sorry about the delay.

The interpretation of the reads will always depend (at least a little) on the biological system you're studying. The sequencer and the analysis software are just measurement tools. I'll try to give some intuition based on the best of my understanding of mammalian DNA methylation of cytosine.

I'm sure you're already familiar with DNA methylation but as a quick level-set:

DNA methylation results from methyltransferase enzymes, which methylate DNA (going from unmodified cytosine to 5mC). The TET oxidation pathway can then transform 5mC residues into 5hmC and other oxidized products, which may be converted back into unmodified cytosines. There are a bunch of reviews and studies that detail this pathway.

If you sequence an ensemble of cells you may capture residues at various stages of this pathway. One simplifying model is to think of every read as reporting a draw from a categorical distribution. Some positions may be {p_5mC = 0.98, p_unmod = 0.01, p_5hmC = 0.01}, so most of the reads will probably report 5mC. On the other hand, some positions may have a parameter set more like what you've described: {p_5mC = 0.5, p_5hmC = 0.5, p_unmod = 0.0}, so you're more likely to observe reads like those in your second scenario.
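As a rough illustration of this model, the sketch below draws one call per read from a categorical distribution and tallies the result. The function name, the seed, and the choice of 16 reads (taken from the second scenario) are assumptions for illustration only; this is not how any tool computes its output.

```python
import random
from collections import Counter


def simulate_calls(probs, n_reads, seed=0):
    """Draw one modification call per read from a categorical distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n_reads)
    return Counter(draws)


# Parameter set resembling the second scenario in the question.
site = {"5mC": 0.5, "5hmC": 0.5, "unmod": 0.0}
counts = simulate_calls(site, n_reads=16)
print(counts)
```

Changing the seed shows how much the observed split can vary between samples of 16 reads even when the underlying probabilities are exactly 50/50.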

All this is to say, I wouldn't get stuck trying to think of each position as carrying exactly one kind of modified base. Think of it instead as a probability distribution over the categories. Each read is a molecule from a cell, and its call is an observation of a draw from that probability distribution.

Does that make sense?

ArtRand avatar Sep 16 '25 04:09 ArtRand