modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Modkit extract full 5hmCG + 5mCG > 1

Open amurray411 opened this issue 5 months ago • 2 comments

Hello. I've been using modkit extract (now extract full) and adjust-mods for over a year now for methylation-based classification of human tumor samples, and I recently ran into a problem with modkit extract full where at a particular site on a particular read, the modkit-produced PhmC and PmC add up to greater than 1, and specifically, these values are roughly 1.003906. While I can shift these probabilities to be a proportion of the total h+m value, I would prefer to understand why they do not simply add up to 1, with or without some floating point error.

I'm doing a global methylation analysis whereby I want to identify which of 5hmC, 5mC, or canonical C is most likely for every CG site in the genome, starting with for every read. The total probability greater than 1 is not necessarily an issue in this case, but once I start averaging across reads to get the most likely probability for each CG site, it makes a difference.

Here is a screen shot of my data frame in R, generated by simply rearranging the output of modkit extract full to decrease the number of rows by half, and then for highlighting this issue, filtered based on h+m values greater than 1.

Image

Also, here is a table of the times this problem occurs for one particular sample, with the values listed being my calculations for the canonical C probabilities.

Image

For background, I am retrospectively using the dorado basecaller in super accurate mode, calling only 5hmC and 5mC in CG contexts, and while I am running in duplex mode for the above example, I am eliminating rows in the modkit extract full output that have duplex data. Essentially, I am treating each strand of duplex reads as a unique read, so I expect that each strand should have modification probabilities that are independent of the other strand of the duplex. Accordingly, I also observe this 1.003906 phenomenon in samples run with live simplex basecalling.

I feel that I have read and understand all relevant documentation about modkit extract full and its parameters, but I'm worried that I'm missing some key knowledge of what the code is doing. Any explanation you can provide would be greatly appreciated.

Thank you.

amurray411 avatar Jul 09 '25 19:07 amurray411

Hello @amurray411,

Thank you for bringing this to my attention.

This happens because Modkit will take the quality value in the ML tag which is encoded in 255 bins and adjust the probability to be in the "center" of the bin. So if the canonical probability is very close to zero, you can get a final value greater than 1.0. This is sloppy and should be fixed. Fortunately, I don't think this would have any effect on actual modification calling since the argmax modification will still be the same and the adjustment is ~0.4%. It would help me out a lot if you could:

  1. Attach a modBAM with one of these offending reads, aff9c533 seems like a good candidate since it has two such errors.
  2. Let me know exactly which dorado version and model versions you're using.

I'll get a fix asap.

ArtRand avatar Jul 09 '25 21:07 ArtRand

Thank you so much for the quick response. That makes a lot of sense.

  1. I've generated a bam file with only aff9c533, but was not allowed to attach it here so I created a new repo with only the bam file inside it. https://github.com/amurray411/Issues
  2. The model version is included in the header of this bam file ([email protected]), and the version of dorado for this sample is 0.6.0. As of recently, I've been using dorado 0.8.3 but with the same basecalling model.

Let me know if you need any other information or if the file is not formatted correctly.

amurray411 avatar Jul 10 '25 00:07 amurray411