modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Clarification about reported strand needed

Open lubitelpospat opened this issue 5 months ago • 2 comments

Hi folks, Thank you for developing modkit! I was playing with some human data recently, and while doing so, I noticed an unusually high number of m6A's in bedMethyl produced by modkit pileup (version 0.4.4., but the changelog for the most recent one doesn't mention any changes in this area, so I assume everything should be the same in 0.5). I started digging in, and noticed that some of the m6A's are reported as being on the + strand (which, well, implies that the reference genome - I'm using Ensembl release 113, dna primary_assembly - should have an A at that position), while the reference genome has T there. Here's an example fragment of such bedMethyl:

MT      9939    9940    a       6       +       9939    9940    255,0,0 6       33.33   2       3       1       6       319     7558    0
MT      9939    9940    m       3       +       9939    9940    255,0,0 3       0.00    0       3       0       6       319     7561    0
MT      9939    9940    17596   6       +       9939    9940    255,0,0 6       16.67   1       3       2       6       319     7558    0
MT      9939    9940    17802   7554    +       9939    9940    255,0,0 7554    0.57    43      7511    0       6       319     10      0

Another thing you can notice is that pseudouridine is also reported on + strand (which is fine), but I also get "m" here - AFAIK BAM file format doesn't/shouldn't report "m" for A? I do get modifications on - strand as well, so not all mods are reported on "+". I am probably missing something, so I would highly appreciate your feedback for this. Many thanks!

Update: I noticed that in some (not big) fraction of cases, m6A is also reported on sites where the reference has C and G nucleotides as well.

lubitelpospat avatar Jul 03 '25 01:07 lubitelpospat

Hello @lubitelpospat,

This is similar to the another question.

tl;dr: For a given position modkit pileup reports all base modifications (including unmodified) where there is at least one valid read, a read with a base modification probability $\ge$ the pass-threshold.

When you have very deep coverage (looks like 7.5K here) you'll inevitably get a few reads with mismatches and confident base modification probabilities. You can filter these out by requiring that valid_coverage is above a certain number but I appreciate this is difficult to know a priori. I'm working to add an --only-matches flag that will discard mismatches as failed reads.

ArtRand avatar Jul 07 '25 22:07 ArtRand

Hi @@ArtRand, Thank you very much for a clarification! Yes, it would be very useful to have the --only-matches flag. As always - thank you very much guys for your work!

lubitelpospat avatar Jul 29 '25 22:07 lubitelpospat