Clarification about reported strand needed
Hi folks, Thank you for developing modkit! I was playing with some human data recently, and while doing so, I noticed an unusually high number of m6A's in bedMethyl produced by modkit pileup (version 0.4.4., but the changelog for the most recent one doesn't mention any changes in this area, so I assume everything should be the same in 0.5). I started digging in, and noticed that some of the m6A's are reported as being on the + strand (which, well, implies that the reference genome - I'm using Ensembl release 113, dna primary_assembly - should have an A at that position), while the reference genome has T there. Here's an example fragment of such bedMethyl:
MT 9939 9940 a 6 + 9939 9940 255,0,0 6 33.33 2 3 1 6 319 7558 0
MT 9939 9940 m 3 + 9939 9940 255,0,0 3 0.00 0 3 0 6 319 7561 0
MT 9939 9940 17596 6 + 9939 9940 255,0,0 6 16.67 1 3 2 6 319 7558 0
MT 9939 9940 17802 7554 + 9939 9940 255,0,0 7554 0.57 43 7511 0 6 319 10 0
Another thing you can notice is that pseudouridine is also reported on + strand (which is fine), but I also get "m" here - AFAIK BAM file format doesn't/shouldn't report "m" for A? I do get modifications on - strand as well, so not all mods are reported on "+". I am probably missing something, so I would highly appreciate your feedback for this. Many thanks!
Update: I noticed that in some (not big) fraction of cases, m6A is also reported on sites where the reference has C and G nucleotides as well.
Hello @lubitelpospat,
This is similar to the another question.
tl;dr: For a given position modkit pileup reports all base modifications (including unmodified) where there is at least one valid read, a read with a base modification probability $\ge$ the pass-threshold.
When you have very deep coverage (looks like 7.5K here) you'll inevitably get a few reads with mismatches and confident base modification probabilities. You can filter these out by requiring that valid_coverage is above a certain number but I appreciate this is difficult to know a priori. I'm working to add an --only-matches flag that will discard mismatches as failed reads.
Hi @@ArtRand,
Thank you very much for a clarification! Yes, it would be very useful to have the --only-matches flag.
As always - thank you very much guys for your work!