modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Recommended MAPQ for modification analysis

Open kamaloxfordpathology opened this issue 9 months ago • 4 comments

Hi,

Hope you are well. I have direct RNA sequencing data, generated using the latest RNA flowcell. Performed basecalling and alignment with the recommended parameters for spliced reads including a junc-bed. It seems that 80% percent of my reads have MAPQ of about 0. About 10% percent of the reads have a MAPQ score of 60. Kindly recommend a MAPQ threshold for filtering reads before utilizing modkit for downstream analysis. Thanks a lot!

Best, K

kamaloxfordpathology avatar Mar 30 '25 10:03 kamaloxfordpathology

Hello,

Filtering thresholds for MAPQ really depend on your application. MAPQ or Mapping Quality is the phred-scaled probability that the read is aligned in the incorrect location. Crucially, it not a measure of basecalling performance or alignment accuracy. For example, if a single read aligns to two locations in a reference with the same accuracy, both alignments would have a MAPQ of 0, because it is not possible to distinguish which is the 'true' alignment.

Typically, the best-aligned read will be reported as the primary alignment and other alignments with similar accuracy will be reported as secondary alignments. The similarity thresholds and number of alignments to report are based on your aligner settings. In the case of MAPQ=0, some of the secondary alignments would have the same alignment accuracy as the primary alignment.

If you're aligning RNA reads to a diploid reference genome (e.g. T2T HG002), it is reasonable to expect MAPQ=0 for a large portion of the reads as they will align equally well to either haplotype. This can also be the case when looking at similar isoforms of the same gene, especially if your reads don't span the entire transcript. If it is important to look only at unique alignments, a higher MAPQ filter may be sensible but it will filter out a portion of your dataset.

It's up to you to determine what is right for your workflow!

Best, Bronya

Additional resources: SAM Format Specification: https://samtools.github.io/hts-specs/SAMv1.pdf Minimap2 Manual Page: https://lh3.github.io/minimap2/minimap2.html Other Discussions re MAPQ: https://github.com/lh3/minimap2/issues/223,

NanoBronya avatar Mar 31 '25 16:03 NanoBronya

Thanks a lot Bronya!. I have aligned to the hg38 genome. When calculating modkit pileup, would reads that can align equally well to other positions add uncertainity to the RNA modification at that position. Hence, I am leaning toward a MAPQ cut-off about 20. Thanks

Best, K

kamaloxfordpathology avatar Mar 31 '25 16:03 kamaloxfordpathology

Hello @kamaloxfordpathology

I have aligned to the hg38 genome. When calculating modkit pileup, would reads that can align equally well to other positions add uncertainity to the RNA modification at that position.

One thing to keep in mind, although I'm sure you already know this, is that pileup will only use primary alignments. However, if you want to look at the modification information contained in non-primary alignments, pass the -Y flag to dorado aligner when performing the alignment then modkit extract [full/calls] --allow-non-primary to get a table containing the base modification information. I'd recommend first selecting only modBAM records that you're interested in as input to modkit extract, since this table can be quite large. If you're seeing that multi-mapping reads are leading to ambiguity or "noise" in the pileup, I'd be keen to see it.

ArtRand avatar Mar 31 '25 20:03 ArtRand

Thanks @ArtRand . I shall get back to you with a example once I run a pileup with MAPQ threshold filtering. My understanding is that when MAPQ score is zero, an alignment is randomly chosen as the primary alignment. Since 80% of reads have a MAPQ of zero, would that not result in a large number of primary alignments being randomly assigned, hence contributing to the uncertainity during modkit pileup operation. Thanks a lot!.

kamaloxfordpathology avatar Apr 01 '25 06:04 kamaloxfordpathology