ngmlr
ngmlr copied to clipboard
Overestimated mapping quality of the shorter piece in a split alignment
The sequence at the end of the message is a simulated sequence from GRCh38. The true position is chr7:4124634-4126723. ngmlr-0.2.5 gives a split alignment. The longer alignment is correct with mapq=33; the shorter alignment, of length 122bp, is wrongly mapped and gets mapq=60. I understand why ngmlr may want to split the alignment, but having mapq=60 on a 122bp wrong alignment seems counterintuitive to me. At ~15% error rate, a 122bp short sequence can hardly be correctly mapped. I have multiple examples like this. It is not a very rare issue.
In practice, such a short segment probably will get filtered out quickly, but ideally, it would be good if ngmlr could report a more accurate mapping quality. Fixing this would also help to avoid potential pitfalls by less careful pipelines.
>test
TATTGTAAAGAAAATTTTCTTTTTCTTTCATTTTTTCAGACTGGAGAAGCAGTGGTGCGATACTTGGCTCACTGCAAACTTCGCCTCCTG
GGTATCAAGCCATTCTCCTGTCTCAGCCCTCCCAAAGTAGCTGGGTATACAGTCAATGTGCCACCACGCCTGGCCTGATGTTTTGAATCT
TTTACTAGTGATGGGGTTCTACCCATGTTTGGCCAGGTGGTCTCGAATGCCTAGACCTAAGTGGATTTCTCGGCCTTCGGCTTCCCAGGC
TGCGGATTACTGGCGTGTGCCACCGCCTTTCCCGGTCCCAGCATGTAAGATCATTTTTCTTTACTTTCCAGTGAAATCTCCTCTTCTCGT
GTACCACACATTCCGAACTTGAACTAATTTGTCTTGATATGTTTGTTATACTGATTCCAAACAATATTATGTATAAAAAGCAGTGAACTT
TAAAAACAAAAAGGGTCATTGTTATTCTAGACAAGTCTAATATTGCTTTTGTTCATTTTTACTTCGTTGGGACCGAATTTAACAAACTTC
AAACACCATTCAGAAGCGAGAAAAAATCCTGGACCTTGTGCTATGGCCCCTAGCAAGTCAGGGAAAGGCCCTGTTTATAAACGAGCATGG
AGGAAACTGCCTTATTCAGCCTGTGGTGAACCTGGAAGATCCGGCCAATACTTAGGGCCCCCATTGGTTGAAAAAGGAGGGAGAGTACAC
TTCTAGTCCAAACCATATCCCATTCGGGGGTTCACTCTTGCAAGAATTAAGCCAACGGAGGTCTGGGAGGATACTGATTGTCGGTGGCCA
AAGTCTCTTTTTTGAATCAGGTGGTTCCGGAATTCAAAACCTAGGGAAATGCCTGGAGTGTGGAGTAAGGCCCCGGAACTAGGGTGGGGC
TGAGATGCTGCTCTGATCGACCTGTCCTCAGGTTGTCCCCTACTTGTGAAAAGGTAAGCCTGGCCTACCTTACCCTACTTAAGAGTCTCT
ACTCTCCCCTCCCGGTGGTCAGCACTATGTTGGTCGCAGGTGTGATCAATAGCTAGGTTGCTTTTGAAGAATAGTGAAAGGCGAGGATTG
CCTTATTCATTGACCAGTCCCTAAGCCTTTAGAAGTGCAGAATGAAAGTAAACCGTGAGGTGCCGTGGTAGGGGCACTGGCACACGAGGA
AGGGGGAATGCGCTAGGAAGCAGTCTGCTTTCCCTTCTCCTTCACAGAGCCGCTACTAGGGCTCAAAGAACAGAAGGATTGGGCACATGA
TGCACCTCTGTAGCAGGCAACACTGCGAAGGACTTCCTCTGCTCTTTATAGACCCCCCTCAATCCTATCCATCCAACATCCATTCCAATC
TCCAATCCTCCATCCCCATCCATCCAATCAATCAATCCAATCCAGTGATGCCCATCATCCATTCATCCATTCCCCCATTCATGACCATTC
CTCAATCCCCCTCCATCCATCAATCCAATCCATCTATCCATTCGTCCCACCCACCCATCTACCCAACCATTCATCCATTCACATCCATCC
ATCATCTTATCATTCCATCCAATCTCATTCACCCCATTCCATCGATTCATTCCCCATCCATCTATCCATCTGCCACATCCATCCATCCAA
TCACCCCATCCAATCCATCATTCGCTCCATCCATCATCCATGTGTCCATCTACCCATCCATCCATTCCATCTCTCATCAACCCATCCTCC
ATCCATCCACCCATCCATCCCACCATAATCGATACATTCCAGCTTTCCATCCCATGCCATTCACCATCCAATCCATTCCTCAATCCATCC
AATTCACATCGCCATCCACCATCCTTTCCACCCATTTACCCATACCATCCATTTCCAATATTCACTTCATCCTTCCCATTTTTTACCAAT
CTATCCATTCACCTCATCCATTATCCAACTCATCCCATACATCCAAATCATCCCCCCTCGACCGATCCATCCATCCATGCTTTCCATTCG
TCTACCCACCATCCATCCCACAATCATTCCCTTCCGTCGCCCCACCCCATCCTGTCATCTACCATCACCCATCTATCCATTCGCCCAACA
CCCTATCCATCTATCCAATTTCACTCATCTATTCTTTCAGTCCATTCAGACTTCCTCATTAAAAAGGAGACCTGAGAAGGGTAGCAACTG
TTAGTTTAGTCTCAAAAGACAGTATAAGTAACATGGGTCGGGGGGCTTGGAGAATAATGCAACCATGGCCATTTTATTTGACAAGATTTT
CTACACGAGGAGAAGTTTGATCTTTGTTATGGCTTGACCTGTGCCCCCTGGG
Thanks for pointing this out.
We have tried different schema for mapping quality and settled with the current one (see Supplement: http://www.biorxiv.org/content/early/2017/07/28/169557 . However, we are further looking into this issue.
Also we see that some reads are sometimes more often split. They don't seem to accumulate in certain regions. Thus, its not a big thing for Sniffles, but we are also working on this. As you know its sometimes hard to decide to split or not in such high sequencing error case and some sequence compositions.
Thanks Fritz
Here is actually a plot from us that illustrates the problem over some GiaB data.
An effective heuristic is to intentionally penalize mapq if the alignment length is short and/or the alignment contains few seed hits. The reasoning behind is that it is more likely for the aligner to miss short suboptimal alignments.
Another simpler heuristic that is only applicable to split alignment is to cap the mapping quality of shorter alignments by the mapq of the longest alignment. This makes sense because for the same read, shorter alignments usually have a higher chance to be wrong than the longest alignment.
Yes, we are currently using the MQ information derived from mapping the 256bp segments per read. Thus, the short segment on its own looks unique, but for the long read it should be penalized. We will work on that and I would be happy to discuss that further with you.
Thanks Fritz