miniprot
miniprot copied to clipboard
Chimeric alignments spanning multiple adjacent genes
When aligning proteins with paralogs located in a cluster, miniprot
produces alignments encompassing two or more distinct genes.
For example, aligning mouse protein NP_001036176.1 to human chromosome 1 (NC_000001.11) returns 3 alignments, one of which aligns such that it encompasses 3 distinct genes.
$ miniprot --version
0.8-r220
$ efetch -db nucleotide -id NC_000001.11 -format fasta > genome.fa
$ efetch -db protein -id NP_001036176.1 -format fasta > prot.fa
## PAF output truncated for brevity
$ miniprot genome.fa prot.fa 2>/dev/null
NP_001036176.1 508 0 508 + NC_000001.11 248956422 103617440 103625743 1305 1533 0 AS:i:2173 ms:i:2468 np:i:470 da:i:0 do:i:0 cg:Z:56M345N49M810N51M3D12M445N77M766N44M895V40M97V33M1972N39M114V41M1338V62M
NP_001036176.1 508 0 508 + NC_000001.11 248956422 103571602 103579497 1299 1533 0 AS:i:2159 ms:i:2454 np:i:467 da:i:0 do:i:0 cg:Z:56M339N49M806N51M3D12M447N77M321N44M832V40M98V33M1949N39M114V41M1468V62M
NP_001036176.1 508 0 508 + NC_000001.11 248956422 103617440 103719868 1302 1533 0 AS:i:2155 ms:i:2450 np:i:468 da:i:0 do:i:0 cg:Z:56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V33M1972N39M114V41M1338V62M
These alignments can be seen in the following screenshot. Numbers adjacent to the alignment represent their ranking based on the alignment score. Appropriately, the chimeric alignment spanning multiple genes is ranked the lowest.
While the chimeric alignment in the previous case was ranked the lowest, and secondary to the best alignment it is not always the case. As an example, I have aligned 3 proteins (CAD87763.1, AAZ99031.1, AUO15579.1) to the genome sequence NC_069144.1 and all 3 alignments returned by miniprot
are chimeric as shown in the following screenshot:
The genes in blue boxes on the left and right hand sides are both "pheloloxidase-8-like" genes and encode proteins that are paralogous. All 3 sequences align such that a portion of the sequence aligns to one paralog and the rest aligns to the other paralog.
The alignments are riddled with mismatches and indels, and understandably, when we align distant proteins to a genome it becomes increasingly difficult to arrive at perfect alignments. So, what's an aligner to do? Perhaps an appropriate outcome to this scenario would be to, say, generate two (subpar) alignments for each protein, with unaligned tails and identify the better one as primary. Splign and Prosplign offer some inspiration by using "compart" logic that avoids generating alignments spanning multiple distant locations. Could a similar approach be applied to miniprot
?
I have a dataset of over >165k proteins aligned to the Anopheles cruzii genome and I see chimeric alignments such as the one described above quite often. After playing with -J
and -E
parameters individually and in combination, I have made little progress in systematically avoiding them.
I had a look at the alignment AAZ99031.1. Miniprot found the longer alignment because the shorter alignment is not as good. I don't know how to solve this. How does "compart logic" work?
Also, are there correct alignments around this locus? Maybe you can use those to filter out poor hits?
Thank you @lh3 for taking a look at this. The "compart logic" used in Splign is described here: https://pubmed.ncbi.nlm.nih.gov/18495041/ Binaries for Splign and Compart as well as the source code can be found on the splign website: https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi
Also, are there correct alignments around this locus? Maybe you can use those to filter out poor hits?
I should be able to get "correct" alignments but I am not sure how to use those alignments to filter out poor hits? More importantly though, using AAZ99031.1 as an example, I would like to get the shorter alignment even if it is relatively poor quality compared to the longer ones. Is that possible?
@lh3 -- just checking if you have had a chance to look at the compart code...