easy-search SAM alignment extends off the end of the reference
Thanks for your hard work on MMSeqs!
I ran the following using v17.b804f installed via bioconda:
mmseqs easy-search --search-type 2 --format-mode 1 --greedy-best-hits 1 query.fna hiv.fna result.sam tmp
Query:
>test
CAAAGACTGCTGACACAGACTGCTGACATGAAGATTGCTGACAGGGACTTTCCGCGGGACTTTCCAGGGGGGTGTGGTTTGGGCGGAGTTGGGGAGTGGCTAACCCTCAGATGCTGCATATAAGCAGCGGCTTCTCGCTTGTACTGGGTCTCTCTGATT
Reference: NC_001802.1
I got the following alignment:
@HD VN:1.4 SO:queryname
@SQ SN:NC_001802.1 LN:9181
test 16 NC_001802.1 9094 254 111M * 0 0 AGAGAGACCCAGTACAAGCGAGAAGCCGCTGCTTATATGCAGCATCTGAGGGTTAGCCACTCCCCAACTCCGCCCAAACCACACCCCCCTGGAAAGTCCCGCGGAAAGTCC * AS:i:118 NM:i:13
However, I think there is a bug as the start coordinate is 9094 and alignment length 111 (from CIGAR) = end of alignment 9195, which is longer than the reference (9181bp). What I think is going on is the start coordinate is flipped with the end coordinate, as when I TBLASTX these sequences, the alignment spans ~8994 to ~9098:
OR
and when I output in SAM via blastn, I get the following:
@HD VN:1.2 SO:coordinate GO:reference
@SQ SN:Query_1 LN:9181
@PG ID:0 VN:2.16.0+ CL:blastn.REAL -query hiv.fna -db query.fna -outfmt 17 -task blastn PN:blastn
BL_ORD_ID:0 0 Query_1 8956 255 18H37M2D104M * 0 0 * *AS:i:168 EV:f:1.36602e-40 NM:i:2 PI:f:85.11 BS:f:152.769
Not sure if related to https://github.com/soedinglab/MMseqs2/issues/845. I tried reviewing the source but couldn't spot the issue and it looks like MMSeqs tries to do the correct thing (as per SAM spec) here, https://github.com/soedinglab/MMseqs2/blob/8ef870f95af2a3ee474c2cdbb845f5f007fe5be6/src/util/convertalignments.cpp#L733-L734, so I don't know - any ideas? Thanks again!