Diamond and frameshifts in longer hits. Different results in Blastx vs Blastp [Bug]
Hello,
I have an issue where diamond sometimes overlooks frameshifts when comparing longer queries against subjects.
I am running the following command.
diamond blastx -d test.dmnd -q input.fa -o test.txt --very-sensitive --max-hsps 0 --masking 0 --outfmt 6 qseqid sseqid qframe evalue bitscore qstart qend sstart send qseq_translated
Here is the input and the dmnd database. test.zip
Running this input against the diamond database returns the following. (Truncated for brevity)
NODE_175077 224129_0:0009af -3 5.42e-69 204 724 50 29 218 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKS NODE_175077 224129_0:0009a1 -3 6.75e-66 201 724 56 29 216 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITG NODE_175077 77166_0:001cbe -3 1.95e-65 200 724 56 28 215 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITG NODE_175077 7048_0:0021c2 -3 2.12e-64 197 724 47 28 218 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKSC
The internal X's are obviously suspicious.
Running HMMER returns the following.
Query: 1612at33392.aln [M=358] Scores for complete sequences (score includes all domains): --- full sequence --- --- best 1 domain --- -#dom- E-value score bias E-value score bias exp N Sequence Description ------- ------ ----- ------- ------ ----- ---- -- -------- ----------- 6.8e-71 228.3 2.6 3.8e-43 137.0 0.5 2.0 2 NODE_175077 :[revcomp]:[translate(3)] 2.9e-20 61.7 0.1 4.1e-20 61.3 0.1 1.1 1 NODE_175077 :[revcomp]:[translate(2)]
Domain annotation for each sequence (and alignments):
NODE_175077 :[revcomp]:[translate(3)] score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
1 ! 91.3 0.1 9.9e-30 3e-29 29 95 .. 9 75 .. 7 75 .. 0.98 2 ! 137.0 0.5 1.3e-43 3.8e-43 135 218 .. 150 233 .. 150 235 .. 0.98
Alignments for each domain: == domain 1 score: 91.3 bits; conditional E-value: 9.9e-30 1612at33392.aln 29 sseavsilkqkglakqtgasleliasdvldhyrtysllekllsapsklseqlafqiepqtkqlliek 95 +seav+il++kgl+ +tgasleliasdvldhyrtyslleklls+p+kl+eqlafqi+pqt+ lliek NODE_175077 9 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEK 75 79***************************************************************97 PP
== domain 2 score: 137.0 bits; conditional E-value: 1.3e-43 1612at33392.aln 135 rrqfdnvkrvfkaveelpgslvanikktfllseelakkyaavvflaclrfetskrklqylsfedlkkcaeaimesWtytltgpe 218 rrqfdnvkrvfk+vee+pg +++nik+ f+l++elakkya++vf++c+rfetsk+klqyl f+++kkc+eaime+Wty++tg+ NODE_175077 150 RRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKS 233 8*******************************************************************************9975 PP
NODE_175077 :[revcomp]:[translate(2)] score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
1 ! 61.3 0.1 1.4e-20 4.1e-20 95 135 .. 92 132 .. 89 132 .. 0.98
Alignments for each domain: == domain 1 score: 61.3 bits; conditional E-value: 1.4e-20 1612at33392.aln 95 kyyelddavirellGkklssksrkdldevaektgvklkscr 135 +yye+dd+virellGkklssk+rkdldevae+tg++lkscr NODE_175077 92 RYYEFDDEVIRELLGKKLSSKNRKDLDEVAERTGKPLKSCR 132 6***************************************8 PP
HMMER returns a frameshift here.
This is a rather rare, but frustrating, issue. Happening only 27 times across USCOs in the genome I'm testing.
A very particular set of circumstances must occur to trigger this bug. The query sequence must match the subject in frame X, experience a frameshift to Y, and then experience another frameshift back to X. (Diamond correctly handles a frameshift from X to Y to Z.) When these circumstances arise Diamond returns one hit in frame X and drops the internal frame Y section.
IMO, what diamond SHOULD do is report two hits in frame X and one in frame Y for the query sequence. The below is also acceptable.
I can prepare several additional examples of this bug if you would like them for testing.
Edit: This issue only happens when running in blastx mode. Diamond finds the hit when I use blastp and feed it translated sequences.
NODE_175077:[revcomp]:[translate(3)] 41895_0:001bbb 0 7.17e-71 214 9 231 29 215 NODE_175077:[revcomp]:[translate(2)] 41895_0:001bbb 0 7.65e-17 70.9 92 132 95 135
Yes I'm afraid the --max-hsps logic seems to be broken in blastx mode. The only solution for now is to use the frameshift mode, you will get the correct alignment:
diamond blastx -q input.fa -d test.dmnd -F 15 -k0 -f 0 --masking 0 --very-sensitive --max-hsps 0
BLASTP 2.3.0+
Query= NODE_175077
Length=750
>224129_0:0009af
Length=220
Score = 270 bits (689), Expect = 1.35e-94
Identities = 146/231 (63%), Positives = 171/231 (74%), Gaps = 35/231 (15%)
Frame = -3
Query 742 WT*FILASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAF 563
W ASEAV+ LN++GLG +TGASLELIASDVLDHYRTYSLLEKLL++PNKLQEQLAF
Sbjct 23 WIQGFSASEAVSTLNQRGLGQKTGASLELIASDVLDHYRTYSLLEKLLTNPNKLQEQLAF 82
Query 562 QIDPQTRLLLIE/KVNY*TSSV*CT*CLPFRYYEFDDEVIRELLGKKLSSKNRKDLDEVA 387
QIDP TR LIE YY DD V+RELLGKKLSSK+RKDLDEVA
Sbjct 83 QIDPDTRQFLIE-S-----------------YYAIDDNVVRELLGKKLSSKHRKDLDEVA 124
Query 386 ERTGKPLKSCR*NLAYFVYYVFN*N*FL\RRQFDNVKRVFKTVEEMPGPIIENIKNLFYL 209
E+TG PLKSC RRQFDN+KR+FK+VEEMPGPI++NI+ LFYL
Sbjct 125 EKTGVPLKSC-------------------RRQFDNIKRIFKSVEEMPGPIVQNIQKLFYL 165
Query 208 PDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKS 50
++LA+KYACIVF+ CIRFETSK++LQYL F + K+C E IM+ WTY +TGKS
Sbjct 166 HEDLARKYACIVFLACIRFETSKRRLQYLDFITLKQCTEVIMDLWTYNVTGKS 218
Thanks for the comment. Hopefully this is fixed soon.
Sadly, this isn't a viable workaround for me. First I'd have to store the mapped translated query. This would at minimum quintuple the storage footprint of diamond output even after compression.
Second, I'd have to parse every single hit to scan for this issue. This would cause memory usage and post processing time to exponentially increase. This is the major problem.
Translation beforehand and running the queries through blastp mode was also unsatisfactory. It's about 4x slower than processing queries in blastx mode.
The workaround I arrived at was passing all my diamond hits through HMMER. This serves as an additional filter of LQ hits and it's easy to detect these internal frameshifts when they happen.
This allows me to still take advantage of diamond exponentially faster speed to initially map queries to targets. 95% of sequences never see HMMER while the remaining 5% or so are only checked against a single HMM.