diamond icon indicating copy to clipboard operation
diamond copied to clipboard

Diamond and frameshifts in longer hits. Different results in Blastx vs Blastp [Bug]

Open kevinmoran1988 opened this issue 2 years ago • 2 comments

Hello,

I have an issue where diamond sometimes overlooks frameshifts when comparing longer queries against subjects.

I am running the following command.

diamond blastx -d test.dmnd -q input.fa -o test.txt --very-sensitive --max-hsps 0 --masking 0 --outfmt 6 qseqid sseqid qframe evalue bitscore qstart qend sstart send qseq_translated

Here is the input and the dmnd database. test.zip

Running this input against the diamond database returns the following. (Truncated for brevity)

NODE_175077 224129_0:0009af -3 5.42e-69 204 724 50 29 218 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKS NODE_175077 224129_0:0009a1 -3 6.75e-66 201 724 56 29 216 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITG NODE_175077 77166_0:001cbe -3 1.95e-65 200 724 56 28 215 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITG NODE_175077 7048_0:0021c2 -3 2.12e-64 197 724 47 28 218 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKSC

The internal X's are obviously suspicious.

Running HMMER returns the following.

Query: 1612at33392.aln [M=358] Scores for complete sequences (score includes all domains): --- full sequence --- --- best 1 domain --- -#dom- E-value score bias E-value score bias exp N Sequence Description ------- ------ ----- ------- ------ ----- ---- -- -------- ----------- 6.8e-71 228.3 2.6 3.8e-43 137.0 0.5 2.0 2 NODE_175077 :[revcomp]:[translate(3)] 2.9e-20 61.7 0.1 4.1e-20 61.3 0.1 1.1 1 NODE_175077 :[revcomp]:[translate(2)]

Domain annotation for each sequence (and alignments):

NODE_175077 :[revcomp]:[translate(3)] score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc


1 ! 91.3 0.1 9.9e-30 3e-29 29 95 .. 9 75 .. 7 75 .. 0.98 2 ! 137.0 0.5 1.3e-43 3.8e-43 135 218 .. 150 233 .. 150 235 .. 0.98

Alignments for each domain: == domain 1 score: 91.3 bits; conditional E-value: 9.9e-30 1612at33392.aln 29 sseavsilkqkglakqtgasleliasdvldhyrtysllekllsapsklseqlafqiepqtkqlliek 95 +seav+il++kgl+ +tgasleliasdvldhyrtyslleklls+p+kl+eqlafqi+pqt+ lliek NODE_175077 9 ASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAFQIDPQTRLLLIEK 75 79***************************************************************97 PP

== domain 2 score: 137.0 bits; conditional E-value: 1.3e-43 1612at33392.aln 135 rrqfdnvkrvfkaveelpgslvanikktfllseelakkyaavvflaclrfetskrklqylsfedlkkcaeaimesWtytltgpe 218 rrqfdnvkrvfk+vee+pg +++nik+ f+l++elakkya++vf++c+rfetsk+klqyl f+++kkc+eaime+Wty++tg+ NODE_175077 150 RRQFDNVKRVFKTVEEMPGPIIENIKNLFYLPDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKS 233 8*******************************************************************************9975 PP

NODE_175077 :[revcomp]:[translate(2)] score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc


1 ! 61.3 0.1 1.4e-20 4.1e-20 95 135 .. 92 132 .. 89 132 .. 0.98

Alignments for each domain: == domain 1 score: 61.3 bits; conditional E-value: 1.4e-20 1612at33392.aln 95 kyyelddavirellGkklssksrkdldevaektgvklkscr 135 +yye+dd+virellGkklssk+rkdldevae+tg++lkscr NODE_175077 92 RYYEFDDEVIRELLGKKLSSKNRKDLDEVAERTGKPLKSCR 132 6***************************************8 PP

HMMER returns a frameshift here.

This is a rather rare, but frustrating, issue. Happening only 27 times across USCOs in the genome I'm testing.

A very particular set of circumstances must occur to trigger this bug. The query sequence must match the subject in frame X, experience a frameshift to Y, and then experience another frameshift back to X. (Diamond correctly handles a frameshift from X to Y to Z.) When these circumstances arise Diamond returns one hit in frame X and drops the internal frame Y section.

IMO, what diamond SHOULD do is report two hits in frame X and one in frame Y for the query sequence. The below is also acceptable.

I can prepare several additional examples of this bug if you would like them for testing.

Edit: This issue only happens when running in blastx mode. Diamond finds the hit when I use blastp and feed it translated sequences.

NODE_175077:[revcomp]:[translate(3)] 41895_0:001bbb 0 7.17e-71 214 9 231 29 215 NODE_175077:[revcomp]:[translate(2)] 41895_0:001bbb 0 7.65e-17 70.9 92 132 95 135

kevinmoran1988 avatar Dec 26 '23 15:12 kevinmoran1988

Yes I'm afraid the --max-hsps logic seems to be broken in blastx mode. The only solution for now is to use the frameshift mode, you will get the correct alignment:

diamond blastx -q input.fa -d test.dmnd -F 15 -k0 -f 0 --masking 0 --very-sensitive --max-hsps 0
BLASTP 2.3.0+


Query= NODE_175077

Length=750

>224129_0:0009af
Length=220

 Score = 270 bits (689),  Expect = 1.35e-94
 Identities = 146/231 (63%), Positives = 171/231 (74%), Gaps = 35/231 (15%)
 Frame = -3

Query  742  WT*FILASEAVAILNEKGLGIETGASLELIASDVLDHYRTYSLLEKLLSSPNKLQEQLAF 563
            W     ASEAV+ LN++GLG +TGASLELIASDVLDHYRTYSLLEKLL++PNKLQEQLAF
Sbjct   23  WIQGFSASEAVSTLNQRGLGQKTGASLELIASDVLDHYRTYSLLEKLLTNPNKLQEQLAF 82

Query  562  QIDPQTRLLLIE/KVNY*TSSV*CT*CLPFRYYEFDDEVIRELLGKKLSSKNRKDLDEVA 387
            QIDP TR  LIE                   YY  DD V+RELLGKKLSSK+RKDLDEVA
Sbjct   83  QIDPDTRQFLIE-S-----------------YYAIDDNVVRELLGKKLSSKHRKDLDEVA 124

Query  386  ERTGKPLKSCR*NLAYFVYYVFN*N*FL\RRQFDNVKRVFKTVEEMPGPIIENIKNLFYL 209
            E+TG PLKSC                   RRQFDN+KR+FK+VEEMPGPI++NI+ LFYL
Sbjct  125  EKTGVPLKSC-------------------RRQFDNIKRIFKSVEEMPGPIVQNIQKLFYL 165

Query  208  PDELAKKYACIVFITCIRFETSKKKLQYLPFNSWKKCCEAIMEQWTYKITGKS 50
             ++LA+KYACIVF+ CIRFETSK++LQYL F + K+C E IM+ WTY +TGKS
Sbjct  166  HEDLARKYACIVFLACIRFETSKRRLQYLDFITLKQCTEVIMDLWTYNVTGKS 218

bbuchfink avatar Jan 08 '24 14:01 bbuchfink

Thanks for the comment. Hopefully this is fixed soon.

Sadly, this isn't a viable workaround for me. First I'd have to store the mapped translated query. This would at minimum quintuple the storage footprint of diamond output even after compression.

Second, I'd have to parse every single hit to scan for this issue. This would cause memory usage and post processing time to exponentially increase. This is the major problem.

Translation beforehand and running the queries through blastp mode was also unsatisfactory. It's about 4x slower than processing queries in blastx mode.

The workaround I arrived at was passing all my diamond hits through HMMER. This serves as an additional filter of LQ hits and it's easy to detect these internal frameshifts when they happen.

This allows me to still take advantage of diamond exponentially faster speed to initially map queries to targets. 95% of sequences never see HMMER while the remaining 5% or so are only checked against a single HMM.

kevinmoran1988 avatar Jan 11 '24 15:01 kevinmoran1988