Retrieve hits that are an identical match to the query?

Open sapuizait opened this issue 3 years ago • 2 comments

Hi there,

I have been trying to use diamond blast to run searches that report only identical hits (100% identical, no mismatches).

What I have tried so far is to set --id to 100 and --query-cover and --subject-cover to 100. While this works well for long sequences (I tried a 3.5k-length query with the same sequence, or one with a single aa difference, as subject), it does not always give me the desired output. In another example, I used a 950 aa sequence as query and the same sequence, or slight variants of it (1 or 2 aa different), as subject, and I cannot distinguish between hits that are 100% and 99.7% identical (see results below): they all show up as 100% even when they are not (I assume because the numbers are rounded).

VAEP_351.6_7 UPI00090C7091 100.0 949 0 0 1 949 1 949 5.0e-314 1063.1
VAEP_351.6_7 UPI0005CFB5F3 100.0 949 0 0 1 949 1 949 5.0e-314 1063.1
VAEP_351.6_7 UPI000016541A 100.0 949 0 0 1 949 1 949 5.0e-314 1063.1
VAEP_351.6_7 UPI00179CEE32 100.0 949 0 0 1 949 1 949 5.0e-314 1063.1

In the output table above, all hits have the same percent identity and e-value, even though their alignments with the query differ slightly (by 1-2 aa), so the percent identity should be around 99.7% rather than 100%.

Is there any way to force the software to give me only the exactly identical matches?

Thanks P

sapuizait avatar May 11 '22 17:05 sapuizait

It should work using --id 100, because that threshold is applied to the unrounded number. If it does not, can you send me a test case? You can also use -f 6 nident to report the number of identities instead of the percentage, and compare that against the length. That said, using diamond to find identical sequences is using a cannon to kill a mosquito: it would be much faster to hash the sequences and look for identical hashes.

bbuchfink avatar May 12 '22 06:05 bbuchfink
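
The suggestion above to compare `nident` against the length can be sketched as a small post-filter in Python. This is only an illustration under stated assumptions: the search is assumed to have been run with a tabular format along the lines of `-f 6 qseqid sseqid nident length qlen slen` (check the field names against your DIAMOND version), and the function name `exact_hits` is made up for this example. A hit is kept only when every aligned position is identical and the alignment spans both sequences end to end.

```python
import csv
import io

# Assumed column order of the tabular output (an assumption, see the text above).
FIELDS = ["qseqid", "sseqid", "nident", "length", "qlen", "slen"]


def exact_hits(handle):
    """Yield (query, subject) pairs whose alignment is a full-length exact match."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        rec = dict(zip(FIELDS, row))
        nident, length, qlen, slen = (int(rec[k]) for k in FIELDS[2:])
        # Integer comparison, so no rounding is involved.
        if nident == length == qlen == slen:
            yield rec["qseqid"], rec["sseqid"]


if __name__ == "__main__":
    # Made-up rows for illustration only: one exact hit, one with a mismatch.
    demo = "q1\ts1\t949\t949\t949\t949\nq1\ts2\t948\t949\t949\t949\n"
    for pair in exact_hits(io.StringIO(demo)):
        print(*pair, sep="\t")
```

For real output, pass an open file handle instead of the `io.StringIO` demo. Because the comparison is done on integer counts, a hit with 948 identities over 949 positions cannot be mistaken for an exact match.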

Thanks. I am using v2.0.5, by the way. Here are some files: the query, the database (7 sequences, including the query sequence), and the output with and without --id 100. Most of the sequences that are reported as 100% identical are not (they are 99.89% or so). You can download all the files here: https://filebin.net/x6ugyan2ty9pw8ww

And yes, I see your point about using diamond blast for this job, but we have a few thousand sequences and we wanted to find the exact matches (if they exist) in UniParc...

sapuizait avatar May 12 '22 07:05 sapuizait
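
For the use case described here (a few thousand queries checked for exact matches in a large database), the hashing approach suggested earlier in the thread might look like the following. This is a minimal sketch, not anything DIAMOND provides: the function names are hypothetical, SHA-256 is just one reasonable choice of hash, and sequences are assumed to be plain FASTA records that should compare equal regardless of letter case. Only the query hashes are held in memory; the database is streamed once.

```python
import hashlib
from collections import defaultdict


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the id.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def digest(seq):
    """Hash a sequence; upper-casing is an assumption about what counts as identical."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


def exact_matches(query_lines, db_lines):
    """Yield (query_id, subject_id) for database sequences identical to a query."""
    by_hash = defaultdict(list)
    for qid, seq in read_fasta(query_lines):
        by_hash[digest(seq)].append(qid)
    for sid, seq in read_fasta(db_lines):
        for qid in by_hash.get(digest(seq), []):
            yield qid, sid


if __name__ == "__main__":
    # Toy records for illustration only.
    queries = [">q1", "MKV", "LA", ">q2", "AAAA"]
    database = [">s1 some description", "MKVLA", ">s2", "MKVLG", ">s3", "aaaa"]
    for qid, sid in exact_matches(queries, database):
        print(qid, sid, sep="\t")
```

With real data, open file handles for the query and database FASTA files can be passed in place of the lists. If hash collisions are a concern, the query sequences themselves can be stored alongside the digests and compared directly on a hit.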