Influence of --algo mode on sensitivity
Hi,
I have a test case where the --algo parameter influences sensitivity when comparing the more-sensitive and ultra-sensitive modes (diamond version 2.0.4).
The more-sensitive mode with --algo 1 (query-indexed) retrieves a hit that is found neither with --algo 0 (double-indexed, the mode chosen by default) nor with the ultra-sensitive mode.
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0 --algo 0 #force double-indexed, the default chosen mode
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0 --algo 1 #force query-indexed
RIBOQ RIBO1 41.4 58 33 1 1 57 1 58 6.1e-08 37.0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --ultra-sensitive --query query.faa --db bank.faa --evalue 1.0
# returns no hits
The protein files:
$ cat query.faa
>RIBOQ
MAKLKTRRGAAKRFKATANGFKRKQAFKRHILTKKSAKRIRQLRGCVMVHVSDMNSVRRM
CPYI
$ cat bank.faa
>RIBO1
MPKAKTHSGASKRFRRTGTGKIVRQKANRRHLLEHKPSTRTRRLDGRTVVAANDTKRVTS
LLNG
>RIBO2
MPKIKTKKSFTKRFRITKNGIILRRSTGLNHYRSKKTGQQVRNSRKMVRISDSEYKKIKK
FLNI
Is this expected?
How should I interpret it?
And why is query-indexed not the default --algo mode here? The documentation says that "the program will automatically choose one of the algorithms based on the size of the query and database files", so I would expect query-indexed to be chosen because the query is very small.
Best, David
This happens due to masking of frequent seeds. You can find the hits in ultra-sensitive mode by setting --freq-sd 50. It may be worth tuning some parameters here if you are interested in finding these very weak hits.
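The tuning suggested above could be explored with a small sweep over --freq-sd in ultra-sensitive mode. This is only a sketch: the flags and file names are the ones used in this thread, the values 20, 50 and 200 are the ones mentioned here, and the helper name freq_sd_cmd and the DRY_RUN switch are made up for illustration. By default it only prints the commands; no results are implied.

```shell
#!/bin/sh
# Hypothetical helper: print the DIAMOND command line for one --freq-sd value.
# Flags and file names are taken from the test case in this thread.
freq_sd_cmd() {
    printf 'diamond blastp --ultra-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd %s\n' "$1"
}

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

for sd in 20 50 200; do
    cmd=$(freq_sd_cmd "$sd")
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        # Print the --freq-sd value, then the number of tabular hit lines on stdout.
        printf '%s\t' "$sd"
        $cmd | wc -l
    fi
done
```

Running it with DRY_RUN=0 in the directory containing query.faa and bank.faa would show how the number of reported hits changes with the masking threshold.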
The query-indexed mode is chosen based on the size ratio of query and database file. We also plan to revise this mode so results will be consistent with the double-indexed mode.
OK, but I don't understand why, in more-sensitive mode, using --freq-sd 20 (the default value of ultra-sensitive) finds the same hits as --freq-sd 200 (the default value of more-sensitive):
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 1
RIBOQ RIBO1 41.4 58 33 1 1 57 1 58 6.1e-08 37.0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --ultra-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
#no hits
I suppose that in ultra-sensitive mode more seeds are found, which is why --freq-sd should be increased. Is this correct?
What are the differences between more-sensitive and ultra-sensitive modes?
How should I interpret the impact of the --algo value in the more-sensitive mode?
Do you recommend using --algo 1 (query-indexed) for small query and database files of a few thousand (~5000) proteins (our use case is the comparison of bacterial proteomes)?
OK, but I don't understand why, in more-sensitive mode, using --freq-sd 20 (the default value of ultra-sensitive) finds the same hits as --freq-sd 200 (the default value of more-sensitive)
The query-indexed and double-indexed modes are two different algorithms, and it is possible that one algorithm finds a particular alignment while the other one does not, even if the second algorithm is the more sensitive one overall.
The same is true for the more-sensitive vs. the ultra-sensitive mode. The latter is a lot more sensitive overall, but there is no guarantee that every alignment found by the more-sensitive mode is also found by the ultra-sensitive mode. These are side effects of the heuristics and can't be completely avoided.
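Since neither algorithm is guaranteed to find a superset of the other's hits, one way to see the difference is to compare the query/subject pairs of two runs and, if needed, take their union. The sketch below uses the --more-sensitive outputs posted in this thread as inline data; the variable and function names are illustrative, not part of DIAMOND.

```shell
#!/bin/sh
# Tabular hits posted in this thread for --more-sensitive with --algo 0 and --algo 1.
algo0='RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2'
algo1='RIBOQ RIBO1 41.4 58 33 1 1 57 1 58 6.1e-08 37.0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2'

# Reduce a result to sorted, unique "query subject" pairs (columns 1 and 2).
pairs() { printf '%s\n' "$1" | awk '{print $1, $2}' | sort -u; }

tmp0=$(mktemp); tmp1=$(mktemp)
pairs "$algo0" > "$tmp0"
pairs "$algo1" > "$tmp1"

# comm expects sorted input: -13 keeps lines only in the second file,
# -23 keeps lines only in the first file.
only_algo1=$(comm -13 "$tmp0" "$tmp1")
only_algo0=$(comm -23 "$tmp0" "$tmp1")
# Union: every pair found by at least one of the two algorithms.
union=$(sort -u "$tmp0" "$tmp1")
rm -f "$tmp0" "$tmp1"

echo "only --algo 1: $only_algo1"
echo "only --algo 0: $only_algo0"
echo "union:"
echo "$union"
```

For the data above, the RIBOQ/RIBO1 pair shows up only in the --algo 1 list, matching what was reported in the question.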
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --more-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 1
RIBOQ RIBO1 41.4 58 33 1 1 57 1 58 6.1e-08 37.0
RIBOQ RIBO2 30.0 60 41 1 1 59 1 60 3.4e-06 31.2
$ diamond blastp --ultra-sensitive --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
#no hits
I suppose that in ultra-sensitive mode more seeds are found, which is why --freq-sd should be increased. Is this correct?
Increasing the --freq-sd parameter helps in any case, because fewer seeds will be masked, at the cost of increased runtime.
What are the differences between more-sensitive and ultra-sensitive modes?
The ultra-sensitive mode is in general a lot more sensitive than the more-sensitive mode. If you are interested in finding hits with bit scores of 30-40, it would be the better choice.
How should I interpret the impact of the --algo value in the more-sensitive mode?
You should use the query-indexed mode only if your query file is very small relative to the database; otherwise the double-indexed algorithm is better (you can adjust it to be more sensitive if needed).
Do you recommend using --algo 1 (query-indexed) for small query and database files of a few thousand (~5000) proteins (our use case is the comparison of bacterial proteomes)?
If your query and database files are small, you should use the double-indexed mode.