DIAMOND: influence of --algo mode on sensitivity

Hi,

I have a test case where the --algo parameter influences sensitivity when comparing the more-sensitive and ultra-sensitive modes (diamond version 2.0.4).

The more-sensitive mode with --algo 1 (query-indexed) retrieves a hit that is found neither with --algo 0 (double-indexed, the mode chosen by default) nor with the ultra-sensitive mode.

$ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0 
RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2
$ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0  --algo 0 #force double-indexed, the default chosen mode 
RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2
$ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0  --algo 1 #force query-indexed
RIBOQ	RIBO1	41.4	58	33	1	1	57	1	58	6.1e-08	37.0
RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2
$ diamond blastp --ultra-sensitive  --query query.faa --db bank.faa  --evalue 1.0
# returns no hits

The protein files:

$ cat query.faa
>RIBOQ
MAKLKTRRGAAKRFKATANGFKRKQAFKRHILTKKSAKRIRQLRGCVMVHVSDMNSVRRM
CPYI

$ cat bank.faa
>RIBO1
MPKAKTHSGASKRFRRTGTGKIVRQKANRRHLLEHKPSTRTRRLDGRTVVAANDTKRVTS
LLNG
>RIBO2
MPKIKTKKSFTKRFRITKNGIILRRSTGLNHYRSKKTGQQVRNSRKMVRISDSEYKKIKK
FLNI

Is this expected? How should I interpret it?
Also, why is "query-indexed" not the mode chosen by default for --algo here? The documentation says that "the program will automatically choose one of the algorithms based on the size of the query and database files", so I would expect query-indexed to be chosen, since the query is very small.

Best, David

Nov 24 '20 15:11 dvallenet

This happens due to masking of frequent seeds. You can find the hits in ultra-sensitive mode by setting --freq-sd 50. It may be worth tuning some parameters here if you are interested in finding these very weak hits.

The query-indexed mode is chosen based on the size ratio of query and database file. We also plan to revise this mode so results will be consistent with the double-indexed mode.

Nov 24 '20 20:11 bbuchfink

OK, but I don't understand why, in more-sensitive mode, --freq-sd 20 (the default value for ultra-sensitive) finds the same hits as --freq-sd 200 (the default value for more-sensitive):

$ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2

$ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 1
RIBOQ	RIBO1	41.4	58	33	1	1	57	1	58	6.1e-08	37.0
RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2

$ diamond blastp --ultra-sensitive  --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
#no hits

I suppose that more seeds are found in ultra-sensitive mode, which is why --freq-sd should be increased. Is this correct? What are the differences between the more-sensitive and ultra-sensitive modes?

How should I interpret the impact of the --algo value in more-sensitive mode? Do you recommend using --algo 1 (query-indexed) for small query and database files made up of a few thousand (~5000) proteins (our use case is the comparison of bacterial proteomes)?

Nov 25 '20 20:11 dvallenet

> OK, but I don't understand why, in more-sensitive mode, --freq-sd 20 (the default value for ultra-sensitive) finds the same hits as --freq-sd 200 (the default value for more-sensitive)

The query-indexed and double-indexed modes are two different algorithms, and it is possible that one algorithm finds a particular alignment while the other one does not, even if the second algorithm is the more sensitive one overall.

The same is true for the more-sensitive vs. the ultra-sensitive mode. The latter is a lot more sensitive overall, but there is no guarantee that every alignment found by the more-sensitive mode is also found by the ultra-sensitive mode. These are side effects of the heuristics and cannot be completely avoided.
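
One way to see exactly which alignments differ between two runs is to compare the query/subject pairs in their tabular outputs. The following is a minimal sketch, not part of DIAMOND: hit_pairs and diff_hits are hypothetical helper names, and it assumes the default tabular output shown in this thread, with the query ID in column 1 and the subject ID in column 2.

```shell
# Hypothetical helpers (not part of DIAMOND) for comparing two result files.
# Assumption: tab-separated output, column 1 = query ID, column 2 = subject ID.

hit_pairs() {
    # Print the unique "query<TAB>subject" pairs of one result file,
    # sorted in the C locale so that comm can compare them.
    cut -f1,2 "$1" | LC_ALL=C sort -u
}

diff_hits() {
    # $1 and $2: tabular result files from two different runs.
    a=$(mktemp)
    b=$(mktemp)
    hit_pairs "$1" > "$a"
    hit_pairs "$2" > "$b"
    echo "only in $1:"
    LC_ALL=C comm -23 "$a" "$b"   # pairs present in the first run only
    echo "only in $2:"
    LC_ALL=C comm -13 "$a" "$b"   # pairs present in the second run only
    rm -f "$a" "$b"
}
```

For example, if each run is written to its own file with --out (the placeholder names algo0.tsv and algo1.tsv would do), then diff_hits algo0.tsv algo1.tsv lists the pairs unique to each run.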

> $ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
> RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2
>
> $ diamond blastp --more-sensitive  --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 1
> RIBOQ	RIBO1	41.4	58	33	1	1	57	1	58	6.1e-08	37.0
> RIBOQ	RIBO2	30.0	60	41	1	1	59	1	60	3.4e-06	31.2
>
> $ diamond blastp --ultra-sensitive  --query query.faa --db bank.faa --evalue 1.0 --freq-sd 20 --algo 0
> #no hits
>
> I suppose that more seeds are found in ultra-sensitive mode, which is why --freq-sd should be increased. Is this correct?

Increasing the --freq-sd parameter helps in any case, because fewer seeds will be masked, at the cost of increased runtime.
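
To get a feel for this trade-off on a given data set, one could count the hits reported at several --freq-sd values. This is only a sketch: sweep_freq_sd is a hypothetical helper name, the DIAMOND variable is an assumption added so that a specific binary can be selected, and the remaining options are the ones used in the commands in this thread.

```shell
# Hypothetical helper (not part of DIAMOND): count reported hits per --freq-sd value.
# Usage: sweep_freq_sd QUERY DB SD [SD ...]
# Assumption: hits are written to stdout, one per line, as in the runs above.
sweep_freq_sd() {
    query=$1
    db=$2
    shift 2
    for sd in "$@"; do
        # Log messages on stderr are discarded; only hit lines are counted.
        n=$("${DIAMOND:-diamond}" blastp --ultra-sensitive \
                --query "$query" --db "$db" --evalue 1.0 \
                --freq-sd "$sd" 2>/dev/null | wc -l | tr -d ' ')
        # One line per setting: freq-sd value, then number of hits.
        printf '%s\t%s\n' "$sd" "$n"
    done
}
```

Calling sweep_freq_sd query.faa bank.faa 20 50 200 would then print one line per setting, which makes it easy to see at which value additional hits start to appear.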

> What are the differences between the more-sensitive and ultra-sensitive modes?

The ultra-sensitive mode is in general a lot more sensitive than the more-sensitive mode. If you are interested in finding hits with bit scores of 30-40, it is the better choice.

> How should I interpret the impact of the --algo value in more-sensitive mode?

You should use the query-indexed mode only if your query file is very small relative to the database; otherwise the double-indexed algorithm is better (you can adjust it to be more sensitive if needed).

> Do you recommend using --algo 1 (query-indexed) for small query and database files made up of a few thousand (~5000) proteins (our use case is the comparison of bacterial proteomes)?

If your query and database files are small, you should use the double-indexed mode.

Nov 25 '20 21:11 bbuchfink