vsearch icon indicating copy to clipboard operation
vsearch copied to clipboard

--weak_id not working

Open dleopold opened this issue 3 months ago • 7 comments

I am testing the --weak_id parameter for usearch_global, but it does not appear to be working. When I run a query with --id 0.85 I get ~10,000 of my query sequences matching my database. When I use --id 0.95 and --weak_id 0.85 I only get ~8,000 matches, which is the same as using just --id 0.95 without the weak_id parameter. I may be misinterpreting the way this parameter is supposed to work, but this does not seem right to me.

dleopold avatar Mar 26 '24 21:03 dleopold

@dleopold I think you are right, --weak_id does not seem to behave as it should. Sorry about that. Please be patient, the issue will be investigated after the Easter break.

My notes so far:

To illustrate the issue, here is a simple example. With q1 (query 1), search should report a weak hit with s2 (98.5% match) and a strong hit with s1 (100% match):

q1="AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"
s1="AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"  # perfect match
s2="AAACAAGAATACCACGACTAGCAGGAGTATCATGCTTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"  # weaker match (98.5%)
#                        substitution ^

## expect a weak hit with s2 and a strong hit with s1
vsearch \
    --usearch_global <(printf ">q1\n%s\n" "${q1}") \
    --db <(printf ">s2\n%s\n>s1\n%s\n" "${s2}" "${s1}") \
    --id 1.0 \
    --weak_id 0.98 \
    --quiet \
    --uc -

Instead, it only reports a hit with s1 (perfect match):

H	1	66	100.0	+	0	0	=	q1	s1

For future reference, here is the current documentation for --weak_id:

--weak_id real Show hits with percentage of identity of at least real, without terminating the search. A normal search stops as soon as enough hits are found (as defined by --maxaccepts, --maxrejects, and --id). As --weak_id reports weak hits that are not deduced from --maxaccepts, high --id values can be used, hence preserving both speed and sensitivity. Logically, real must be smaller than the value indicated by --id.

frederic-mahe avatar Apr 01 '24 17:04 frederic-mahe

Yes, there seems to be something wrong here. I'm looking into it. it seems like the weak hits found, but not shown. Thanks for reporting the issue.

torognes avatar Apr 04 '24 13:04 torognes

There is something wrong here, but be aware that the weak_id option may not work as expected. It is described here:

https://www.drive5.com/usearch/manual/weak_hits.html

It should only report weak hits as long as it is still scanning for true hits. Scanning will be terminated when the specified maximum number of accepted or rejected hits (specified with maxaccepts and maxrejects) has been reached . The results of using weak_id are therefore somewhat unpredictable. It's probably only useful to see if there are may be any weak hits as long as there are no true hits.

In the example above there is a true hit, and as long as maxaccepts is not more than 1, the weak hit will usually not be reported (unless rare cases where it is found before the true hit).

torognes avatar Apr 04 '24 13:04 torognes

It does not seem like usearch supports the weak_id option anymore, at least it seems to be ignored in version 11.0.667.

torognes avatar Apr 04 '24 15:04 torognes

I think it should work as intended with the latest commit (768da2f), but note that one needs to remove s1 or increase maxaccepts in the example above to see the weak hit.

torognes avatar Apr 04 '24 15:04 torognes

Thanks for working on this! For clarity, do hits with a %id below --id but above --weak_id count towards --maxrejects? In practice, using the --weak_id parameter to effectively manage search efficiency would seem to require something like a --max_weak_hits parameter. I would imagine most users of this feature want to stop searching quickly if a hit above --id is found, but would not want maxrejects to be the only stopping heuristic for weak hits since maxrejects is usually set significantly higher than maxaccepts.

dleopold avatar Apr 04 '24 15:04 dleopold

Yes, hits below id counts towards maxrejects.

torognes avatar Apr 04 '24 16:04 torognes

Yes, hits below id counts towards maxrejects.

corresponding tests have been added to our test-suite (see https://github.com/frederic-mahe/vsearch-tests/commit/865e95f8555218ce98603e10046afd7c5ebbae5e), and I've also modified a bit the description of --weak_id in the manpage https://github.com/torognes/vsearch/commit/452a5e1dd03b6af8bb677391d00df8b98f6d9a09

I am closing that issue, @dleopold thanks for reporting that issue.

frederic-mahe avatar May 03 '24 15:05 frederic-mahe