sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

[WIP] Remove min_n_below from search code

Open luizirber opened this issue 3 years ago • 6 comments

There are three main cases when searching an index in sourmash (I'll use Q as query signature and s as a signature/internal node for the analysis):

Case 1: similarity searches

We want to find any signature that is above a similarity threshold in a collection: J(Q, s) = |Q ∩ s| / |Q ∪ s|. Since |Q| <= |Q ∪ s|, we can use |Q| as an upper bound (because it can overestimate the similarity, but never underestimate).

Case 2: containment searches

We want to find what signatures are above a containment threshold to the query (with sourmash search --containment) or which one is the best match (maximizes the query containment in the signature), as in sourmash gather. C(Q, s) = |Q ∩ s| / |Q|. Similar to case 1, we just need |Q|.

Case 3: reverse containment searches

We want to find what signatures are above a containment threshold to the query like case 2, but we want to maximize the signature containment (not the query containment). C(s, Q) = |Q ∩ s| / |s|. This is the case where min_n_below is necessary to bound |s|. It is also not used in any place in sourmash.

min_n_below is too pessimistic, and ends up making search slower.

Because the only place we really need min_n_below is not a use case for sourmash, the consequence is that min_n_below is too pessimistic when pruning subtrees during search, and it ends up making search slower.

This all came from the realization that I was thinking about gather backwards. The best match in gather maximizes C(Q,s) and not C(s,Q). An example: let's say there is a small viral scaled MinHash s with only one hash. If gather was implemented with C(s, Q) and the hash is in Q, then s would be the best match (because C(s, Q) = 1). But maximizing C(Q, s) is really finding the largest intersection between the query and a signature in an index, and so if there is a larger scaled MinHash s' with smaller C(s', Q) but larger C(Q, s'), then it is going to be selected (and its hashes will be removed from the query for the next round).

TODO

  • [ ] More tests, I'm not so sure current tests are checking for corner cases...

Checklist

  • [ ] Is it mergeable?
  • [ ] make test Did it pass the tests?
  • [ ] make coverage Is the new code covered?
  • [ ] Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
  • [ ] Was a spellchecker run on the source code and documentation after changes were made?

luizirber avatar Aug 02 '20 15:08 luizirber

Codecov Report

Merging #1137 (cf61c97) into latest (eb2b210) will decrease coverage by 0.18%. The diff coverage is 59.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1137      +/-   ##
==========================================
- Coverage   89.58%   89.39%   -0.19%     
==========================================
  Files         122      122              
  Lines       18989    19001      +12     
  Branches     1455     1448       -7     
==========================================
- Hits        17011    16986      -25     
- Misses       1750     1796      +46     
+ Partials      228      219       -9     
Flag Coverage Δ
python 94.68% <94.11%> (-0.12%) :arrow_down:
rust 67.00% <0.00%> (-0.38%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/ffi/nodegraph.rs 0.00% <0.00%> (ø)
src/core/src/sketch/nodegraph.rs 79.51% <0.00%> (-4.37%) :arrow_down:
src/sourmash/sbt.py 81.24% <ø> (-2.72%) :arrow_down:
tests/test_sourmash.py 99.71% <ø> (-0.01%) :arrow_down:
src/sourmash/nodegraph.py 71.05% <50.00%> (-0.77%) :arrow_down:
src/sourmash/sbtmh.py 93.69% <100.00%> (+1.32%) :arrow_up:
tests/test_sbt.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update eb2b210...cf61c97. Read the comment docs.

codecov[bot] avatar Aug 02 '20 15:08 codecov[bot]

ok, this is pretty neat :). On first glance, I agree with your calculations!

ctb avatar Aug 02 '20 16:08 ctb

(I also agree about More Testing Needed for edge cases...)

ctb avatar Aug 02 '20 16:08 ctb

Need more checks, and the machine I ran the benchmarks was running other processes too, but... wow.

image

luizirber avatar Aug 02 '20 16:08 luizirber

I updated the notebook in luizirber/2021-04-17-angular-bound, where I'm doing some analysis on the proposed upper bound for angular similarity. I ended up going with the second approach because it overestimates way less than the first one.

I used hypothesis to generate 5000 examples and then selected cases where the angular similarity was larger than 0.1 and the difference between angular similarity and the upper bound was larger than 0.1. When I plot an histogram for the first approach (a_sq += 1), this is the result: image

and for the second approach (a_sq += abund * abund): image

luizirber avatar Apr 19 '21 17:04 luizirber

@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?

ctb avatar Aug 21 '22 14:08 ctb