sourmash
[WIP] Remove min_n_below from search code
There are three main cases when searching an index in sourmash (I'll use `Q` as the query signature and `s` as a signature/internal node under analysis):

Case 1: similarity searches

We want to find any signature in a collection that is above a similarity threshold: `J(Q, s) = |Q ∩ s| / |Q ∪ s|`. Since `|Q| <= |Q ∪ s|`, we can use `|Q ∩ s| / |Q|` as an upper bound (it can overestimate the similarity, but never underestimate it).
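A minimal sketch of why this bound is safe, using plain Python sets in place of (scaled) MinHash sketches (illustrative only, not sourmash's implementation):

```python
def jaccard(q, s):
    """Exact Jaccard similarity between two hash sets."""
    return len(q & s) / len(q | s)

def jaccard_upper_bound(q, s):
    """Overestimate of Jaccard using |Q| instead of |Q ∪ s|.

    Since |Q| <= |Q ∪ s|, this can only overestimate, never
    underestimate, so it is safe for pruning during search.
    """
    return len(q & s) / len(q)

q = {1, 2, 3, 4}
s = {3, 4, 5, 6, 7}
assert jaccard(q, s) == 2 / 7            # |{3,4}| / |{1..7}|
assert jaccard_upper_bound(q, s) == 0.5  # |{3,4}| / |{1,2,3,4}|
assert jaccard_upper_bound(q, s) >= jaccard(q, s)
```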
Case 2: containment searches

We want to find which signatures are above a containment threshold to the query (with `sourmash search --containment`), or which one is the best match (maximizes the query containment in the signature), as in `sourmash gather`: `C(Q, s) = |Q ∩ s| / |Q|`. As in case 1, we only need `|Q|`.
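To see why `|Q|` alone suffices for pruning, here is a set-based sketch (illustrative; `node_bound` and the set representation are not sourmash's actual code). At an internal node, the union of the hashes below it is available, and any leaf's intersection with `Q` is a subset of the node's:

```python
def containment(q, s):
    """C(Q, s) = |Q ∩ s| / |Q| -- containment of the query in s."""
    return len(q & s) / len(q)

def node_bound(q, node_union):
    # The node stores (an approximation of) the union of all hashes
    # in its subtree, so |Q ∩ leaf| <= |Q ∩ node| for every leaf
    # below; dividing by the known |Q| bounds C(Q, leaf) from above.
    return len(q & node_union) / len(q)

q = {1, 2, 3, 4}
leaf = {3, 4, 5}
node_union = leaf | {6, 7}  # union over the whole subtree
assert node_bound(q, node_union) >= containment(q, leaf)
```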
Case 3: reverse containment searches

We want to find which signatures are above a containment threshold to the query, like case 2, but maximizing the signature containment (not the query containment): `C(s, Q) = |Q ∩ s| / |s|`.

This is the only case where `min_n_below` is necessary, to bound `|s|`, and it is not used anywhere in sourmash. Because the one place we really need `min_n_below` is not a sourmash use case, the consequence is that `min_n_below` is too pessimistic when pruning subtrees during search, and it ends up making search slower.
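To make the case-3 bound concrete, here is a set-based sketch (illustrative, not sourmash's implementation): at an internal node, `min_n_below` (the smallest signature size in the subtree) bounds `|s|` from below, so dividing the node-level intersection by it overestimates `C(s, Q)` for every leaf below:

```python
def reverse_containment(q, s):
    """C(s, Q) = |Q ∩ s| / |s| -- containment of s in the query."""
    return len(q & s) / len(s)

def node_reverse_bound(q, node_union, min_n_below):
    # |Q ∩ leaf| <= |Q ∩ node| and |leaf| >= min_n_below, so the
    # largest possible numerator over the smallest possible
    # denominator overestimates C(leaf, Q) for every leaf below.
    return len(q & node_union) / min_n_below

q = {1, 2, 3}
leaves = [{1, 2, 9}, {2, 3, 4, 5}]
node_union = set().union(*leaves)
min_n_below = min(len(leaf) for leaf in leaves)

for leaf in leaves:
    assert node_reverse_bound(q, node_union, min_n_below) >= reverse_containment(q, leaf)
```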
This all came from the realization that I was thinking about `gather` backwards. The best match in `gather` maximizes `C(Q, s)`, not `C(s, Q)`. An example: say there is a small viral scaled MinHash `s` with only one hash. If `gather` were implemented with `C(s, Q)` and that hash is in `Q`, then `s` would be the best match (because `C(s, Q) = 1`). But maximizing `C(Q, s)` really means finding the largest intersection between the query and a signature in the index, so if there is a larger scaled MinHash `s'` with smaller `C(s', Q)` but larger `C(Q, s')`, it will be selected (and its hashes will be removed from the query for the next round).
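The viral-sketch example can be checked with tiny set-based sketches (the sets and the helper `c` here are illustrative, not sourmash code):

```python
def c(a, b):
    """C(a, b) = |a ∩ b| / |a| -- containment of a in b."""
    return len(a & b) / len(a)

Q = set(range(100))                  # query with 100 hashes
s = {42}                             # tiny viral sketch: one hash, fully contained
s2 = set(range(50)) | {1000, 1001}   # larger sketch, partially contained

assert c(s, Q) == 1.0       # s wins if we (wrongly) maximize C(s, Q)...
assert c(Q, s2) > c(Q, s)   # ...but s2 wins under C(Q, s'), the larger overlap
```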
TODO
- [ ] More tests, I'm not so sure current tests are checking for corner cases...
Checklist
- [ ] Is it mergeable?
- [ ] `make test` Did it pass the tests?
- [ ] `make coverage` Is the new code covered?
- [ ] Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- [ ] Was a spellchecker run on the source code and documentation after changes were made?
Codecov Report
Merging #1137 (cf61c97) into latest (eb2b210) will decrease coverage by 0.18%. The diff coverage is 59.25%.
@@ Coverage Diff @@
## latest #1137 +/- ##
==========================================
- Coverage 89.58% 89.39% -0.19%
==========================================
Files 122 122
Lines 18989 19001 +12
Branches 1455 1448 -7
==========================================
- Hits 17011 16986 -25
- Misses 1750 1796 +46
+ Partials 228 219 -9
| Flag | Coverage Δ |
|---|---|
| python | 94.68% <94.11%> (-0.12%) :arrow_down: |
| rust | 67.00% <0.00%> (-0.38%) :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ |
|---|---|
| src/core/src/ffi/nodegraph.rs | 0.00% <0.00%> (ø) |
| src/core/src/sketch/nodegraph.rs | 79.51% <0.00%> (-4.37%) :arrow_down: |
| src/sourmash/sbt.py | 81.24% <ø> (-2.72%) :arrow_down: |
| tests/test_sourmash.py | 99.71% <ø> (-0.01%) :arrow_down: |
| src/sourmash/nodegraph.py | 71.05% <50.00%> (-0.77%) :arrow_down: |
| src/sourmash/sbtmh.py | 93.69% <100.00%> (+1.32%) :arrow_up: |
| tests/test_sbt.py | 100.00% <100.00%> (ø) |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update eb2b210...cf61c97.
ok, this is pretty neat :). On first glance, I agree with your calculations!
(I also agree about More Testing Needed for edge cases...)
Need more checks, and the machine I ran the benchmarks on was also running other processes, but... wow.
I updated the notebook in luizirber/2021-04-17-angular-bound, where I'm doing some analysis on the proposed upper bound for angular similarity. I ended up going with the second approach because it overestimates way less than the first one.
I used `hypothesis` to generate 5000 examples and then selected the cases where the angular similarity was larger than 0.1 and the difference between the angular similarity and the upper bound was larger than 0.1. When I plot a histogram for the first approach (`a_sq += 1`), this is the result:

[histogram for the first approach]

and for the second approach (`a_sq += abund * abund`):

[histogram for the second approach]
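As a toy reconstruction of the idea (this is not the notebook's code; `bounded_similarity` and the two norm estimates are illustrative assumptions): any estimate that underestimates the norm in the cosine's denominator can only overestimate the angular similarity, so it stays a safe bound. Since abundances are at least 1, counting each hash once (`a_sq += 1`) underestimates the norm more than accumulating the squared abundances actually seen (`a_sq += abund * abund`), which is why the second approach overestimates less.

```python
import math

def angular_similarity(a, b):
    """Exact angular similarity between two abundance dicts (hash -> count)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    cos = min(1.0, dot / (na * nb))
    return 1 - 2 * math.acos(cos) / math.pi

def bounded_similarity(a, b, nb_estimate):
    """Angular similarity with an estimated norm for b.

    Any nb_estimate <= ||b|| can only overestimate the similarity,
    so it is safe for pruning during search.
    """
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    cos = min(1.0, dot / (na * nb_estimate))
    return 1 - 2 * math.acos(cos) / math.pi

a = {1: 3, 2: 1, 3: 2}
b = {2: 4, 3: 1, 4: 2}

# first approach (a_sq += 1): every hash counts as abundance 1
loose = math.sqrt(len(b))
# second approach (a_sq += abund * abund): use the abundances seen,
# giving a tighter (but still safe) norm estimate
tight = math.sqrt(sum(v * v for v in b.values()))

exact = angular_similarity(a, b)
assert bounded_similarity(a, b, loose) >= bounded_similarity(a, b, tight) >= exact
```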
@luizirber is this worth pursuing (or at least keeping the PR up to date :), or has this been superseded by other changes?