hnswlib
hnswlib copied to clipboard
Python: filter elements with an optional filtering function
Summary of the change
Expose filtering functionality introduced in #402 in Python API. Changes are kept to a minimum, only HNSW index is implemented and not brute force.
For a discussion of the interface design see here.
Preliminary performance characteristics for filtering (not strictly related to the changes):
filter | queries per second |
---|---|
0.1 | 4241.43 |
0.2 | 5993.89 |
0.3 | 8232.54 |
0.4 | 10688.06 |
0.5 | 11677.54 |
0.6 | 13242.56 |
0.7 | 14520.50 |
0.8 | 14659.25 |
0.9 | 14932.67 |
none | 503578.34 |
Here filter denotes the fraction of elements the query was restricted to. none denotes that no filtering has been applied.
As per the above table, there is a threshold below which exact ANN is probably preferable.
I did some benchmarks with a 150k vector dataset of 64 dimensions with inner product.(will have to repeat on a public dataset in order to be able to share details). The general finding is, that the introduced filter function only gives competitive results vs. brute force, when only a small fraction of items are excluded from the 150k vector set. So this is an important info that users of the filter function must be aware of, it can be useful, but in some scenarios there is extreme performance degradation.
Hi @gtsoukas, Thank you for the PR! We will review it and test it shortly.
Thank you @yurymalkov ,
it would be great to have this functionality. Here are some benchmarks demonstrating on a public dataset that the filtering functionality for medium to large fractions of the search space clearly outperforms brute force search and a server based solution that supports filtered ANN. https://github.com/gtsoukas/filtered-ann-benchmarks.
Thank you so much for the review @dyashuni!
I don't know what to do about the last comment (line 613). Happy to dig deeper if you could give me a direction.
@gtsoukas Thank you for the updates! Let me discuss with @yurymalkov about the code changes. There are possible options here.
@gtsoukas Thank you!
I made several changes above yours: removed template filter_func_t, added filter to brute force index and updated tests.
I attached the patch file.
Could you please apply it and update the pull request?
git apply patch.txt
patch.txt
Comments are welcome!
@dyashuni, thanks a million times for the patch! I have applied the patch as-is.
All my functional and performance tests look good after application of the patch.
For someone not very familiar with C++, I find that the changes you made make the library easier to use.
I don't have to add anything. It is ready to merge from my side.
One thing that maybe could be improved in the context of this PR but not strictly related, is that the concepts of label, id and index could be sharpened: My understanding is, that when using the Python bindings, we always use indices but never ids or labels. As a first step this could be made more clear by not calling parameters "id" (e.g. line 877 of python_bindings/bindings.py). Ideally, there would be a way to handle actual labels/ids via the Python bindings. For me, the difference between an index on the one side and the label/id on the other side is that the first is restricted to natural integers, ideally without gaps whereas the latter can be of any type. Currently I am maintaining separate index/label mappings, when using the python version of hnswlib. This is however a tiny criticism for an awesome library!
@gtsoukas Thank you!
I’m not sure that I get everything correctly.
There is inconsistency in naming. In hnswalg.h we have "labels". In python_bindings we have "ids". Actually they mean the same.
In python when we add items (add_items
function) we can also specify corresponding labels/ids via ids
parameter. If the ids
parameter is not specified then items get sequential labels/ids (0, 1, 2, …). The labels/ids are used to retrieve the items in get_items
function.
In python when we add items (
add_items
function) we can also specify corresponding labels/ids viaids
parameter. If theids
parameter is not specified then items get sequential labels/ids (0, 1, 2, …). The labels/ids are used to retrieve the items inget_items
function.
Thanks for explaining @dyashuni !
I was using the id
parameter of the Python add_item
method but with non-integer ids and received an error (ValueError: invalid literal for int() with base 10: ' 58725ab=>'
). From this I probably jumped to the wrong conclusion that hnswlib (python version) does not handle ids whereas it actually handles ids as long as they are integers.
Best forget my comment, as it is unrelated to this pull request anyway and I will write a new issue once I have somethin more concrete.
@gtsoukas Got it, thank you! Yes, we now support only integer ids. If you would like to use non-integer ids in your applications you can create mapping between non-integer ids to integer ids, and use the latter in hnswlib. It is not very convenient of course.