Annoy parameter considerations for getting the best search match
Hello,
In our use case, we are testing Annoy on indexes of different sizes, from 1K to 2M vectors. We are using two methods:
- a.build(n_trees)
- a.get_nns_by_vector(v, n, search_k, include_distances)
We did some tests while setting different params:
- n_trees = [50, 150]
- search_k = [-1 (default), 5000, 15000, 25000, 50000] (when search_k is fixed, n, the number of approximate nearest neighbours, is set to 100)
- n (approximate nearest neighbours) = [100, 130, 150, .. 200, ..., 400, ...] (in this scenario search_k was left at its default, so it was influenced by n)
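As a side note on the default: the Annoy README states that when search_k is not provided it defaults to n * n_trees. The sketch below only does that arithmetic for some of the combinations listed above; the helper name `effective_search_k` is hypothetical and not part of Annoy.

```python
def effective_search_k(n: int, n_trees: int, search_k: int = -1) -> int:
    """Search budget per query, following the default stated in the README."""
    # Assumption: -1 means "not provided", in which case the README says
    # search_k defaults to n * n_trees.
    return n * n_trees if search_k == -1 else search_k


# Default budgets for a few (n_trees, n) combinations from the test grid.
for n_trees in (50, 150):
    for n in (100, 200, 400):
        budget = effective_search_k(n, n_trees)
        print(f"n_trees={n_trees} n={n} -> default search_k={budget}")
```

By this arithmetic, a fixed search_k of 5000 with n = 100 equals the default budget at 50 trees, but is a third of the default budget at 150 trees.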
We ran a few tests across permutations of these params.
We saw that sometimes the expected result was not among the values returned. In other words, there was a closer match that we didn’t get.
For example, the docs say that the higher n_trees when building the index, the merrier (let's assume I have enough disk and memory), but in reality it actually decreased the accuracy of the results we expected.
In addition, the bigger the 'n' and/or 'search_k' values we provided, the better the results we received.
How can we ensure, or at least raise the chances, that the expected result will be returned? What parameter considerations should we take into account? Are they derived from the size of the index? I guess always raising search_k as high as possible is not the correct solution, since the index size changes and the query vector changes (the lower the accuracy, the higher the search_k/n we need to use ...).
Thanks a lot
bump
My suggestion is to set n_trees as high as you can, as long as you can afford the build time and the index still fits in RAM. Then set search_k as a tradeoff between recall and query time: higher search_k will improve recall, at the cost of longer search times.
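One way to pick search_k along these lines is to measure recall against an exact brute-force search on a sample of queries. The following is a minimal sketch, not a definitive method: the helper names (`brute_force_nns`, `recall_at_n`, `sweep_annoy`), the random data, the "euclidean" metric, and the chosen sizes are all assumptions for illustration. It needs Python 3.8+ for `math.dist`, and the annoy package for `sweep_annoy`.

```python
import math
import random


def brute_force_nns(vectors, query, n):
    """Exact n nearest neighbours by Euclidean distance (ground truth)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return order[:n]


def recall_at_n(index, vectors, queries, n, search_k=-1):
    """Average fraction of the exact top-n that `index` returns."""
    total = 0.0
    for q in queries:
        truth = set(brute_force_nns(vectors, q, n))
        found = set(index.get_nns_by_vector(q, n, search_k=search_k))
        total += len(truth & found) / n
    return total / len(queries)


def sweep_annoy(dim=32, n_items=2000, n_trees=50, n=10):
    """Build an Annoy index on random data and print recall per search_k."""
    from annoy import AnnoyIndex  # third-party: pip install annoy

    rng = random.Random(0)
    vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_items)]
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(20)]

    index = AnnoyIndex(dim, "euclidean")
    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(n_trees)

    # -1 lets Annoy pick its default; the other values are arbitrary examples.
    for search_k in (-1, 1000, 5000, 25000):
        recall = recall_at_n(index, vectors, queries, n, search_k)
        print(f"search_k={search_k} recall@{n}={recall:.3f}")
```

Calling `sweep_annoy()` prints one recall figure per search_k value. Running the same sweep on a sample of your own vectors and queries, and timing each setting, would show where extra search_k stops buying recall for your index size.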