
Making it go fast for high-volume queries

Open · FlimFlamm opened this issue 1 year ago · 0 comments

Looking for any pointers/advice/best practices for my use case:

Large Annoy index (100 GB+), high-frequency lookups, best or near-best accuracy required (wherever diminishing returns start to show, I guess, which right now seems to be around search_k=30_000 for 10M items, each with 3500 components).

Essentially I need to sequentially look up every item in the index, non-stop and as fast as possible. At my desired search_k value, the performance hit is starting to hurt.
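
For context, a minimal sketch of that loop with the Annoy Python API (the index filename, neighbor count, and angular metric are assumptions; the dimensionality and search_k come from the numbers above):

```python
# Minimal sketch of the sequential lookup loop, assuming an already-built
# index saved as "items.ann" with angular vectors (both are assumptions).
from annoy import AnnoyIndex

DIM = 3500          # components per item, from the description above
SEARCH_K = 30_000   # the value where diminishing returns start to show

index = AnnoyIndex(DIM, "angular")
index.load("items.ann")  # memory-maps the file; opening is cheap, pages load lazily

for item_id in range(index.get_n_items()):
    # 10 neighbors per item is a placeholder; search_k trades speed for accuracy
    nns = index.get_nns_by_item(item_id, 10, search_k=SEARCH_K)
```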

Side question: If I were to build another Annoy index with as many trees (n_trees) as I can fit in memory/disk, would this significantly reduce the search_k I need to get similar results? Edit: answer: possibly yes, at least a bit.
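
A rough sketch of that rebuild (n_trees=100 and the filename are illustrative, not values I've settled on). Annoy's default heuristic is search_k ≈ n_trees × n when search_k isn't set, so more trees generally lets a lower search_k reach similar recall, at the cost of a bigger index and a longer build:

```python
# Sketch of rebuilding with more trees; n_trees and the filename are illustrative.
from annoy import AnnoyIndex

DIM = 3500
index = AnnoyIndex(DIM, "angular")
index.on_disk_build("items_more_trees.ann")  # build straight to disk for 100 GB+ indexes

for item_id, vector in enumerate(iter_vectors()):  # iter_vectors() is a placeholder
    index.add_item(item_id, vector)

index.build(100, n_jobs=-1)  # more trees: bigger index, lower search_k for similar recall
```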

NOTE: currently, multiprocessing across two-thirds of my cores appears to be fastest, which I suspect is due to I/O wait times...

Tertiary question: Is a shared-memory approach, with one index in memory and many processes accessing it, achievable or useful?
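
On that question: since load() memory-maps the index file, separate processes that load() the same file should share it through the OS page cache rather than each holding a private 100 GB copy. A sketch of that pattern (the path, neighbor count, and angular metric are assumptions):

```python
# Sketch of sharing one on-disk index across worker processes. Because
# AnnoyIndex.load() memory-maps the file, the OS page cache is shared, so
# N workers do not need N separate in-RAM copies of the index.
from multiprocessing import Pool
from annoy import AnnoyIndex

DIM = 3500
INDEX_PATH = "items.ann"   # assumed path
SEARCH_K = 30_000

_index = None  # one handle per worker process

def _init_worker():
    global _index
    _index = AnnoyIndex(DIM, "angular")
    _index.load(INDEX_PATH)  # mmap; prefault=True would page everything in up front

def query(item_id):
    return item_id, _index.get_nns_by_item(item_id, 10, search_k=SEARCH_K)

if __name__ == "__main__":
    probe = AnnoyIndex(DIM, "angular")
    probe.load(INDEX_PATH)
    n_items = probe.get_n_items()
    probe.unload()

    with Pool(initializer=_init_worker) as pool:
        for item_id, nns in pool.imap_unordered(query, range(n_items), chunksize=1000):
            pass  # consume results here
```

(prefault=True in load() touches every page at startup, which might help if paying the I/O cost once up front beats paying it during the first pass of queries.)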

Quaternary question: Is there a fastest metric? Edit: answer: yes. In my case Hamming turned out to be fastest and, counterintuitively, the most accurate by a keyword-based metric (although if I normalize my continuous vectors beforehand, which should break Hamming, the build time itself seems to increase drastically).
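
For reference, a sketch of the Hamming setup. Annoy's "hamming" metric expects binary 0/1 vectors, so the binarization step below (thresholding each component at its median) is an assumed preprocessing choice, not something fixed by my setup:

```python
# Sketch of a Hamming-metric index over binarized vectors. The random data
# and median thresholding are placeholders for the real pipeline.
import numpy as np
from annoy import AnnoyIndex

DIM = 3500

vectors = np.random.rand(1000, DIM).astype(np.float32)  # stand-in for real data
thresholds = np.median(vectors, axis=0)                  # per-component threshold
binary = (vectors > thresholds).astype(np.uint8)         # 0/1 vectors for hamming

index = AnnoyIndex(DIM, "hamming")
for item_id, vec in enumerate(binary):
    index.add_item(item_id, vec.tolist())

index.build(50, n_jobs=-1)
index.save("items_hamming.ann")
```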

FlimFlamm · Aug 19, 2024