knowhere
knowhere copied to clipboard
Implement brute force search in hnsw
Milvus introduces the concept of a bitmask to the filtering mechanism to keep similar vectors with a bitmask of 1 based on satisfying specific attribute filters.
Currently knowhere will accept a parameter bitmask when doing hnsw search. When the proportion of valid data marked in the bitmask is small, the performance of hnsw continues to degrade, which is slower than brute force search.
In the above figure, the X-axis represents the number of valid data in 1 million rows of data
The hnsw index structure contains raw vector data, so it is feasible to add a brute force search process to hnsw.
Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.
From the picture, when 1 million data is filtered out more than 960000, bruteforce search will be faster than HNSW search. I have two considerations:
- Whether the HNSW processing can be optimized in this case? In this case, the processing logic of HNSW vs. bruteforce is the random access vs. continuous access in RAM?
- The situation tha filtering out more than 95% is an extreme situation or occurs frequently; if it actually happens rarely, for example, only appears once every 1 million runs, which feels negligible. So here we keep an eye on this issue, will follow up to inverstigate on how to solve it.
test config: M : 8 efConstruction : 20 dim : 128 data : random ef : 10 topK: 10
Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.
Hi @czs007 , I don't think it's a good idea to add brute force search logic into HNSW. I believe each index algorithm has its applicable scenario, user should analyze their scenario then decide to use which index type. For this case, if HNSW run slower than brute force when bitset filtering out 95% vectors, segcore can handle this judgment and use IDMAP instead of HNSW. This logic should not be added into HNSW algorith.