knowhere icon indicating copy to clipboard operation
knowhere copied to clipboard

Implement brute force search in hnsw

Open czs007 opened this issue 2 years ago • 3 comments

Milvus introduces the concept of a bitmask to the filtering mechanism to keep similar vectors with a bitmask of 1 based on satisfying specific attribute filters.

Currently knowhere will accept a parameter bitmask when doing hnsw search. When the proportion of valid data marked in the bitmask is small, the performance of hnsw continues to degrade, which is slower than brute force search.

image In the above figure, the X-axis represents the number of valid data in 1 million rows of data

The hnsw index structure contains raw vector data, so it is feasible to add a brute force search process to hnsw.

Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.

czs007 avatar Jul 12 '22 06:07 czs007

From the picture, when 1 million data is filtered out more than 960000, bruteforce search will be faster than HNSW search. I have two considerations:

  1. Whether the HNSW processing can be optimized in this case? In this case, the processing logic of HNSW vs. bruteforce is the random access vs. continuous access in RAM?
  2. The situation tha filtering out more than 95% is an extreme situation or occurs frequently; if it actually happens rarely, for example, only appears once every 1 million runs, which feels negligible. So here we keep an eye on this issue, will follow up to inverstigate on how to solve it.

landyliu avatar Jul 12 '22 07:07 landyliu

test config: M : 8 efConstruction : 20 dim : 128 data : random ef : 10 topK: 10

foxspy avatar Jul 13 '22 01:07 foxspy

Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.

Hi @czs007 , I don't think it's a good idea to add brute force search logic into HNSW. I believe each index algorithm has its applicable scenario, user should analyze their scenario then decide to use which index type. For this case, if HNSW run slower than brute force when bitset filtering out 95% vectors, segcore can handle this judgment and use IDMAP instead of HNSW. This logic should not be added into HNSW algorith.

cydrain avatar Jul 14 '22 06:07 cydrain