server icon indicating copy to clipboard operation
server copied to clipboard

MDEV-33408 Update insert algorithm for HNSW index

Open HugoWenTD opened this issue 11 months ago • 2 comments

Description

Fix two issues in HNSW insert algorithm to improve the recall rate, according to the paper of HNSW https://arxiv.org/pdf/1603.09320.pdf. There are other parameters could be adjusted to have better recall, but only focus on following improvements in this PR.

Issue 1:

Based on the Algorithm 1 from the HNSW paper, when a new element q is inserted into the HNSW graph, the algorithm establishes bidirectional connections between q and its selected neighbors at each layer. After adding these connections, the algorithm checks if any of the neighbors have exceeded the maximum number of allowed connections (M_max) at that layer.

If a neighbor e has more than M_max connections at a particular layer lc, the algorithm shrinks the neighbor's connections by selecting a subset of M_max neighbors using either Algorithm 3 or Algorithm 4 from the paper. This subset becomes the new neighborhood for e at layer lc.

It's important to note that when shrinking the connections of a neighbor e, the algorithm only updates e's neighbor list by removing some of the outgoing connections from e. However, it does not remove the incoming connections to e from its previous neighbors. In other words, if e was previously connected to another element c, the algorithm removes the connection from e to c, but it does not remove the connection from c back to e.

This asymmetric removal of connections helps maintain the navigability of the HNSW graph while reducing the overall number of connections for efficient search and retrieval.

Now align 'update_neighbours' with Algorithm 1 (HNSW paper). When shrinking neighbor e's connections at layer lc due to exceeding M_max, remove only e's outgoing connections, not incoming ones.

Note: 'update_neighbours' seems not very efficient (flame graph analysis). Probably should consider refining it later.

Issue 2:

As per the paper, the maximum number of neighbors per layer (Mmax) for layer 0 should be set differently. It is recommended to set Mmax0 (the maximum connections for layer 0) to 2 * Mmax for better search performance and memory efficiency.

How can this PR be tested?

After fixing above two issues, the recall rate shows great improvement.

Using ann-benchmark tool and dataset of random-xs-20-euclidean:

  • Before this change:

    • total insert time: 48s
    • recall: 0.188
    • QPS: 1945
  • With the fix of issue 1:

    • total insert time: 92s
    • recall: 0.603
    • QPS: 1513
  • With the fix of both issue 1 and issue 2:

    • total insert time: 128s
    • recall: 0.737
    • QPS: 1270

You can observe that the insert time and search time have increased as expected, since it maintains more connections and prevents isolating elements within the graph.

The MTR for vector search passes:

==============================================================================

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[01] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[01] mysql-test-run: WARNING: running this script as _root_ will cause some tests to be skipped
main.vector                              [ pass ]     24
***Warnings generated in error logs during shutdown after running tests: main.vector

Warning: Memory not freed: 59736

--------------------------------------------------------------------------

HugoWenTD avatar Mar 27 '24 00:03 HugoWenTD

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Mar 27 '24 00:03 CLAassistant

In addition, attach the flame graphs for reference ( generated by tool in my PR of Support vector ANN search benchmarking ):

Insert:

perf data inserting 2024-03-26-15-26-48

Search:

perf data searching 2024-03-26-15-26-48

HugoWenTD avatar Mar 27 '24 00:03 HugoWenTD

obsolete, fixed independently

vuvova avatar Jul 19 '24 06:07 vuvova