MDEV-33408 Alter HNSW graph storage and fix memory leak
Description
This commit changes the way HNSW graph information is stored in the second table. Instead of storing connections as separate records, it now stores neighbors for each node, leading to significant performance improvements and storage savings.
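As a rough illustration (a minimal sketch only, not the actual schema or code in this commit; field and struct names below are made up), the change amounts to going from one record per connection to one record per node with a packed neighbor list:

```cpp
// Hypothetical sketch contrasting the two storage shapes discussed in this PR.
#include <cstdint>
#include <vector>

// Old shape: one record per directed connection (layer, src, dst).
struct ConnectionRecord
{
  uint8_t  layer;
  uint64_t src_ref;   // row reference of the source node
  uint64_t dst_ref;   // row reference of one neighbor
};

// New shape: one record per node and layer, with all neighbors packed
// together (layer, src, neighbors[]), e.g. serialized into a single blob.
struct NodeRecord
{
  uint8_t  layer;
  uint64_t src_ref;
  std::vector<uint64_t> neighbor_refs;  // all neighbors of src_ref on this layer
};
```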
Compared with the previous approach, insert speed is 5 times faster, search speed improves by 23%, and storage usage is reduced by 73%, based on ann-benchmark tests with the random-xs-20-euclidean and random-s-100-euclidean datasets.
Additionally, in the previous code, vector objects were not released after use, resulting in excessive memory consumption (over 20GB for building the index with 90,000 records) and preventing tests with large datasets. This commit ensures that vectors are released appropriately in the insert and search functions. Note that some vectors still need to be cleaned up after search query completion; this will be addressed in a future commit.
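To illustrate the release pattern (a minimal sketch only; the loader and distance function below are hypothetical stand-ins, not the server code), the idea is to bound each loaded vector's lifetime to the iteration that uses it instead of keeping every vector alive for the whole statement:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Vector { std::vector<float> data; };

// Stand-in for reading one vector from the index table.
static std::unique_ptr<Vector> load_vector(float seed)
{
  return std::unique_ptr<Vector>(new Vector{{seed, seed + 1.0f, seed + 2.0f}});
}

// Squared Euclidean distance between two vectors of equal length.
static float distance(const Vector &a, const Vector &b)
{
  float d= 0.0f;
  for (std::size_t i= 0; i < a.data.size(); i++)
    d+= (a.data[i] - b.data[i]) * (a.data[i] - b.data[i]);
  return d;
}

static float nearest_distance(const Vector &query, int n_neighbors)
{
  float best= 1e30f;
  for (int i= 0; i < n_neighbors; i++)
  {
    // The loaded vector is released at the end of each iteration instead of
    // accumulating until the query finishes.
    std::unique_ptr<Vector> v= load_vector((float) i);
    best= std::min(best, distance(query, *v));
  }
  return best;
}
```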
How can this PR be tested?
Comparing the performance against the previous method of saving connections, using the ann-benchmark tool, a significant improvement in insert speed, search speed, and storage size can be seen.
To isolate the impact of the storage structure on performance, I kept the HNSW parameters consistent in the commit. The only different parameter is EF_SEARCH=10, which could help improve the recall. The "new" version is based on the commit in this pull request, and the "old" version is commit 00b613a2, which includes my fix from a previous pending PR.
With the random-xs-20-euclidean dataset:
| Metric | old | new | change |
| --- | --- | --- | --- |
| Insert time for 9000 records | 109.41 sec | 22.17 sec | 5 times faster |
| K-ANN search recall | 0.737 | 0.782 | +6% |
| K-ANN search QPS | 687.526 | 843.368 | +22% |
| Second table size (data + index) | 8.1M | 2.16M | -73% |

New second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 1.9M Apr 12 20:59 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 272K Apr 12 20:59 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 20:58 t1.frm
-rw-rw---- 1 wenhug amazon 809K Apr 12 20:59 t1.MYD
-rw-rw---- 1 wenhug amazon 93K Apr 12 20:59 t1.MYI

Old second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 5.0M Apr 12 17:16 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 3.1M Apr 12 17:17 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:15 t1.frm
-rw-rw---- 1 wenhug amazon 809K Apr 12 17:16 t1.MYD
-rw-rw---- 1 wenhug amazon 93K Apr 12 17:17 t1.MYI
With the random-s-100-euclidean dataset:
| Metric | old | new | change |
| --- | --- | --- | --- |
| Insert time for 90000 records | 1328.48 sec | 259.15 sec | 5 times faster |
| K-ANN search recall | 0.221 | 0.335 | +50% |
| K-ANN search QPS | 539.019 | 668.827 | +24% |
| Second table size (data + index) | 79M | 20.8M | -73% |

New second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 18M Apr 12 17:30 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 2.8M Apr 12 17:33 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:26 t1.frm
-rw-rw---- 1 wenhug amazon 36M Apr 12 17:30 t1.MYD
-rw-rw---- 1 wenhug amazon 905K Apr 12 17:33 t1.MYI

Old second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 46M Apr 12 17:56 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 33M Apr 12 17:59 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:34 t1.frm
-rw-rw---- 1 wenhug amazon 36M Apr 12 17:56 t1.MYD
-rw-rw---- 1 wenhug amazon 905K Apr 12 17:59 t1.MYI
Assigned to @cvicentiu as it is against his branch.
Hi @HugoWenTD!
Thank you for the contribution.
I will address your work in order of appearance:
- The benchmark tool is immensely helpful. I will rebase it at the beginning of bb-11.4-vec-vicentiu for ease of use for all of us. We will have to clean it up a bit, and perhaps it should not reside in the MariaDB Server repo directly, but that's a problem for another time, not relevant during this development effort.
In any case, I have used it to double-check your findings, and I can reproduce everything (except the results for this particular PR, but more on this later).
- The "bugfix" you suggested on only deleting one-directional connections: (PR: #3156) I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.
My understanding (at this point in time) is that removing only a subset of links, as you've done in #3156, causes a soft increase in MAX_NEIGHBORS_PER_LAYER. The same effect can be achieved by just bumping the MAX_NEIGHBORS_PER_LAYER variable.
So I am inclined to reject that bugfix and instead focus on tuning the graph construction parameters.
- Here is a set of results I got using the following constants, from the commit 2f6b2b3b7d582b9a483a0ab98448a9ad745d3e51:
const uint EF_CONSTRUCTION = 32; // max candidate list size to connect to.
const uint MAX_INSERT_NEIGHBOUR_CONNECTIONS = 100;
const uint MAX_NEIGHBORS_PER_LAYER = 100;
const double NORMALIZATION_FACTOR = 1.2;
These results are without any of your changes from PR #3156:
0: MariaDB(m=16, ef_construction=200, ef_search=10) 0.906 1035.211
So we can achieve 90% recall by having more neighbors per layer and bumping ef_construction. Of course, the QPS drops as the graph is bigger, but as you've noticed, the code was initially written for correctness, not for speed. Once we've proven correctness, we can look into speed improvements.
- I tried your code from this PR (437e21444ceec6992cd35f4449632a5dec07a9bf). I get the following results, which I believe are due to a bug somewhere:
0: MariaDB(m=16, ef_construction=200, ef_search=10) 0.005 5095.530
Can you double check your findings and see if you've submitted the correct code to this PR?
In general I agree with your approach of using an adjacency list (layer, src, neighbors[]) instead of tuples (layer, src, dst).
This approach has some benefits for sure:
- The number of rows in the table is drastically reduced (reducing index size).
- There is less data duplication in the table (again reducing index size).
- We are more memory friendly and can thus run searches faster. (I believe we should optimize for SELECT, not for INSERT/UPDATE/DELETE, but of course we cannot completely slow down INSERTs.)
- Because less data is written to disk, INSERTs should be faster (as your benchmark shows).
- The adjacency list approach removes a lot of ha_index_next calls because it loads all neighbours in memory in one lookup. That's great!

The downsides:
- Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max, for instance).
- Delete may be trickier to implement. I had started working on delete with my version of the code (layer, src, dst), but I think that your solution has more promise. The way I'm implementing delete is by using a "deleted" bit for a node. If the node is deleted, we skip it during all iterations/searches. I can probably adapt the solution to work with your code. I'll try this right now and share the result as soon as it's working.
- Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those; a quick and dirty solution, just to map to your already-written benchmark code (see the sketch after this list for one possible shape). I noticed you have some commented-out SET commands that you wanted to use:
ann_benchmarks/algorithms/mariadb/module.py
288: #self._cur.execute("SET hnsw.ef_search = %d" % ef_search)
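For illustration only, here is a hypothetical sketch of plugin-style system variables for these parameters (variable names, defaults, and ranges are made up; if the HNSW code lives in the server core rather than a plugin, the server-side Sys_var_* machinery would be used instead):

```cpp
#include <mysql/plugin.h>

// Hypothetical runtime-tunable HNSW construction/search parameters.
static unsigned int hnsw_max_neighbors_per_layer;
static unsigned int hnsw_ef_search;

static MYSQL_SYSVAR_UINT(max_neighbors_per_layer, hnsw_max_neighbors_per_layer,
                         PLUGIN_VAR_RQCMDARG,
                         "Maximum number of neighbors kept per node per layer",
                         NULL, NULL, 100, 1, 10000, 0);

static MYSQL_SYSVAR_UINT(ef_search, hnsw_ef_search, PLUGIN_VAR_RQCMDARG,
                         "Candidate list size used during K-ANN search",
                         NULL, NULL, 10, 1, 10000, 0);

// Registered via the plugin's system_vars list.
static struct st_mysql_sys_var *hnsw_system_variables[]=
{
  MYSQL_SYSVAR(max_neighbors_per_layer),
  MYSQL_SYSVAR(ef_search),
  NULL
};
```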
That's all the thoughts I have for now. Let me know when you've double-checked the benchmark results for your code, and in case I messed up when running it, I'll fix it.
@HugoWenTD I've done a little bit more experimenting with your bugfix from #3156. For certain combinations of parameters, it does seem to offer a marginal improvement in both QPS and recall, but I'd like to test this more systematically.
Hi @cvicentiu, thank you for reviewing the pull requests and providing feedback on the benchmarking tool. I really appreciate your input.
I'm traveling for meetings this week, but I will try to address the items you mentioned later this week or next week. You made an important point that mMax (max neighbours) affects the update speed when using node->neighbours[]. I believe I did run tests with some bigger mMax values, but for the comparison with the previous implementation I only recorded the results for the default parameters. I will gather more data on the impact of mMax and get back to you.
Regarding your comment:
> I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.
Here are a few points supporting the use of one-directional links:
- Algorithm 1 in the paper describes the shrink step: at line 16, "set neighbourhood(e) at layer lc to eNewConn" shrinks only the one-directional connection, and it does not mention deleting the connection in the opposite direction. (A minimal sketch of this one-directional shrink follows after this list.)
- In addition, PgVector's shrinking step is in this section; it also shrinks only one direction.
- In the worst case, there could be a scenario where vectors get isolated:
- Assume mMax-level0 is 4
- When vector 5 (V5) is inserted, V4 gets isolated.
- If V4 happens to be the entry point of a higher level, none of V1/V2/V3/V5 will ever be traversed in a future insert or search.
- While increasing mMax could mitigate this, it still reduces the navigability of the graph. In the benchmark tests, I did see lower recall caused by fewer vectors being returned than the specified limit.
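Here is a minimal sketch of how I read the line-16 shrink step (illustrative names only, not MariaDB code, with a trivial stand-in for SELECT-NEIGHBORS): only e's own neighbour list is rewritten, and the reverse links from the dropped neighbours back to e are left untouched.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node { /* vector data, row reference, ... */ };
using Neighbours = std::vector<Node*>;

// Trivial stand-in for SELECT-NEIGHBORS; the real heuristic picks by distance.
static Neighbours select_neighbours(const Neighbours &candidates,
                                    std::size_t m_max)
{
  return Neighbours(candidates.begin(),
                    candidates.begin() +
                    std::min(m_max, candidates.size()));
}

// Algorithm 1, line 16: "set neighbourhood(e) at layer lc to eNewConn".
// Only e's adjacency list shrinks; no connection in the opposite direction
// (from a dropped neighbour back to e) is deleted here.
static void shrink_if_needed(Neighbours &e_conn, std::size_t m_max)
{
  if (e_conn.size() > m_max)
    e_conn= select_neighbours(e_conn, m_max);
}
```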
> Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those. A quick and dirty solution, just to map with your already written benchmark code. I noticed you have some commented out SET commands that you wanted to use:
Hi @cvicentiu, I just created PR https://github.com/MariaDB/server/pull/3226, targeting your newly created branch bb-11.4-vec-vicentiu-hugo, to configure the parameters through system variables. Please be aware that the modification in ann-benchmark is not backward compatible, so I have created a new branch to accommodate this change.
> Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max for instance).
Sharing some benchmark results with different m values, using DATASET=random-s-100-euclidean. The insert time and search time increased as expected, but they are still acceptable I think. I will test some more combinations with different ef_construction values.
0: MariaDB(m=100, ef_construction=10, ef_search=10)
recall: 0.846
QPS: 669.383
Insert time: 252.37262177467346
1: MariaDB(m=50, ef_construction=10, ef_search=10)
recall: 0.825
QPS: 678.479
Insert time: 258.6093146800995
2: MariaDB(m=10, ef_construction=10, ef_search=10)
recall: 0.335
QPS: 1052.397
Insert time: 192.79994201660156
Partially included in e317023; the rest is obsolete.