MDEV-33408 Alter HNSW graph storage and fix memory leak
Description
This commit changes the way HNSW graph information is stored in the second table. Instead of storing connections as separate records, it now stores neighbors for each node, leading to significant performance improvements and storage savings.
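As a rough illustration (a minimal sketch only, not the actual schema or code in this commit; field and struct names below are made up), the change amounts to going from one record per connection to one record per node with a packed neighbor list:

```cpp
// Hypothetical sketch contrasting the two storage shapes discussed in this PR.
#include <cstdint>
#include <vector>

// Old shape: one record per directed connection (layer, src, dst).
struct ConnectionRecord
{
  uint8_t  layer;
  uint64_t src_ref;   // row reference of the source node
  uint64_t dst_ref;   // row reference of one neighbor
};

// New shape: one record per node and layer, with all neighbors packed
// together (layer, src, neighbors[]), e.g. serialized into a single blob.
struct NodeRecord
{
  uint8_t  layer;
  uint64_t src_ref;
  std::vector<uint64_t> neighbor_refs;  // all neighbors of src_ref on this layer
};
```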
Compared with the previous approach, insert speed is 5 times faster, search speed improves by 23%, and storage usage is reduced by 73%, based on ann-benchmark tests with the random-xs-20-euclidean and random-s-100-euclidean datasets.
Additionally, in the previous code, vector objects were not released after use, resulting in excessive memory consumption (over 20GB for building the index with 90,000 records) and preventing tests with large datasets. This commit ensures that vectors are released appropriately in the insert and search functions. Note that some vectors still need to be cleaned up after search query completion; this will be addressed in a future commit.
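To illustrate the release pattern (a minimal sketch only; the loader and distance function below are hypothetical stand-ins, not the server code), the idea is to bound each loaded vector's lifetime to the iteration that uses it instead of keeping every vector alive for the whole statement:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Vector { std::vector<float> data; };

// Stand-in for reading one vector from the index table.
static std::unique_ptr<Vector> load_vector(float seed)
{
  return std::unique_ptr<Vector>(new Vector{{seed, seed + 1.0f, seed + 2.0f}});
}

// Squared Euclidean distance between two vectors of equal length.
static float distance(const Vector &a, const Vector &b)
{
  float d= 0.0f;
  for (std::size_t i= 0; i < a.data.size(); i++)
    d+= (a.data[i] - b.data[i]) * (a.data[i] - b.data[i]);
  return d;
}

static float nearest_distance(const Vector &query, int n_neighbors)
{
  float best= 1e30f;
  for (int i= 0; i < n_neighbors; i++)
  {
    // The loaded vector is released at the end of each iteration instead of
    // accumulating until the query finishes.
    std::unique_ptr<Vector> v= load_vector((float) i);
    best= std::min(best, distance(query, *v));
  }
  return best;
}
```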
How can this PR be tested?
Comparing the performance against the previous method of saving connections, using the ann-benchmark tool, a significant improvement in insert speed, search speed, and storage size can be seen.
To isolate the impact of the storage structure on performance, I kept the HNSW parameters consistent in the commit. The only different parameter is EF_SEARCH=10, which could help improve the recall. The "new" version is based on the commit in this pull request, and the "old" version is commit 00b613a2, which includes my fix from a previous pending PR.
With the random-xs-20-euclidean dataset:
| Metric | old | new | change |
| --- | --- | --- | --- |
| Insert time for 9000 records | 109.41 sec | 22.17 sec | 5 times faster |
| K-ANN search recall | 0.737 | 0.782 | +6% |
| K-ANN search QPS | 687.526 | 843.368 | +22% |
| Second table size (data + index) | 8.1M | 2.16M | -73% |

New second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 1.9M Apr 12 20:59 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 272K Apr 12 20:59 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 20:58 t1.frm
-rw-rw---- 1 wenhug amazon 809K Apr 12 20:59 t1.MYD
-rw-rw---- 1 wenhug amazon 93K Apr 12 20:59 t1.MYI

Old second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 5.0M Apr 12 17:16 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 3.1M Apr 12 17:17 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:15 t1.frm
-rw-rw---- 1 wenhug amazon 809K Apr 12 17:16 t1.MYD
-rw-rw---- 1 wenhug amazon 93K Apr 12 17:17 t1.MYI
With the random-s-100-euclidean dataset:
| Metric | old | new | change |
| --- | --- | --- | --- |
| Insert time for 90000 records | 1328.48 sec | 259.15 sec | 5 times faster |
| K-ANN search recall | 0.221 | 0.335 | +50% |
| K-ANN search QPS | 539.019 | 668.827 | +24% |
| Second table size (data + index) | 79M | 20.8M | -73% |

New second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 18M Apr 12 17:30 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 2.8M Apr 12 17:33 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:26 t1.frm
-rw-rw---- 1 wenhug amazon 36M Apr 12 17:30 t1.MYD
-rw-rw---- 1 wenhug amazon 905K Apr 12 17:33 t1.MYI

Old second table data and index sizes:
t1#i#01.*:
-rw-rw---- 1 wenhug amazon 46M Apr 12 17:56 t1#i#01.MYD
-rw-rw---- 1 wenhug amazon 33M Apr 12 17:59 t1#i#01.MYI
-rw-rw---- 1 wenhug amazon 1.5K Apr 12 17:34 t1.frm
-rw-rw---- 1 wenhug amazon 36M Apr 12 17:56 t1.MYD
-rw-rw---- 1 wenhug amazon 905K Apr 12 17:59 t1.MYI
Assigned to @cvicentiu as it is against his branch.
Hi @HugoWenTD!
Thank you for the contribution.
I will address your work in order of appearance:
- The benchmark tool is immensely helpful. I will rebase it at the beginning of bb-11.4-vec-vicentiu for ease of use for all of us. We will have to clean it up a bit, and perhaps it should not reside in the MariaDB Server repo directly, but that's a problem for another time, not relevant during this development effort.
In any case, I have used it to double-check your findings, and I can reproduce everything (except the results for this particular PR, but more on this later).
- The "bugfix" you suggested on only deleting one-directional connections: (PR: #3156) I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.
My understanding (at this point in time) is that removing only a subset of links, as you've done in #3156, causes a soft increase in MAX_NEIGHBORS_PER_LAYER. The same effect can be achieved by just bumping the MAX_NEIGHBORS_PER_LAYER variable.
So I am inclined to reject that bugfix and instead focus on tuning the graph construction parameters.
- Here is a set of results I got using the following constants, from the commit 2f6b2b3b7d582b9a483a0ab98448a9ad745d3e51:
const uint EF_CONSTRUCTION = 32; // max candidate list size to connect to.
const uint MAX_INSERT_NEIGHBOUR_CONNECTIONS = 100;
const uint MAX_NEIGHBORS_PER_LAYER = 100;
const double NORMALIZATION_FACTOR = 1.2;
These results are without any of your changes from PR #3156:
0: MariaDB(m=16, ef_construction=200, ef_search=10) 0.906 1035.211
So we can achieve 90% recall by having more neighbors per layer and bumping ef_construction. Of course, the QPS drops as the graph is bigger, but as you've noticed, the code was initially written for correctness, not for speed. Once we've proven correctness, we can look into speed improvements.
- I tried your code from this PR (437e21444ceec6992cd35f4449632a5dec07a9bf). I get the following results, which I believe are due to a bug somewhere:
0: MariaDB(m=16, ef_construction=200, ef_search=10) 0.005 5095.530
Can you double check your findings and see if you've submitted the correct code to this PR?
In general I agree with your approach of using an adjacency list (layer, src, neighbors[]) instead of tuples (layer, src, dst).
This approach has some benefits for sure:
- The number of rows in the table is drastically reduced (reducing index size).
- There is less data duplication in the table (again reducing index size).
- We are more memory friendly and can thus run searches faster. (I believe we should optimize for SELECT, not for INSERT/UPDATE/DELETE, but of course we cannot completely slow down INSERTs.)
- Because less data is written to disk, INSERTs should be faster (as your benchmark shows).
- The adjacency list approach removes a lot of ha_index_next calls because it loads all neighbours in memory in one lookup. That's great!

The downsides:
- Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max, for instance).
- Delete may be trickier to implement. I had started working on delete with my version of the code (layer, src, dst), but I think that your solution has more promise. The way I'm implementing delete is by using a "deleted" bit for a node. If the node is deleted, we skip it during all iterations/searches. I can probably adapt the solution to work with your code. I'll try this right now and share the result as soon as it's working.
- Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those; a quick and dirty solution, just to map to your already-written benchmark code (see the sketch after this list for one possible shape). I noticed you have some commented-out SET commands that you wanted to use:
ann_benchmarks/algorithms/mariadb/module.py
288: #self._cur.execute("SET hnsw.ef_search = %d" % ef_search)
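For illustration only, here is a hypothetical sketch of plugin-style system variables for these parameters (variable names, defaults, and ranges are made up; if the HNSW code lives in the server core rather than a plugin, the server-side Sys_var_* machinery would be used instead):

```cpp
#include <mysql/plugin.h>

// Hypothetical runtime-tunable HNSW construction/search parameters.
static unsigned int hnsw_max_neighbors_per_layer;
static unsigned int hnsw_ef_search;

static MYSQL_SYSVAR_UINT(max_neighbors_per_layer, hnsw_max_neighbors_per_layer,
                         PLUGIN_VAR_RQCMDARG,
                         "Maximum number of neighbors kept per node per layer",
                         NULL, NULL, 100, 1, 10000, 0);

static MYSQL_SYSVAR_UINT(ef_search, hnsw_ef_search, PLUGIN_VAR_RQCMDARG,
                         "Candidate list size used during K-ANN search",
                         NULL, NULL, 10, 1, 10000, 0);

// Registered via the plugin's system_vars list.
static struct st_mysql_sys_var *hnsw_system_variables[]=
{
  MYSQL_SYSVAR(max_neighbors_per_layer),
  MYSQL_SYSVAR(ef_search),
  NULL
};
```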
That's all the thoughts I have for now. Let me know when you've double-checked the benchmark results for your code, and in case I messed up when running it, I'll fix it.
@HugoWenTD I've done a little bit more experimenting with your bugfix from #3156. For certain combinations of parameters, it does seem to offer a marginal improvement in both QPS and recall, but I'd like to test this more systematically.
Hi @cvicentiu, thank you for reviewing the pull requests and providing feedback on the benchmarking tool. I really appreciate your input.
I'm traveling for meetings this week, but I will try to address the items you mentioned later this week or next week. You made an important point that mMax (max neighbours) affects the update speed when using node->neighbours[]. I believe I did run tests with some bigger mMax values, but for the comparison with the previous implementation I only recorded the results for the default parameters. I will gather more data on the impact of mMax and get back to you.
Regarding your comment:
> I've re-read the paper a few times and I strongly believe that the graph is only meant to be considered an undirected graph, hence all connections are bi-directional. If you can point out a paragraph to me (or any other source) where the original HNSW allows one-directional links, I am more than happy to reconsider.
Here are a few points supporting the use of one-directional links:
- Algorithm 1 in the paper describes the shrink step: at line 16, "set neighbourhood(e) at layer lc to eNewConn" shrinks only the one-directional connection, and it does not mention deleting the connection in the opposite direction. (A minimal sketch of this one-directional shrink follows after this list.)
- In addition, PgVector's shrinking step is in this section; it also shrinks only one direction.
- In the worst case, there could be a scenario where vectors get isolated:
- Assume mMax-level0 is 4
- When vector 5 (V5) is inserted, V4 gets isolated.
- If V4 happens to be the entry point of a higher level, none of V1/V2/V3/V5 will ever be traversed in a future insert or search.
- While increasing mMax could mitigate this, it still reduces the navigability of the graph. In the benchmark tests, I did see lower recall caused by fewer vectors being returned than the specified limit.
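Here is a minimal sketch of how I read the line-16 shrink step (illustrative names only, not MariaDB code, with a trivial stand-in for SELECT-NEIGHBORS): only e's own neighbour list is rewritten, and the reverse links from the dropped neighbours back to e are left untouched.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node { /* vector data, row reference, ... */ };
using Neighbours = std::vector<Node*>;

// Trivial stand-in for SELECT-NEIGHBORS; the real heuristic picks by distance.
static Neighbours select_neighbours(const Neighbours &candidates,
                                    std::size_t m_max)
{
  return Neighbours(candidates.begin(),
                    candidates.begin() +
                    std::min(m_max, candidates.size()));
}

// Algorithm 1, line 16: "set neighbourhood(e) at layer lc to eNewConn".
// Only e's adjacency list shrinks; no connection in the opposite direction
// (from a dropped neighbour back to e) is deleted here.
static void shrink_if_needed(Neighbours &e_conn, std::size_t m_max)
{
  if (e_conn.size() > m_max)
    e_conn= select_neighbours(e_conn, m_max);
}
```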
> Your benchmark shows that we need to be able to parameterize the index construction at runtime so we can experiment without recompiling. Would you be interested in tackling that bit? You can create some system variables for now and we'll use those. A quick and dirty solution, just to map with your already written benchmark code. I noticed you have some commented out SET commands that you wanted to use:
Hi @cvicentiu, I just created PR https://github.com/MariaDB/server/pull/3226, targeting your newly created branch bb-11.4-vec-vicentiu-hugo, to configure the parameters through system variables. Please be aware that the modification in ann-benchmark is not backward compatible, so I have created a new branch to accommodate this change.
> Updating a connection will cause a much larger write in the storage engine. We need to benchmark with longer neighbour lists (100 neighbors max for instance).
Sharing some benchmark results with different m values, using DATASET=random-s-100-euclidean. The insert time and search time increased as expected, but they are still acceptable I think. I will test some more combinations with different ef_construction values.
0: MariaDB(m=100, ef_construction=10, ef_search=10)
recall: 0.846
QPS: 669.383
Insert time: 252.37262177467346
1: MariaDB(m=50, ef_construction=10, ef_search=10)
recall: 0.825
QPS: 678.479
Insert time: 258.6093146800995
2: MariaDB(m=10, ef_construction=10, ef_search=10)
recall: 0.335
QPS: 1052.397
Insert time: 192.79994201660156
Partially included in e317023; the rest is obsolete.