Bring back the fused graph index
This PR does extensive work to bring back the Fused Graph Index (FGI). In a non-fused graph, the PQ codebook of each vector in the index is stored in memory, so the memory complexity is linear in the number of vectors. FGI significantly reduces the amount of heap memory used during search by offloading the PQ codebooks to storage. These PQ codebooks are packed and stored in-line with the graph to avoid the runtime overhead that would otherwise result from this offload.
The memory complexity now has two cases:
- When using a non-hierarchical graph, the fused graph reduces the linear memory complexity to a small constant that does not change with the number of vectors in the graph (see the sketch after this list).
- When using a hierarchical graph, the upper layers of the hierarchy are kept in memory and the bottom layer is in storage. The PQ codebooks of the vectors in the upper layers are kept in memory, while the bottom layer behaves exactly like a non-hierarchical graph. Since the upper graph layers are sampled using a logarithmic distribution, we end up with logarithmic memory complexity.
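As a rough, purely illustrative sketch of the difference (the corpus size and code size below are hypothetical and not taken from this PR):

```java
// Back-of-the-envelope heap estimate for the packed PQ data (illustrative numbers only).
long numVectors = 100_000_000L; // hypothetical corpus size
int bytesPerVector = 192;       // e.g. 192 PQ subspaces at one byte each

// Non-fused graph: every vector's packed PQ data lives on the heap -> grows linearly.
long nonFusedHeapBytes = numVectors * bytesPerVector; // ~17.9 GiB

// Fused non-hierarchical graph: the packed PQ data lives in storage next to the
// adjacency lists, so the heap cost is a small constant (search-time scratch buffers)
// regardless of numVectors. In the hierarchical case, only the vectors sampled into
// the upper layers keep their PQ data on the heap.
```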
These savings come with a very moderate slowdown (reduction in throughput and increase in latency) of about 15%. See the results below for an example.
In this version (and in past versions), FGI only works with PQ, through the FUSED_PQ feature. This feature used to be called FUSED_ADC; it has been renamed to highlight the link with PQ.
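As a purely illustrative sketch of how the feature might be requested when writing an index (the FusedPQ feature class and the exact builder wiring below are assumptions for illustration, not the API introduced by this PR):

```java
// Hypothetical: request the fused layout when writing the on-disk graph, so that the
// packed PQ codes are interleaved with each node's adjacency list.
try (var writer = new OnDiskGraphIndexWriter.Builder(graph, indexPath)
        .with(new FusedPQ(pqVectors)) // hypothetical Feature carrying the packed PQ codes
        .build()) {
    writer.write(java.util.Map.of());
}
```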
The routine for expanding a node (gathering its out-neighbors and computing their similarities to the query) has been pushed down to the GraphIndex views. This allows each graph layout to use a slightly different, and potentially more efficient, algorithm than would be possible if the logic were abstracted away in the GraphSearcher.
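A minimal sketch of what this pushdown could look like (the type and method names below are illustrative, not the actual interface added by this PR):

```java
// Hypothetical shape of the pushed-down expansion hook. A view that stores fused PQ
// codes next to the adjacency list can score all neighbors in one sequential pass,
// while an in-memory view can fall back to per-neighbor scoring.
interface ExpandingView {
    /**
     * Visits every out-neighbor of {@code node}, computes its similarity to the query
     * via {@code scoreFunction}, and hands each (neighbor, score) pair to {@code consumer}.
     */
    void expand(int node, ScoreFunction scoreFunction, NeighborConsumer consumer);
}

@FunctionalInterface
interface ScoreFunction {
    float similarityTo(int node);
}

@FunctionalInterface
interface NeighborConsumer {
    void accept(int neighbor, float score);
}
```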
This PR refactors how FUSED_PQ uses SIMD instructions:
- The old algorithm used a transposed layout similar to Quick(er)-ADC. However, that design performed the SIMD parallelization not within each codebook but across different codebooks (hence the transpose). This parallelization was virtually impossible to combine with skipping the computation of previously computed similarities, and it implied additional computational overhead.
- An analysis of the number of skipped similarity computations showed that about 50% are skipped. Thus, the new algorithm simplifies the layout by storing the vectors in a non-transposed fashion, with no across-codebook parallelization. This makes it compatible with the visited checks (see the sketch after this list). Additionally, it is more efficient than the non-fused approach because the PQ codebooks gathered when expanding a given graph node now have better locality (in the non-fused graph, this required random accesses to multiple rows of a very tall and skinny matrix, which exhibits poor locality).
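A minimal sketch of the per-neighbor scoring loop described above (the names and signatures are illustrative, not the actual implementation):

```java
// Illustrative: score the out-neighbors of a node against a precomputed ADC lookup
// table, skipping neighbors whose similarity was already computed earlier in the search.
// `neighbors` and `packedCodes` are assumed to be read sequentially from the fused
// record of the node being expanded, which is what gives the improved locality.
float[] scoreNeighbors(int[] neighbors,
                       byte[] packedCodes,    // neighbors' PQ codes, stored contiguously
                       int codeLength,        // number of PQ subspaces per vector
                       float[][] lookupTable, // [subspace][centroid] partial similarities
                       java.util.BitSet alreadyScored) {
    float[] scores = new float[neighbors.length];
    for (int i = 0; i < neighbors.length; i++) {
        if (alreadyScored.get(neighbors[i])) {
            scores[i] = Float.NaN; // skip: similarity was computed when this node was first seen
            continue;
        }
        int base = i * codeLength;
        float sum = 0;
        for (int m = 0; m < codeLength; m++) { // per-codebook loop; SIMD-friendly within a vector
            sum += lookupTable[m][packedCodes[base + m] & 0xFF];
        }
        scores[i] = sum;
    }
    return scores;
}
```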
These SIMD changes have opened the possibility of deprecating the native vector util backend. We are not effecting that deprecation in this PR because there might be other considerations for keeping it around.
Edits:
- To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices.
Experimental results:
Dataset: ada002-100k
Configuration: M: 32, usePruning: true, neighborOverflow: 1.2, addHierarchy: true, efConstruction: 100
Results with topK=10
With a non-fused graph:
| Overquery | Avg QPS (of 3) | ± Std Dev | CV % | Mean Latency (ms) | STD Latency (ms) | p999 Latency (ms) | Avg Visited | Avg Expanded Base Layer | Recall@10 |
|-----------|----------------|-----------|------|-------------------|------------------|-------------------|-------------|-------------------------|-----------|
| 1.00      | 16118.8        | 285.2     | 1.8  | 0.456             | 0.049            | 0.770             | 307.4       | 12.7                    | 0.67      |
| 2.00      | 14100.7        | 19.5      | 0.1  | 0.507             | 0.064            | 0.802             | 421.6       | 22.9                    | 0.85      |
| 5.00      | 11307.2        | 260.8     | 2.3  | 0.640             | 0.111            | 1.093             | 702.5       | 52.9                    | 0.94      |
| 10.00     | 8335.0         | 88.8      | 1.1  | 0.849             | 0.184            | 1.587             | 1108.2      | 102.1                   | 0.97      |
With a fused graph:
| Overquery | Avg QPS (of 3) | ± Std Dev | CV % | Mean Latency (ms) | STD Latency (ms) | p999 Latency (ms) | Avg Visited | Avg Expanded Base Layer | Recall@10 |
|-----------|----------------|-----------|------|-------------------|------------------|-------------------|-------------|-------------------------|-----------|
| 1.00      | 13768.9        | 972.6     | 7.1  | 0.533             | 0.045            | 0.909             | 307.4       | 12.7                    | 0.67      |
| 2.00      | 12384.5        | 582.9     | 4.7  | 0.586             | 0.057            | 0.966             | 421.6       | 22.9                    | 0.85      |
| 5.00      | 9609.4         | 151.8     | 1.6  | 0.735             | 0.098            | 1.254             | 702.5       | 52.9                    | 0.94      |
| 10.00     | 7135.4         | 162.5     | 2.3  | 0.955             | 0.163            | 1.603             | 1108.2      | 102.1                   | 0.97      |
With the fused graph, queries per second (QPS) drop by at most about 15% (14% on average; e.g., at overquery=1, 1 − 13768.9 / 16118.8 ≈ 14.6%), and mean latency increases by at most about 17% (15% on average).
Results with topK=100
With a non-fused graph:
| Overquery | Avg QPS (of 3) | ± Std Dev | CV % | Mean Latency (ms) | STD Latency (ms) | p999 Latency (ms) | Avg Visited | Avg Expanded Base Layer | Recall@100 |
|-----------|----------------|-----------|------|-------------------|------------------|-------------------|-------------|-------------------------|------------|
| 1.00      | 8314.4         | 307.9     | 3.7  | 0.862             | 0.187            | 1.570             | 1108.2      | 102.1                   | 0.78       |
| 2.00      | 5415.1         | 17.0      | 0.3  | 1.294             | 0.302            | 2.293             | 1871.1      | 198.0                   | 0.93       |
With a fused graph:
| Overquery | Avg QPS (of 3) | ± Std Dev | CV % | Mean Latency (ms) | STD Latency (ms) | p999 Latency (ms) | Avg Visited | Avg Expanded Base Layer | Recall@100 |
|-----------|----------------|-----------|------|-------------------|------------------|-------------------|-------------|-------------------------|------------|
| 1.00      | 6744.6         | 303.1     | 4.5  | 0.975             | 0.172            | 1.800             | 1108.2      | 102.1                   | 0.78       |
| 2.00      | 4550.6         | 113.0     | 2.5  | 1.401             | 0.263            | 2.445             | 1871.1      | 198.0                   | 0.93       |
With the fused graph, QPS drops by 19% and 16% (overquery=1 and 2, respectively), and mean latency increases by 13% and 8% (overquery=1 and 2, respectively).
Experimental results on larger datasets
In the plots below, QPS, latency, and recall are stable (the remaining differences are run-to-run variability intrinsic to the benchmark). Index construction time increased a bit due to the process of fusing the graph on disk, which involves multiple random memory accesses for each node and writes more data to disk.
Before you submit for review:
- [x] Does your PR follow guidelines from CONTRIBUTIONS.md?
- [x] Did you summarize what this PR does clearly and concisely?
- [x] Did you include performance data for changes which may be performance impacting?
- [x] Did you include useful docs for any user-facing changes or features?
- [x] Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
- [x] Did you trigger and review regression testing results against the base branch via Run Bench Main?
- [x] Did you adhere to the code formatting guidelines (TBD)?
- [x] Did you group your changes for easy review, providing meaningful descriptions for each commit?
- [x] Did you ensure that all files contain the correct copyright header?
If you did not complete any of these, then please explain below.
@tlwillke I did some measurements for ada002-100k, using 192 PQ segments. The dataset contains 99562 vectors.
According to the ramBytesUsed estimate in PQVectors, they should take 19.74 MB. Roughly speaking, this matches the back-of-the-envelope count 192 * 99562 / (1024 * 1024) = 18.23 MB (one byte per segment per vector).
I used:
```java
System.gc();
double usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.format("Used Memory: [%.2f MB]%n", usedMemory / (1024 * 1024));
```
to measure the memory used before and after loading the PQVectors in memory, which the fused graph avoids.
According to this method, the memory used is 672.02 MB before loading the PQVectors and 697.73 MB after. Thus, the PQVectors occupy 25.71 MB. This is larger than 19.74 MB, so either the estimate is a bit optimistic or the garbage collector did not actually collect everything.
When running the fused graph index, the memory consumption pre and post loading is flat, as there is no loading.
Happy to incorporate these changes in Grid. They have performance upsides and downsides in the specific setting of Grid, where we run a matrix of configurations, so the efficiency trade-offs may differ from those of running a single configuration.
Slight refactor in Grid.runOneGraph so that the compressed vectors are loaded only when needed.
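A minimal sketch of the kind of lazy loading this refers to (the helper and the usage below are illustrative, not the actual Grid code):

```java
import java.util.function.Supplier;

// Hypothetical helper: defer loading the compressed (PQ) vectors until a configuration
// actually needs them, and cache the result so later configurations can reuse it.
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private T value;

    Lazy(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public synchronized T get() {
        if (value == null) {
            value = loader.get(); // load on first use only
        }
        return value;
    }
}

// Usage sketch inside a method like Grid.runOneGraph (names are hypothetical):
//   var compressedVectors = new Lazy<>(() -> loadPQVectors(path));
//   if (configurationNeedsCompressedVectors) {
//       runSearch(graph, compressedVectors.get(), queries);
//   }
```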