SVE optimization for HNSW::MinimaxHeap::pop_min()
- Add an Arm SVE implementation of HNSW::MinimaxHeap::pop_min()
- Add prefetch for ids.data() and dis.data() to reduce memory latency
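
For reviewers, a minimal scalar sketch of the behavior the vectorized path is expected to preserve. `MinimaxHeapSketch` is a hypothetical, simplified stand-in, not the Faiss class. It assumes that `-1` in `ids` marks an emptied slot and that ties go to the highest index, which follows from scanning backward with a strict `<`. The prefetch calls only illustrate the idea from the bullet above, and their placement here is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for HNSW::MinimaxHeap (illustration only).
struct MinimaxHeapSketch {
    int k = 0;                 // number of slots in use (valid or emptied)
    int nvalid = 0;            // number of slots whose id is not -1
    std::vector<int32_t> ids;  // assumption: -1 marks an emptied slot
    std::vector<float> dis;    // distance stored for each slot

    // Returns the id with the smallest distance and empties its slot,
    // or -1 if no valid slot is left. This is an O(k) linear scan.
    int pop_min(float* vmin_out = nullptr) {
        assert(k > 0);
#if defined(__GNUC__) || defined(__clang__)
        // Hint that both arrays are about to be read (assumed placement).
        __builtin_prefetch(ids.data());
        __builtin_prefetch(dis.data());
#endif
        // Find the last valid slot.
        int i = k - 1;
        while (i >= 0 && ids[i] == -1) {
            --i;
        }
        if (i < 0) {
            return -1;
        }
        int imin = i;
        float vmin = dis[i];
        // Scan the remaining slots backward; strict '<' keeps the
        // highest index when several distances are equal.
        for (--i; i >= 0; --i) {
            if (ids[i] != -1 && dis[i] < vmin) {
                vmin = dis[i];
                imin = i;
            }
        }
        if (vmin_out != nullptr) {
            *vmin_out = vmin;
        }
        int ret = ids[imin];
        ids[imin] = -1;  // mark the slot as emptied
        --nvalid;
        return ret;
    }
};
```

With identical distances, this sketch returns the entry stored at the highest index first. With all distances infinite, it still returns a valid id rather than `-1`. A vectorized version would need to match both behaviors to be a drop-in replacement.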
Unit test results for pop_min():

    $ ./faiss_test --gtest_filter=HNSW.Test_popmin*
    WARNING clustering 1000 points to 40 centroids: please provide at least 1560 training points
    Running main() from /home/scratch.lyou_gpu/arm/workspaces/faiss-main/build/_deps/googletest-src/googletest/src/gtest_main.cc
    Note: Google Test filter = HNSW.Test_popmin*
    [==========] Running 3 tests from 1 test suite.
    [----------] Global test environment set-up.
    [----------] 3 tests from HNSW
    [ RUN      ] HNSW.Test_popmin
    [       OK ] HNSW.Test_popmin (0 ms)
    [ RUN      ] HNSW.Test_popmin_identical_distances
    [       OK ] HNSW.Test_popmin_identical_distances (0 ms)
    [ RUN      ] HNSW.Test_popmin_infinite_distances
    [       OK ] HNSW.Test_popmin_infinite_distances (0 ms)
    [----------] 3 tests from HNSW (0 ms total)
    [----------] Global test environment tear-down
    [==========] 3 tests from 1 test suite ran. (0 ms total)
    [  PASSED  ] 3 tests.

Performance Result:
- Benchmark: cuVS bench (https://github.com/rapidsai/cuvs/tree/main/cpp/bench/ann)
- Dataset: deep-96-image
- Threads: 1 and 8
- Test machine: NVIDIA Grace CPU
1 Thread
| Configuration | Baseline | Optimized | Speedup | Recall |
|---|---|---|---|---|
| M16.efConstruction128.efSearch16 | 0.1647ms | 0.1643ms | 1.002x | 0.717 |
| M16.efConstruction128.efSearch64 | 0.2858ms | 0.2829ms | 1.010x | 0.914 |
| M16.efConstruction128.efSearch256 | 0.7482ms | 0.7220ms | 1.036x | 0.982 |
| M16.efConstruction128.efSearch1024 | 2.9258ms | 2.6881ms | 1.088x | 0.996 |
| M32.efConstruction128.efSearch16 | 0.1812ms | 0.1802ms | 1.006x | 0.784 |
| M32.efConstruction128.efSearch64 | 0.3297ms | 0.3254ms | 1.013x | 0.940 |
| M32.efConstruction128.efSearch256 | 0.8822ms | 0.8530ms | 1.034x | 0.990 |
| M32.efConstruction128.efSearch1024 | 3.3204ms | 3.0752ms | 1.080x | 0.998 |
| M32.efConstruction256.efSearch64 | 0.3540ms | 0.3498ms | 1.012x | 0.954 |
| M32.efConstruction256.efSearch256 | 0.9627ms | 0.9392ms | 1.025x | 0.994 |
Summary (1 Thread)
- Best speedup: 1.088x (M16.efConstruction128.efSearch1024)
- Average speedup: ~1.031x (arithmetic mean of the ten rows above)
- Speedup range: 1.002x - 1.088x
- Gains grow with efSearch: up to 1.088x at efSearch1024, which is about an 8.1% reduction in measured time
8 Threads
| Configuration | Baseline | Optimized | Speedup | Recall |
|---|---|---|---|---|
| M16.efConstruction128.efSearch16 | 0.0856ms | 0.0855ms | 1.001x | 0.714 |
| M16.efConstruction128.efSearch64 | 0.2157ms | 0.2128ms | 1.014x | 0.911 |
| M16.efConstruction128.efSearch256 | 0.7099ms | 0.6857ms | 1.035x | 0.982 |
| M16.efConstruction128.efSearch1024 | 2.9916ms | 2.7619ms | 1.083x | 0.997 |
| M32.efConstruction128.efSearch16 | 0.1047ms | 0.1045ms | 1.002x | 0.773 |
| M32.efConstruction128.efSearch64 | 0.2664ms | 0.2633ms | 1.012x | 0.940 |
| M32.efConstruction128.efSearch256 | 0.8598ms | 0.8363ms | 1.028x | 0.990 |
| M32.efConstruction128.efSearch1024 | 3.4262ms | 3.1977ms | 1.071x | 0.998 |
| M32.efConstruction256.efSearch64 | 0.2936ms | 0.2911ms | 1.008x | 0.955 |
| M32.efConstruction256.efSearch256 | 0.9519ms | 0.9282ms | 1.026x | 0.994 |
Summary (8 Threads)
- Best speedup: 1.083x (M16.efConstruction128.efSearch1024)
- Average speedup: ~1.028x (arithmetic mean of the ten rows above)
- Speedup range: 1.001x - 1.083x
- Gains grow with efSearch: up to 1.083x at efSearch1024, which is about a 7.7% reduction in measured time