luceneutil
luceneutil copied to clipboard
[Draft] Code to reproduce the disconnectedness in HNSW using open source datasets
Adding this code for someone to checkout and comment but not merge yet. This change picks up changes from earlier pending PR #234.
Added CheckHNSWConnectedness
for finding the connectedness of HNSW graph at each level of the graph and also overall disconnected nodes of the graph.
Refactored KnnIndexer out of KnnGraphTester to be able to use standalone and created KnnIndexerMain for it.
##Tests
- Use existing ant script or new Gradle task to create the vector files for wiki data using the gloVe data.
ant vectors300-docs
ant vectors100-docs
- Create the index for 100 dimension and 300 dimension vectors using the KnnIndexerMain
./gradlew :src:main:run -PmainClass=knn.KnnIndexerMain --args=" -docvectorspath lucene/benchmarks/data/enwiki-20120502-lines-1k-100d.vec -indexpath lucene/benchmarks/indices/vector_index -maxconn 16 -beamwidth 100 -vectorencoding FLOAT32 -similarityfunction DOT_PRODUCT -numdocs 1000000 -dimension 100"
- Check for HNSW connectedness
./gradlew :src:main:checkHnswConnected -Pindex="lucene/benchmarks/indices/vector300_index" -Pknn-field="knn"
The script currently only generated single segment indexes. I will try to create multiple segments and check again. But this proves that the there is disconnectedness in graph even when there is less dynamism in creating and updating documents.
Results
100-Dim 1M-vectors
~ 0.4 % disconnectedness
For index /home/nitirar/mywork/lucene/benchmarks/indices/vector100_index field : knn
Level = 0 Total Nodes = 1000000 Reachable Nodes = 996035 Unreachable Nodes = 3965 %Disconnectedness = 0.3965
Level = 1 Total Nodes = 62820 Reachable Nodes = 61442 Unreachable Nodes = 1378 %Disconnectedness = 2.193569
Level = 2 Total Nodes = 3879 Reachable Nodes = 3820 Unreachable Nodes = 59 %Disconnectedness = 1.5210105
Level = 3 Total Nodes = 233 Reachable Nodes = 232 Unreachable Nodes = 1 %Disconnectedness = 0.42918456
Level = 4 Total Nodes = 12 Reachable Nodes = 12 Unreachable Nodes = 0 %Disconnectedness = 0.0
Overall graph : Total Nodes = 1000000 Reachable Nodes = 996183 Unreachable Nodes = 3817 %Disconnectedness = 0.3817
300-Dim 1M-vectors
~ 1 % disconnectedness
For index /home/nitirar/mywork/lucene/benchmarks/indices/vector300_index field : knn
Level = 0 Total Nodes = 1000000 Reachable Nodes = 990377 Unreachable Nodes = 9623 %Disconnectedness = 0.9623
Level = 1 Total Nodes = 62820 Reachable Nodes = 60100 Unreachable Nodes = 2720 %Disconnectedness = 4.329831
Level = 2 Total Nodes = 3879 Reachable Nodes = 3756 Unreachable Nodes = 123 %Disconnectedness = 3.1709204
Level = 3 Total Nodes = 233 Reachable Nodes = 233 Unreachable Nodes = 0 %Disconnectedness = 0.0
Level = 4 Total Nodes = 12 Reachable Nodes = 12 Unreachable Nodes = 0 %Disconnectedness = 0.0
Overall graph : Total Nodes = 1000000 Reachable Nodes = 990729 Unreachable Nodes = 9271 %Disconnectedness = 0.9271