Ensure that no node duplicates exist in the adjacency list of any node.
Most nodes in a graph have MANY duplicated entries in their adjacency list. This is undesirable as it reduces the effective degree of the graph.
This was occurring because scores were the primary means of checking whether a node had been inserted twice into an adjacency list. However, when building a graph with PQ we use a mix of sim(x1, quant(x2)) and sim(quant(x1), quant(x2)) similarities. Because quantization is lossy, sim(x1, quant(x2)) != sim(quant(x1), quant(x2)). Thus, two different scores can be associated with the same node, depending on which quantized similarity we use.
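To make the failure mode concrete, here is a minimal, self-contained sketch. It uses a toy scalar quantizer as a stand-in for PQ (the class and method names are illustrative, not the actual codebase API); the point is only that the two similarity variants yield different scores for the same neighbor, so a score-based equality check cannot detect the duplicate:

```java
public class QuantScoreMismatch {
    // Toy scalar quantizer standing in for PQ: rounds each coordinate to one decimal.
    static float[] quant(float[] v) {
        float[] q = new float[v.length];
        for (int i = 0; i < v.length; i++) q[i] = Math.round(v[i] * 10f) / 10f;
        return q;
    }

    // Dot-product similarity, as an example sim() function.
    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        float[] x1 = {0.33f, 0.77f};
        float[] x2 = {0.58f, 0.12f};
        float s1 = dot(x1, quant(x2));        // sim(x1, quant(x2))
        float s2 = dot(quant(x1), quant(x2)); // sim(quant(x1), quant(x2))
        // Two scores for the SAME neighbor differ, so a score-based
        // duplicate check can let the node in twice.
        System.out.println(s1 == s2 ? "scores equal" : "scores differ");
    }
}
```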
This PR changes the way we detect duplicates in NodeArray to prevent these duplicates from appearing. Now we use only the node ordinals for these checks.
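The idea can be sketched as follows. This is not the actual NodeArray implementation, just a minimal illustration of keying the duplicate check on the ordinal while still storing the score:

```java
// Minimal sketch of an adjacency list that deduplicates by node ordinal.
// Field and method names are illustrative, not the real NodeArray API.
public class OrdinalDedup {
    int[] nodes = new int[8];
    float[] scores = new float[8];
    int size = 0;

    boolean contains(int ordinal) {
        for (int i = 0; i < size; i++) {
            if (nodes[i] == ordinal) return true; // compare ordinals, never scores
        }
        return false;
    }

    void insert(int ordinal, float score) {
        if (contains(ordinal)) return; // same node, possibly a different quantized score
        nodes[size] = ordinal;
        scores[size] = score;
        size++;
    }

    public static void main(String[] args) {
        OrdinalDedup adj = new OrdinalDedup();
        adj.insert(42, 0.91f);
        adj.insert(42, 0.87f); // same node, different score: correctly rejected
        System.out.println("size=" + adj.size);
    }
}
```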
One potential future improvement is to order each adjacency list by node ordinal so that duplicate checks are faster. There is a delicate tradeoff between accelerating these checks and slowing down the diversity computation, which needs to be analyzed carefully.
Initial benchmarking shows that index construction got a bit slower (~10%), which is not surprising. Even though this PR moves things in the right direction conceptually, in practice it does not offer a net benefit. I think we need a better solution, not just a patch. I will convert the PR to a draft until that superior solution is in place.
The most recent commits overhaul the strategy used to ensure that edges in the graph are unique. Now, each adjacency list is sorted by node ID in ascending order.
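With the sorted-by-ID invariant, a duplicate check becomes a binary search instead of a linear scan. A minimal sketch of the invariant (illustrative names, not the actual implementation):

```java
import java.util.Arrays;

// Sketch: keep each adjacency list sorted by node ordinal so that a
// duplicate check is an O(log n) binary search. Illustrative only.
public class SortedAdjacency {
    int[] nodes = new int[0];

    boolean insert(int ordinal) {
        int idx = Arrays.binarySearch(nodes, ordinal);
        if (idx >= 0) return false; // already present: reject the duplicate edge
        int ins = -idx - 1;         // insertion point that preserves the order
        int[] next = new int[nodes.length + 1];
        System.arraycopy(nodes, 0, next, 0, ins);
        next[ins] = ordinal;
        System.arraycopy(nodes, ins, next, ins + 1, nodes.length - ins);
        nodes = next;
        return true;
    }

    public static void main(String[] args) {
        SortedAdjacency adj = new SortedAdjacency();
        adj.insert(7);
        adj.insert(3);
        adj.insert(7); // duplicate, rejected
        System.out.println(Arrays.toString(adj.nodes));
    }
}
```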
NodeArray has a method, `public NodesIterator getIteratorSortedByScores()`, that is used when pruning each adjacency list so that nodes are explored in descending order of score. This adds an extra sorting operation to the pruning step. However, pruning is much less frequent than insertion into adjacency lists (because of backlinks), so we should not see a detrimental effect on performance.
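The shape of that score-sorted view can be sketched like this: the backing arrays stay ordered by ordinal, and pruning builds a temporary permutation sorted by descending score (again, names are illustrative, not the real `getIteratorSortedByScores()` internals):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: adjacency stored sorted by ordinal; pruning derives a view
// sorted by descending score without disturbing the backing arrays.
public class ScoreSortedView {
    public static void main(String[] args) {
        int[] nodes = {3, 7, 12};           // invariant: sorted by ordinal
        float[] scores = {0.4f, 0.9f, 0.6f}; // parallel array of scores

        // Permutation of indices, sorted by descending score.
        Integer[] order = {0, 1, 2};
        Arrays.sort(order, Comparator.comparingDouble(i -> -scores[i]));

        StringBuilder sb = new StringBuilder();
        for (int i : order) sb.append(nodes[i]).append(' ');
        System.out.println(sb.toString().trim()); // best-scored neighbor first
    }
}
```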