BP Reordering Codec
This is a stab at implementing the previously-published BpVectorReorderer as part of the codec. The version we have now can be used as a merge policy to reorder docids by doing BP over a vector field. This version allows the ordering to be done at the vector-ordinal level for each field independently, while docids may be sorted by some other IndexSort.
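To make the per-field remapping concrete, here is a minimal, self-contained sketch in plain Java (no Lucene APIs; all names are invented for illustration): BP produces a permutation of a field's vector ordinals, the vectors are rewritten in that order, and a new-ordinal -> docid table is kept so hits can still be reported against docids that remain in IndexSort order.

```java
/**
 * Minimal sketch (not the actual codec code) of per-field ordinal reordering.
 * newOrder[i] = the old ordinal that should become new ordinal i, as computed by BP.
 */
class OrdinalRemapSketch {

  /** Rearranges vectors into the new ordinal order. */
  static float[][] reorderVectors(float[][] vectorsByOldOrd, int[] newOrder) {
    float[][] reordered = new float[newOrder.length][];
    for (int newOrd = 0; newOrd < newOrder.length; newOrd++) {
      reordered[newOrd] = vectorsByOldOrd[newOrder[newOrd]];
    }
    return reordered;
  }

  /** Builds the new-ordinal -> docid table that the codec has to persist alongside the vectors. */
  static int[] remapOrdToDoc(int[] docByOldOrd, int[] newOrder) {
    int[] docByNewOrd = new int[newOrder.length];
    for (int newOrd = 0; newOrd < newOrder.length; newOrd++) {
      docByNewOrd[newOrd] = docByOldOrd[newOrder[newOrd]];
    }
    return docByNewOrd;
  }

  public static void main(String[] args) {
    float[][] vectors = {{0f}, {1f}, {2f}, {3f}};
    int[] docByOldOrd = {0, 1, 2, 3}; // dense case: identity mapping before reordering
    int[] newOrder = {2, 0, 3, 1};    // permutation produced by BP
    int[] docByNewOrd = remapOrdToDoc(docByOldOrd, newOrder);
    // vectors for old ordinals {2,0,3,1} are now stored contiguously; docByNewOrd = {2,0,3,1}
    System.out.println(java.util.Arrays.toString(remapOrdToDoc(docByOldOrd, newOrder)));
    System.out.println(reorderVectors(vectors, newOrder).length);
  }
}
```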
The code is fully working as far as I can tell, and I think the design is reasonable, but some TODOs and nocommits remain:
- Test results have so far been underwhelming, and I plan to keep working to see if it produces results that are on par with the merge policy version. There might be a bug somewhere, and if anyone is interested in looking, I'd appreciate feedback.
- There isn't a convenient way to enable/disable reordering in tests. Currently a few tests will fail (TestBpVectorReorder in particular, since it reorders both at the codec level and separately in the merge policy).
OK, after finally having ironed out the bugs, I have some results. The situation is a little complicated, as the change here really doesn't help much with the typical "dense" index where every document has a vector. I think the reason is that any gains are masked by the additional cost of having a node->doc mapping that must be traversed. On the other hand, in the "sparse" case where some documents have no vectors, we already have such a mapping, so we can see the impact of this change more clearly. Net/net we see improvements in search latency, increasing with index size: on indexes of 1-2MM documents I see a 5% improvement; on 10MM, a 10% improvement. As expected, the vex files show a decrease in size (about 15%). There is also an increase in vem size, since that is where we store the new node->doc mapping, but it is pretty small. Merge times go up a lot - this metric varies quite a bit, but the increase seems to be about 100%.
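As an aside on why the dense case doesn't show the same win, here is a minimal sketch (plain Java, illustrative names only) of the per-hit indirection involved: before reordering, a dense index can treat ordinal == docid, while after reordering every hit pays an ord -> doc lookup; a sparse index consults such a table either way, so reordering adds no new indirection there.

```java
/** Sketch: the per-hit indirection that masks BP gains in the dense case. */
class OrdToDocSketch {
  /** ordToDoc == null models the dense identity mapping (ordinal == docid). */
  static int docForOrd(int ord, int[] ordToDoc) {
    return ordToDoc == null ? ord : ordToDoc[ord];
  }

  public static void main(String[] args) {
    int[] ordToDoc = {7, 2, 9, 4};              // sparse or reordered segment: table consulted per hit
    System.out.println(docForOrd(2, null));     // dense, unreordered: doc 2
    System.out.println(docForOrd(2, ordToDoc)); // reordered: doc 9
  }
}
```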
It may be possible to reduce the merge times by tweaking the parameters of the BP execution so that it recurses less; I'll see if I can do that while retaining the latency improvements. It might also be best to enable this only for sparse indexes.
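For a rough sense of the knobs involved, here is a back-of-the-envelope sketch (plain Java, illustrative numbers only, not measurements from this PR) of how recursion depth and iteration count multiply: raising the minimum partition size makes BP recurse less deeply, and lowering the per-level iteration cap reduces work at each level.

```java
/**
 * Back-of-the-envelope sketch: BP's cost is roughly (recursion depth) x (iters per level) x N,
 * so a larger minimum partition size (shallower recursion) or a lower iteration cap trims merge time.
 */
class BpCostSketch {
  static long estimatedWork(long numVectors, int minPartitionSize, int maxIters) {
    int depth = 0;
    for (long n = numVectors; n > minPartitionSize; n /= 2) {
      depth++; // each level halves the partitions until they are small enough
    }
    return (long) depth * maxIters * numVectors; // vector-visit count, up to constant factors
  }

  public static void main(String[] args) {
    long n = 10_000_000L;
    System.out.println(estimatedWork(n, 32, 20));  // deeper recursion, more iterations per level
    System.out.println(estimatedWork(n, 128, 10)); // shallower + fewer iterations: less than half the work
  }
}
```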
Here are luceneutil results from one run with 10MM Cohere 768-dim wiki documents, with a 50% selectivity filter (I made some changes to luceneutil to be able to create an index with 50% "empty" documents):
| recall | latency(ms) | netCPU | avgCpuCount | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | bp-reorder |
|--------|-------------|--------|-------------|---------|----------|--------------|----------------|----------------|------------|
| 0.923  | 3.350       | 3.338  | 0.996       | 5990    | 2754.51  | 3630.41      | 2283.57        | 15169.98       | false      |
| 0.921  | 3.052       | 3.042  | 0.997       | 5691    | 2858.65  | 3498.15      | 4312.79        | 15123.17       | true       |
It's a little surprising that we see recall changes, since the graphs should be the same, merely reordered? They seem to be sometimes worse, sometimes better, and never more than about 0.002 different.
> They seem to be sometimes worse, sometimes better, and never more than about 0.002 different

Could it be accumulated floating point errors?
> It's a little surprising we see recall changes, since the graphs should be the same, merely reordered?

If a graph is built seeing the vectors in a different order (which can happen with concurrent merging), I would expect slight recall deviations at lower ef_search parameters.
I've done many runs, and for the most part I'm seeing identical recall, so I think we can safely ignore these occasional variations. It is probably down to ordering differences from concurrent indexing, as @benwtrent says.
I can also report that the merge time difference is greatly reduced if I enable concurrent reordering by passing in the MergeState intra-merge executor (a sketch of the wiring follows after the tables below). E.g.:
| recall | latency(ms) | netCPU | avgCpuCount | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | bp-reorder |
|--------|-------------|--------|-------------|---------|----------|--------------|----------------|----------------|------------|
| 0.937  | 3.022       | 3.002  | 0.993       | 5139    | 155.35   | 6437.20      | 129.43         | 1510.60        | false      |
| 0.938  | 2.679       | 2.669  | 0.996       | 5140    | 160.86   | 6216.51      | 150.13         | 1507.24        | true       |
and then, by reducing the reorder MAX_ITERS from 20 to 10, we can regain some more of the merge time:
| latency(ms) | netCPU | avgCpuCount | visited | index(s) | index_docs/s | force_merge(s) | index_size(MB) | bp-reorder |
|-------------|--------|-------------|---------|----------|--------------|----------------|----------------|------------|
| 3.095       | 3.085  | 0.997       | 5145    | 156.18   | 6402.87      | 127.41         | 1510.59        | false      |
| 2.705       | 2.695  | 0.996       | 5133    | 160.26   | 6239.82      | 144.53         | 1507.26        | true       |
(Note: recall is not shown in this table because it is identical between the two runs. I made a local change to the luceneutil script to hide columns that don't vary.)
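For reference, this is roughly the shape of the executor wiring mentioned above, as a sketch only: recent Lucene versions expose an executor for intra-merge parallelism on MergeState (intraMergeTaskExecutor; verify the exact name and type against this branch), and the reorderVectors hook below is a hypothetical stand-in for however this PR actually hands that executor and the reduced iteration count to the BP reorderer.

```java
import java.util.concurrent.Executor;
import org.apache.lucene.index.MergeState;

/** Sketch only; not the PR's actual code. */
class ConcurrentReorderSketch {
  void reorderDuringMerge(MergeState mergeState) {
    // Assumed to be a java.util.concurrent.Executor, as in recent Lucene; may be null, in which
    // case the reorder runs single-threaded.
    Executor executor = mergeState.intraMergeTaskExecutor;
    reorderVectors(executor, /* maxIters= */ 10); // 10 instead of the default 20
  }

  void reorderVectors(Executor executor, int maxIters) {
    // hypothetical placeholder for the codec-level BP reordering call
  }
}
```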