
Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces

Open msokolov opened this issue 1 year ago • 5 comments

addresses #13831

The basic idea is to move the scratch arrays and cloned IndexInputs (generally, any stateful data) into things returned by KnnVectorValues, so that class itself no longer needs to be cloned in order to get unique sources of vectors (or scorers). ByteVectorValues and FloatVectorValues got a new vectors() method (this returns the "dictionary") that supports the random access. Also, RandomVectorScorer gets cloned inputs and scratch data when created rather than relying on getting these from its enclosing values instance.

Naming notes:

The issue calls for a "dictionary" interface, but I found the name a bit confusing, so I undertook the following renaming: the "dictionary" interface is represented by FloatVectorValues.Floats and ByteVectorValues.Bytes (hearkening back to the RandomAccessVectorValues classes), and these new objects are returned by the new *VectorValues.vectors() methods. Where these methods are called, I've renamed the variables storing them to vectors. Instances of KnnVectorValues are now mostly stored in variables called vectorValues; these were previously called various things, including vectors, values, or vectorValues. I left some called values since I didn't want to touch any more files.

I've renamed the method vectorValue(int ord) to get(int ord) since there were entirely too many vectors, values, and vectorValues running around.

I also ensured that KnnVectorValues.iterator() always returns a unique instance. Previously we had been caching in a few places and returning a shared instance, which seems like a bug, although I don't think it caused any problems given our usage.

All in all it's a lot of fussy non-functional changes, but I do think the clarity makes it worth doing now, after ~5 years of evolution of these APIs.

msokolov avatar Oct 07 '24 21:10 msokolov

I'll merge to main soon and let tests noodle on this for a few days before backporting to 11.x. It seems benign, but it's easy to make an accidental slip in the code hurricane.

msokolov avatar Oct 08 '24 14:10 msokolov

Thanks for the insightful feedback - yeah I had been intending to do perf testing, and then got distracted by fascinating talks and kind of forgot about these concerns! Going through the code adding all these allocations, I was kind of thinking most of them would be infrequent, but I agree that if we are creating scorers per node that isn't going to be acceptable, so we need to find a way of sharing just enough but not too much. Anyway there's no rush to get this in, I'll take some time to dig in.

msokolov avatar Oct 08 '24 21:10 msokolov

hm there is some functional problem with the change that yields terrible recall for quantized vectors. I'll dig and fix and see if I can beef up the unit test coverage as well.

msokolov avatar Oct 09 '24 03:10 msokolov

hm there is some functional problem with the change that yields terrible recall for quantized vectors. I'll dig and fix and see if I can beef up the unit test coverage as well.

This likely means somewhere the scratch space isn't being appropriately handled :/

benwtrent avatar Oct 09 '24 11:10 benwtrent

With the most recent commit I saw these luceneutil/knnPerfTest.py results:

baseline

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.816         0.294  1500000    10       6       32         50         no   341.37         110.92             1          1534.03
 0.811         0.308  1500000    10       6       32         50     7 bits   346.68          93.22             1          1906.16
 0.786         0.288  1500000    10       6       32         50     4 bits   346.28          89.15             1          1906.10

this change with defaults (no command line flags)

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.817         0.304  1500000    10       6       32         50         no     344.11      111.70             1          1533.94
 0.812         0.231  1500000    10       6       32         50     7 bits     354.29       89.76             1          1906.16
 0.785         0.239  1500000    10       6       32         50     4 bits     352.37        89.01             1          1906.12

This change with vector api enabled:

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.817         0.247  1500000    10       6       32         50         no     0.00           0.17             1          1533.94
 0.812         0.282  1500000    10       6       32         50     7 bits     0.00           0.17             1          1906.16
 0.785         0.207  1500000    10       6       32         50     4 bits     0.00           0.17             1          1906.12

This change with vector api and enable-native-access

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.817         0.246  1500000    10       6       32         50         no     0.00           0.17             1          1533.94
 0.812         0.290  1500000    10       6       32         50     7 bits     0.00           0.17             1          1906.16
 0.785         0.206  1500000    10       6       32         50     4 bits     0.00           0.18             1          1906.12

So I think there is some slowdown in the quantized indexing. I think we need to find a solution for the over-allocations due to having moved this logic from ScorerSupplier to Scorer. The best idea I have is to make Scorers mutable and supply them with new target vectors as needed. WDYT?

msokolov avatar Oct 22 '24 19:10 msokolov

Can you clarify which allocation is the problematic one, and where it's done on the indexing path?

jpountz avatar Oct 25 '24 13:10 jpountz

Can you clarify which allocation is the problematic one, and where it's done on the indexing path?

See Ben's comments from ~2 weeks ago where he calls out the problem of overallocation. During indexing we call HnswGraphBuilder.diversityCheck() multiple times for each document (graph node) we insert, and in each of those calls we create scorers multiple times -- this is an n^2 algorithm (with n ~ number of neighbors). I'm proposing that instead of calling scorer() and creating a new scorer each time (which may in turn create a MemorySegment or a scratch array of some sort), we instead have a mutable Scorer that can accept a new target vector.
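Very roughly, something like this (a hypothetical sketch, names made up, just to illustrate the idea):

import java.io.IOException;

// Hypothetical sketch of the "mutable scorer" idea: one scorer instance owns
// its cloned input and scratch buffers, and is simply re-pointed at a new
// target ordinal instead of being re-created for every call.
interface RetargetableScorer {
  /** Re-point this scorer at a new target node, reusing existing scratch state. */
  void setTarget(int targetOrd) throws IOException;

  /** Score the current target against another node. */
  float score(int node) throws IOException;
}

// In HnswGraphBuilder.diversityCheck() the inner loop would then look like:
//   scorer.setTarget(candidateOrd);
//   float s = scorer.score(neighborOrd);
// rather than supplier.scorer(candidateOrd) allocating a fresh scorer each time.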

msokolov avatar Oct 25 '24 13:10 msokolov

that we instead have a mutable Scorer that can accept a new target vector.

Yes, that is something that I've noodled on for a while now too - a scorer that accepts two ords and returns the score. This will save gigabytes of garbage, which can be seen in the blunder output of the nightly luceneutil runs. Though, you don't have to do it all in this PR. E.g. https://blunders.io/jfr-demo/indexing-1kb-vectors-2024.10.24.18.04.28/top_allocators

[Screenshot, 2024-10-25: top allocators from the blunders.io nightly profile]

ChrisHegarty avatar Oct 25 '24 13:10 ChrisHegarty

I think a "merging scorer" would be good. The only place the "scorer supplier" is used is during graph building.

My initial concern with a "mutable scorer" is that it would also make the single scorer mutable, which seems weird to me. But I am happy to revisit this, especially since it's blocking a nice refactor.

Given that all this random scorer stuff is internal API, we can do whatever is best with what we have.

benwtrent avatar Oct 25 '24 13:10 benwtrent

Yes, OK I now see quite a bit of this is a "preexisting condition" and maybe not exacerbated by this change. We are still creating more scratch arrays than we did before, though, I think: previously we would copy() the VectorValues in a caller and allocate a new scratch array there, whereas now that we have pushed the "create new scratch array" down into Scorer creation, which happens many more times than we previously copied the VectorValues, we are creating and destroying many more of these scratch arrays. Maybe this is acceptable and we can iterate in a further cleanup? Let me try a few more benchmarking runs and be a little clearer about the impact on query and indexing times. I'd also like to report allocations, but I'm not sure how to do that with luceneutil.

msokolov avatar Oct 25 '24 13:10 msokolov

Maybe we could add a RandomVectorScorer.setTarget(int node) method that would only be implemented by the Scorers returned from ScorerSuppliers?

msokolov avatar Oct 25 '24 13:10 msokolov

Maybe we could add a RandomVectorScorer.setTarget(int node) method that would only be implemented by the Scorers returned from ScorerSuppliers?

Let's defer a double-addressing scorer to a follow-on change, since this is already getting big!

I pushed a change to fix the off-heap scorer. The scratch buffers are now deferred until actually needed, and they are only needed when the slice is null, which is rare.

ChrisHegarty avatar Oct 29 '24 17:10 ChrisHegarty

FYI - I created the following issue to track the possibility of adding a scorer interface that scores one ordinal against another, without the need to create an instance of a scorer. #13966
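Roughly the shape I have in mind there (a sketch only, not a settled API):

import java.io.IOException;

// Sketch only: score one ordinal directly against another, with no
// per-target scorer instance (and hence no per-target allocation).
interface OrdinalPairScorer {
  float score(int ordA, int ordB) throws IOException;
}

The graph builder could hold one of these per indexing/merge thread and call it directly in the diversity and neighbor-selection loops.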

ChrisHegarty avatar Oct 30 '24 12:10 ChrisHegarty

I re-tested this change, including @ChrisHegarty 's latest, and it still exhibits some bug that causes terrible recall for quantized vectors. I think it is only occurring when concurrent merging is enabled, so I'll dig around in there. Separately this seems to point to more gaps in our unit test coverage. I'll open a separate issue to enhance RandomCodec so it can explore different KnnVectorsFormat settings (like concurrent merges).

msokolov avatar Nov 06 '24 15:11 msokolov

In the process of adding a randomized KnnVectorsFormat to RandomCodec I learned that 4-bit quantization is not supported for odd-length vectors!? This makes the randomized testing quite difficult, since we don't have a good way of ensuring that the 4-bit codec generated randomly by the test framework never comes in contact with the odd-dimensional vectors randomly generated by the test case. I wonder how hard it would be to actually support this?

msokolov avatar Nov 06 '24 19:11 msokolov

I wonder how hard it would be to actually support this?

It requires some extra logic and changes to the optimized scoring. The reasons for preventing it were:

  • It makes the logic for int4 compression & scoring way simpler
  • Odd dimensioned vectors are rare or not used at all in practice

benwtrent avatar Nov 06 '24 19:11 benwtrent

Odd dimensioned vectors are rare or not used at all in practice

true - maybe we ought to ban them outright? Otherwise this is kind of weird. One thing we could do is pad on the input side. I believe all our similarities would be invariant to adding an extra zero?
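For example, a minimal sketch of padding on the input side (a trailing zero component contributes nothing to the dot product or to either vector's norm, so dot product, cosine, Euclidean and max inner product should all be unchanged):

// Sketch: pad an odd-dimension vector with one trailing zero before handing
// it to the int4 codec. The new last component is 0f, so similarity scores
// against other padded vectors are unchanged.
static float[] padToEvenDimension(float[] v) {
  if ((v.length & 1) == 0) {
    return v; // already even, nothing to do
  }
  float[] padded = new float[v.length + 1]; // last component defaults to 0f
  System.arraycopy(v, 0, padded, 0, v.length);
  return padded;
}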

msokolov avatar Nov 06 '24 20:11 msokolov

Finally got back to this and fixed the aliasing that was happening. I ran some perf tests and don't see significant variance. Still, it's clear we must be doing a lot more allocations here during indexing; I plan to look at profiler output to confirm that. But maybe it's captured by escape analysis and we're able to reuse? A possible solution could be to maintain a pool of Scorers in each ScorerSupplier and reclaim them with a close()? But maybe that can come as a follow-up.
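Something like this, very roughly (a hypothetical sketch, names made up; assumes each supplier is used from a single thread, as today):

import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayDeque;

// Hypothetical sketch of pooling scorers in a supplier: scorers are borrowed,
// re-targeted, and returned via close() instead of being allocated per call.
class PooledScorerSupplier {
  private final ArrayDeque<PooledScorer> pool = new ArrayDeque<>();

  PooledScorer scorer(int targetOrd) throws IOException {
    PooledScorer s = pool.poll();
    if (s == null) {
      s = new PooledScorer(this); // allocates scratch buffers, clones inputs
    }
    s.setTarget(targetOrd);
    return s;
  }

  void release(PooledScorer s) {
    pool.push(s); // make it available for reuse
  }

  static class PooledScorer implements Closeable {
    private final PooledScorerSupplier owner;
    private int targetOrd;

    PooledScorer(PooledScorerSupplier owner) {
      this.owner = owner;
    }

    void setTarget(int targetOrd) {
      this.targetOrd = targetOrd;
    }

    float score(int node) {
      // a real implementation would compute the similarity of targetOrd vs
      // node here, using the pooled scratch buffers
      return 0f;
    }

    @Override
    public void close() {
      owner.release(this); // return to the pool rather than discarding
    }
  }
}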

this patch, Cohere, mip, 1M docs

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.965         2.995  1000000    10      64       64        250         no  1795.97        1329.54             1          3018.62
 0.878         2.844  1000000    10      64       64        250     7 bits  1715.70        1202.92             1          3749.72
 0.541         2.084  1000000    10      64       64        250     4 bits  1650.77         842.20             1          3764.59

mainline, Cohere, mip, 1M docs

recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.965         3.161  1000000    10      64       64        250         no  1813.15        1277.61             1          3018.62
 0.878         2.775  1000000    10      64       64        250     7 bits  1681.21        1168.95             1          3749.72
 0.541         2.092  1000000    10      64       64        250     4 bits  1646.78         802.31             1          3764.59

msokolov avatar Nov 07 '24 12:11 msokolov

hmm that commit is kind of messed up. Maybe I missed a rebase on main somewhere? I will try to clean up here, but it might require a force-push or a new PR, egad

msokolov avatar Nov 07 '24 12:11 msokolov

heap comparison

Here's the output from luceneutil's JFR heap usage summarizer. Clearly there are a lot more allocations with this change.

float32 mainline

PERCENT       HEAP SAMPLES  STACK
31.03%        262M          org.apache.lucene.util.hnsw.NeighborArray#<init>() [Inlined code]
18.07%        152M          org.apache.lucene.util.ArrayUtil#copyOfSubArray() [Inlined code]
9.11%         77M           java.lang.Integer#valueOf() [Inlined code]
8.61%         72M           org.apache.lucene.util.hnsw.NeighborQueue#nodes() [Inlined code]

float32 knn-dictionary

PERCENT       HEAP SAMPLES  STACK
94.25%        54257M        org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#vectors() [Inlined code]
2.27%         1306M         org.apache.lucene.store.MemorySegmentIndexInput#clone() [Inlined code]
1.14%         656M          org.apache.lucene.util.hnsw.NeighborQueue#nodes() [Inlined code]
0.46%         266M          org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#<init>() [Inlined code]
0.43%         245M          org.apache.lucene.util.hnsw.NeighborArray#<init>() [Inlined code]
0.37%         214M          org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatScoringSupplier#scorer() [JIT compiled code]
0.24%         138M          org.apache.lucene.util.ArrayUtil#copyOfSubArray() [Inlined code]
0.14%         83M           java.lang.Integer#valueOf() [Inlined code]
0.11%         62M           org.apache.lucene.codecs.lucene95.OffHeapFloatVectorValues#vectors() [JIT compiled code]
0.07%         42M           org.apache.lucene.util.hnsw.NeighborQueue#nodes() [JIT compiled code]

int7 mainline

PERCENT       HEAP SAMPLES  STACK
53.56%        14617M        jdk.internal.foreign.SegmentFactories#fromArray() [Inlined code]
41.52%        11333M        jdk.internal.foreign.MemorySessionImpl#createHeap() [Inlined code]
1.16%         316M          org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorScorer#dotProductFactory() [Inlined code]
1.03%         280M          org.apache.lucene.util.hnsw.NeighborArray#<init>() [Inlined code]
0.56%         152M          org.apache.lucene.util.ArrayUtil#copyOfSubArray() [Inlined code]
0.29%         79M           java.lang.Integer#valueOf() [Inlined code]

int7 knn-dictionary

PERCENT       HEAP SAMPLES  STACK
40.22%        20231M        java.nio.HeapByteBuffer#<init>() [Inlined code]
30.42%        15298M        jdk.internal.foreign.SegmentFactories#fromArray() [Inlined code]
22.30%        11214M        jdk.internal.foreign.MemorySessionImpl#createHeap() [Inlined code]
1.35%         677M          org.apache.lucene.store.MemorySegmentIndexInput#clone() [Inlined code]
1.01%         506M          java.nio.ByteBuffer#allocate() [Inlined code]
0.82%         411M          org.apache.lucene.codecs.lucene99.OffHeapQuantizedByteVectorValues$1#<init>() [Inlined code]
0.77%         389M          org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorScorer#dotProductFactory() [Inlined code]
0.63%         315M          org.apache.lucene.codecs.lucene99.OffHeapQuantizedByteVectorValues#vectors() [Inlined code]
0.54%         270M          org.apache.lucene.util.hnsw.NeighborArray#<init>() [Inlined code]
0.31%         155M          org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#<init>() [Inlined code]
0.31%         154M          org.apache.lucene.util.ArrayUtil#copyOfSubArray() [Inlined code]
0.15%         75M           java.lang.Integer#valueOf() [Inlined code]

int4 mainline

PERCENT       HEAP SAMPLES  STACK
26.75%        348M          org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorScorer#dotProductFactory() [Inlined code]
21.15%        275M          org.apache.lucene.util.hnsw.NeighborArray#<init>() [Inlined code]
11.16%        145M          org.apache.lucene.util.ArrayUtil#copyOfSubArray() [Inlined code]
6.36%         82M           java.lang.Integer#valueOf() [Inlined code]
5.25%         68M           org.apache.lucene.util.hnsw.NeighborQueue#nodes() [Inlined code]
2.79%         36M           org.apache.lucene.util.SparseFixedBitSet#insertLong() [JIT compiled code]
2.54%         33M           java.util.Arrays#copyOf() [Inlined code]

int4 knn-dictionary

PERCENT       HEAP SAMPLES  STACK
85.00%        20633M        java.nio.HeapByteBuffer#<init>() [Inlined code]
2.53%         614M          org.apache.lucene.store.MemorySegmentIndexInput#clone() [Inlined code]
2.05%         498M          java.nio.ByteBuffer#allocate() [Inlined code]
1.66%         403M          org.apache.lucene.codecs.lucene99.OffHeapQuantizedByteVectorValues$1#<init>() [Inlined code]
1.60%         388M          org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorScorer#dotProductFactory() [Inlined code]
1.16%         282M          org.apache.lucene.codecs.lucene99.OffHeapQuantizedByteVectorValues#vectors() [Inlined code]
1.14%         277M          org.apache.lucene.util.hnsw.NeighborQueue#nodes() [Inlined code]
1.05%         255M          org.apache.lucene.util.hnsw.NeighborArray#<init>() [Inlined code]
0.81%         196M          org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl#<init>() [Inlined code]
0.60%         144M          org.apache.lucene.util.ArrayUtil#copyOfSubArray() [Inlined code]
0.34%         82M           java.lang.Integer#valueOf() [Inlined code]
0.18%         44M           org.apache.lucene.util.hnsw.NeighborQueue#nodes() [JIT compiled code]
0.16%         38M           org.apache.lucene.util.SparseFixedBitSet#insertLong() [JIT compiled code]
0.15%         36M           java.util.Arrays#copyOf() [Inlined code]

msokolov avatar Nov 07 '24 15:11 msokolov

thanks @msokolov I'll take another look at why the off-heap scorer is allocating so much.

ChrisHegarty avatar Nov 07 '24 16:11 ChrisHegarty

@ChrisHegarty I think it's expected since we would previously do the allocation once per KnnVectorValues, but now we are doing it once per RandomVectorScorer. I'm working on adding Closeable to some of these interfaces so that we can pool and reuse these objects. Maybe there's another way?

msokolov avatar Nov 07 '24 19:11 msokolov

The threading and reuse model in the current API is subtle, but works well, as it mostly minimises the potential for garbage.

Looking at this again, it's almost like anywhere there was a copy in the scorers, we need to replace it with vectors in the constructor, rather than creating a new instance on every use. Lemme see how this looks.

ChrisHegarty avatar Nov 08 '24 10:11 ChrisHegarty

@msokolov ok, this is what I was thinking - f1e0007. There might be other places too; this was just a start to see how it fares in benchmarks.

ChrisHegarty avatar Nov 08 '24 11:11 ChrisHegarty

Looking again in a bit more detail, I dislike that it is not possible to get access to the vector values without always creating at least one slice. Previously this was not necessary. The separation that copy() offered between copying and access was subtle, but it worked. I'd like to better understand how much this will affect things. Maybe my last change was sufficient and the extra slice is washed away in the noise!? I'll try to do some benchmarking on this too.

ChrisHegarty avatar Nov 08 '24 12:11 ChrisHegarty

I am now completely confused by this change. And I think that my last commit broke the threading model - even though all tests pass! I'm not saying that we should not do some refactoring here, but I think that the original premise this change is based upon is not quite right.

We have the following use cases:

  1. single-threaded: the scoring can be performed by many scorer instances backed by two slices. Since the scorers can only be scoring sequentially, they can share the effectively thread-local buffer.

  2. multi-threaded: we want to fork a new scorer supplier so that it can be used in a separate thread. Here, obviously, we want a new slice that is unaffected by the creating thread, and which will have its own effectively thread-local buffer.

My understanding is that conflating access and cloning/copying into a single method means that we cannot achieve both the above use cases in an optimal way.
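To make the distinction concrete, the two patterns look roughly like this today (a simplified sketch, assuming the supplier API on main with scorer(ord) and copy(); error handling elided):

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import org.apache.lucene.util.hnsw.RandomVectorScorer;
import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;

class ScorerThreadingPatterns {

  // 1. single-threaded: many short-lived scorers from the same supplier can
  //    share the underlying slices and the effectively thread-local buffer.
  static float scoreSequentially(RandomVectorScorerSupplier supplier, int ordA, int ordB, int node)
      throws IOException {
    RandomVectorScorer sa = supplier.scorer(ordA);
    RandomVectorScorer sb = supplier.scorer(ordB); // backed by the same slices as sa
    return sa.score(node) + sb.score(node);
  }

  // 2. multi-threaded: each worker forks its own supplier via copy(), getting
  //    an independent slice and its own effectively thread-local buffer.
  static void scoreConcurrently(RandomVectorScorerSupplier supplier, ExecutorService executor, int ord, int node)
      throws IOException {
    RandomVectorScorerSupplier forked = supplier.copy();
    Callable<Float> task = () -> forked.scorer(ord).score(node);
    executor.submit(task);
  }
}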

ChrisHegarty avatar Nov 08 '24 12:11 ChrisHegarty

OK; re allocating vectors in the supplier: yes, that can't work because of aliasing across threads. I also found we do not adequately test this case. There are many things in here that are not tested; I think we need to add randomization in RandomCodec as we have for other codecs (postings, docvalues) to get better coverage, and we should also explicitly test many of the codec options we have added (multithreaded merges, different forms of quantization). Maybe we have coverage for some of these things, but it looks to me as if our format-specific tests just choose default options.

My understanding is that conflating access and cloning/copying into a single method means that we cannot achieve both the above use cases in an optimal way.

I am coming to a similar conclusion. Although I do think this API change can be made to work if we cache the scorers in the suppliers and return them when done (I was thinking of doing this by making them closeable), this does add a lot of extra bookkeeping and I'm not sure it's going to end up being better in the end.

As a side note, I see that IndexInput is Closeable -- but I don't think we ever close() the ones we allocate? Is this a problem?

msokolov avatar Nov 08 '24 13:11 msokolov

As a side note, I see that IndexInput is Closeable -- but I don't think we ever close() the ones we allocate? Is this a problem?

Not a problem per se. The creator of the scorers needs to ensure that the primary IndexInput is closed. Clones do not need to be closed. My understanding is that all these primary index inputs are closed.

(I was thinking of doing this by making them closeable), this does add a lot of extra bookkeeping and I'm not sure it's going to end up being better in the end.

I'm not against this, but I remain to be convinced that it is worth it.

There are many things in here that are not tested;

++ its separate, but I completely agree that we need better tests.

ChrisHegarty avatar Nov 08 '24 15:11 ChrisHegarty

the last commit demonstrates re-using the resource-intensive bits (RandomVectorScorer and Floats) for the floating point case. It brings heap allocations back to a level similar to what we see on main.

msokolov avatar Nov 08 '24 17:11 msokolov

the last commit demonstrates re-using the resource-intensive bits (RandomVectorScorer and Floats) for the floating point case. It brings heap allocations back to a level similar to what we see on main.

Ok, this is great - we correctly understand where the allocation was coming from and have a way to avoid it.

I'm still not sure that this refactor is better than the model that we have today. And the need to make the scorer closeable kinda hints towards that. (I'm not against the fact that the scorer is closeable, just that it adds little value to the API and is not enforced - we don't want to enforce its closed status.)

ChrisHegarty avatar Nov 08 '24 17:11 ChrisHegarty