
Feature/scalar quantized off heap scoring

Open benwtrent opened this issue 1 year ago • 9 comments

This adds off-heap scoring for our scalar quantization.

Opening as a DRAFT since I still haven't fully tested the performance characteristics. Opening early for discussion.

benwtrent avatar Jun 17 '24 22:06 benwtrent

Half-byte is showing up as measurably slower with this change.

Candidate:

recall	latency
0.909	 0.54
0.911	 0.58
0.919	 0.88

Baseline:

recall	latency
0.909	 0.30
0.911	 0.33
0.919	 0.47

Full-byte is slightly faster.

Candidate:

recall	latency
0.962	 0.41
0.966	 0.43
0.978	 0.66

Baseline:

recall	latency
0.962	 0.47
0.966	 0.48
0.978	 0.73
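
For context on the half-byte case: "half-byte" here means int4 quantization, where two 4-bit values are packed into each byte, so scoring has to do extra shift/mask work per byte compared to full-byte vectors. The sketch below is my own illustration of that decoding cost (hypothetical names, not Lucene's actual implementation):

```java
// Hypothetical sketch of int4 ("half-byte") packing: two 4-bit values per
// byte. Not Lucene's actual code; it only illustrates the extra shift/mask
// work per byte relative to full-byte vectors.
public final class HalfByteDemo {
  // Pack pairs of 4-bit values (0..15) into a byte array of half the length.
  static byte[] pack(byte[] nibbles) {
    byte[] packed = new byte[(nibbles.length + 1) / 2];
    for (int i = 0; i < nibbles.length; i++) {
      int shift = (i & 1) == 0 ? 0 : 4;
      packed[i >> 1] |= (byte) ((nibbles[i] & 0x0F) << shift);
    }
    return packed;
  }

  // Dot product computed directly on the packed form: every byte read
  // decodes into two nibbles, adding mask/shift ops to the inner loop.
  static int dotProductPacked(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] & 0x0F) * (b[i] & 0x0F);                // low nibble
      sum += ((a[i] >> 4) & 0x0F) * ((b[i] >> 4) & 0x0F);  // high nibble
    }
    return sum;
  }
}
```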

benwtrent avatar Jun 17 '24 23:06 benwtrent

are you reporting indexing times? query times?

msokolov avatar Jun 18 '24 15:06 msokolov

are you reporting indexing times? query times?

Query times, single segment, 10k docs of 1024 dims.

benwtrent avatar Jun 18 '24 16:06 benwtrent

Ok, I double-checked, and indeed, half-byte is way slower when reading directly from memory segments instead of reading on-heap. memsegment_vs_baseline.zip

The flamegraphs are wildly different. Much more time is being spent reading from the memory segment and then comparing the vectors.

candidate (this PR): [flamegraph image]

baseline: [flamegraph image]

benwtrent avatar Jul 10 '24 18:07 benwtrent

@ChrisHegarty have you seen a significant performance regression on MemorySegments & JDK22?

While doing some testing, I updated my performance tests for this PR to use JDK 22, and now it is WAY slower, more than 2x slower, even for full-byte.

For int7, this branch is marginally faster (20%) with JDK21, but basically 2x slower on JDK22.

I wonder if our off-heap scoring for byte vectors also suffers on JDK22. The quantized scorer for int7 is just using those same methods.
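
As a reminder of what "those same methods" amount to: int7 values fit in a signed byte, so int7 scoring can reuse the plain byte dot-product kernel used for full-byte vectors. A minimal sketch (my naming, not Lucene's actual VectorUtil code):

```java
// Hypothetical sketch of the shared byte dot-product kernel: int7 values
// fit in a signed byte, so the same loop serves both int7 and full-byte
// scoring. Not Lucene's actual (SIMD-accelerated) implementation.
public final class ByteDot {
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```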

benwtrent avatar Jul 10 '24 19:07 benwtrent

To verify it wasn't some weird artifact in my code, I slightly changed it so that my execution path always reads the vectors on-heap and then wraps them in a MemorySegment. Now JDK 22 performs the same as JDK 21 and the current baseline.

It's weird to me that reading from a memory segment into ByteVector objects would be 2x slower on JDK 22 than on JDK 21.
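
The two read paths being compared can be sketched roughly as follows (my own names, using `java.lang.foreign` as available since JDK 22; this is not Lucene's actual scorer code):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch of the two read paths: scoring directly against a
// (possibly native) MemorySegment vs. copying the bytes on-heap first and
// wrapping the array in a heap-backed segment. Not Lucene's actual code.
public final class ReadPathDemo {
  // Path 1: read each byte straight from the segment.
  static int sumOffHeap(MemorySegment seg) {
    int sum = 0;
    for (long i = 0; i < seg.byteSize(); i++) {
      sum += seg.get(ValueLayout.JAVA_BYTE, i);
    }
    return sum;
  }

  // Path 2: copy on-heap, then wrap the array in a heap-backed segment
  // (the workaround described above).
  static int sumViaHeapCopy(MemorySegment seg) {
    byte[] onHeap = seg.toArray(ValueLayout.JAVA_BYTE);
    MemorySegment heapSeg = MemorySegment.ofArray(onHeap);
    return sumOffHeap(heapSeg);
  }
}
```

Both paths produce identical results; only the memory-access pattern differs, which is why the JDK 22 slowdown points at segment reads rather than the scoring math.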

Regardless, it's already much slower for the int4 case on both JDK 21 and 22.

benwtrent avatar Jul 10 '24 19:07 benwtrent

Regardless, it's already much slower for the int4 case on both JDK 21 and 22.

@benwtrent I was not aware, lemme take a look.

ChrisHegarty avatar Jul 11 '24 13:07 ChrisHegarty

+1 to this feature

I work on Amazon product search, and in one of our searchers we see a high proportion of CPU cycles within HNSW search being spent copying quantized vectors to the heap:

[profiler screenshot]

Perhaps off-heap scoring could help us!

kaivalnp avatar Jun 25 '25 18:06 kaivalnp

@kaivalnp feel free to take my initial work here and dig in deeper.

I haven't benchmarked it recently on later JVMs to figure out why I was experiencing such a weird slowdown when going off heap :/

benwtrent avatar Jun 25 '25 20:06 benwtrent

Thanks @benwtrent! I opened #14863

kaivalnp avatar Jun 29 '25 10:06 kaivalnp

I am gonna close this as work is progressing elsewhere. Also, we should just move to off-heap bulk scoring ;)

benwtrent avatar Aug 14 '25 15:08 benwtrent