Feature/scalar quantized off heap scoring
This adds off-heap scoring for our scalar quantization.
Opening as DRAFT as I still haven't fully tested out the performance characteristics. Opening early for discussion.
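For anyone skimming: the core idea is to score a quantized document vector directly against the memory-mapped file segment, instead of first copying it into a `byte[]` on heap. A minimal sketch of that shape (names are illustrative, not the actual Lucene scorer; the demo backs the "off-heap" segment with a heap array, whereas the PR reads from a mapped index file):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch: dot product between an on-heap query vector and a
// quantized document vector read straight from a MemorySegment, with no
// intermediate byte[] copy of the document vector.
class OffHeapScorer {
  static int dotProduct(byte[] query, MemorySegment docs, long docOffset) {
    int sum = 0;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * docs.get(ValueLayout.JAVA_BYTE, docOffset + i);
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] query = {1, 2, 3, 4};
    // Demo only: a heap-backed segment holding two 4-dim doc vectors.
    MemorySegment docs = MemorySegment.ofArray(new byte[] {2, 2, 2, 2, 5, 5, 5, 5});
    System.out.println(dotProduct(query, docs, 0)); // doc 0 → 20
    System.out.println(dotProduct(query, docs, 4)); // doc 1 → 50
  }
}
```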
Half-byte is showing up as measurably slower with this change.
Candidate:

```
recall  latency
0.909   0.54
0.911   0.58
0.919   0.88
```

Baseline:

```
recall  latency
0.909   0.30
0.911   0.33
0.919   0.47
```
Full-byte is slightly faster:

Candidate:

```
recall  latency
0.962   0.41
0.966   0.43
0.978   0.66
```

Baseline:

```
recall  latency
0.962   0.47
0.966   0.48
0.978   0.73
```
are you reporting indexing times? query times?
Query times, single segment, 10k docs of 1024 dims.
Ok, I double-checked, and indeed half-byte is way slower when reading directly from memory segments instead of reading on heap. memsegment_vs_baseline.zip
The flamegraphs are wildly different: much more time is being spent reading from the memory segment and then comparing the vectors.
candidate (this PR):
baseline:
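The two access patterns the flamegraphs are contrasting can be sketched roughly as follows (illustrative names, not the actual scorer code): the candidate reads element by element straight off the segment, while the baseline bulk-copies the vector on heap once and compares there. Both return the same score; the difference is purely in where the reads happen.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch of the two access patterns being compared; names are illustrative.
class AccessPatterns {
  // Candidate-style: compare element by element straight off the segment.
  static int compareOffHeap(byte[] q, MemorySegment seg, long off, int dims) {
    int sum = 0;
    for (int i = 0; i < dims; i++) {
      sum += q[i] * seg.get(ValueLayout.JAVA_BYTE, off + i);
    }
    return sum;
  }

  // Baseline-style: bulk-copy the vector on heap once, then compare there.
  static int compareOnHeap(byte[] q, MemorySegment seg, long off, int dims) {
    byte[] doc = seg.asSlice(off, dims).toArray(ValueLayout.JAVA_BYTE);
    int sum = 0;
    for (int i = 0; i < dims; i++) {
      sum += q[i] * doc[i];
    }
    return sum;
  }
}
```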
@ChrisHegarty have you seen a significant performance regression on MemorySegments & JDK22?
Doing some more digging, I updated my performance tests for this PR to use JDK22, and now it is WAY slower, more than 2x slower, even for full-byte.
For int7, this branch is marginally faster (20%) with JDK21, but basically 2x slower on JDK22.
I wonder if our off-heap scoring for byte vectors also suffers on JDK22. The quantized scorer for int7 is just using those same methods.
To verify it wasn't some weird artifact in my code, I changed my execution path so it always reads the vectors on-heap and then wraps them in a MemorySegment. Now JDK22 performs the same as JDK21 & the current baseline.
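That workaround looks roughly like this (an illustrative sketch, not the PR's code): copy the vector on heap first, then hand downstream code a heap-backed `MemorySegment` view of the array, so the scoring path is unchanged but every read hits the heap.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative version of the workaround: read the vector on-heap first,
// then wrap the heap array as a MemorySegment so the downstream scoring
// code still sees a segment, but all reads are served from the heap.
class HeapBackedSegment {
  static MemorySegment readOnHeapThenWrap(MemorySegment file, long off, int dims) {
    byte[] onHeap = file.asSlice(off, dims).toArray(ValueLayout.JAVA_BYTE);
    return MemorySegment.ofArray(onHeap); // heap-backed segment view
  }
}
```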
It's weird to me that reading from a MemorySegment onto ByteVector objects would be 2x slower on JDK22 than on JDK21.
Regardless, it's already much slower for the int4 case on both JDK 21 & 22.
@benwtrent I was not aware, lemme take a look.
+1 to this feature
I work on Amazon product search, and in one of our searchers we see a high proportion of CPU cycles within HNSW search being spent in copying quantized vectors to heap:
Perhaps off-heap scoring could help us!
@kaivalnp feel free to take my initial work here and dig in deeper.
I haven't benchmarked it recently on later JVMs to figure out why I was experiencing such a weird slowdown when going off heap :/
Thanks @benwtrent! I opened #14863
I am gonna close this as work is progressing elsewhere; also, we should just move to off-heap bulk scoring ;)