New JMH benchmark method - vdot8s that implements int8 dotProduct in C…
Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed
Description
Implements vectorized dot product for byte[] vectors in native C code using SVE and Neon intrinsics.
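For context, here is a minimal sketch of what a NEON dot-product kernel of this shape can look like. This is not the PR's actual `dotProduct.c` (which unrolls further and also has an SVE path); the function name, the single accumulator, and the scalar tail are illustrative only, and it assumes a toolchain targeting `armv8.2-a+dotprod`.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch, not the PR's code: one SDOT accumulator plus a scalar tail. */
int32_t vdot8s_neon_sketch(const int8_t *a, const int8_t *b, size_t n) {
  int32x4_t acc = vdupq_n_s32(0);
  size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    int8x16_t va = vld1q_s8(a + i);
    int8x16_t vb = vld1q_s8(b + i);
    acc = vdotq_s32(acc, va, vb);   /* 16 signed byte products folded into 4 int32 lanes */
  }
  int32_t sum = vaddvq_s32(acc);    /* horizontal add of the 4 lanes */
  for (; i < n; i++) {
    sum += a[i] * b[i];             /* scalar tail */
  }
  return sum;
}
```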
TODOs
- Some more plumbing pending; marking as draft as work is incomplete.
- The native 'C' code is now moved to a separate `native` module and tests in `core` now depend on it. Ideally I'd prefer to only enable this dependency on supported platforms but I am not sure how to do this in a clean way.
- Native dot product is only enabled on the `aarch64` architecture and is applicable only to `byte[]` vectors. Additional conditions to enable/disable this need to be added.
- Rudimentary correctness testing of native code was done in `int main(...)` in `dotProduct.c`. This code should be converted to unit tests exercised using CUnit support in Gradle.
- JMH benchmarks on `Graviton2` and `Apple M3` need to be redone as some bugs in native code were fixed.
- KNN benchmarks on scalar quantized vectors need to be done to assess indexing/querying impact.
NOTE
I had to export the environment variable CC in my shell environment for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc
Build Lucene
$ pwd
/home/goankur/open-source/lucene
$ ./gradlew build
Generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so
and the JMH benchmarking JAR.
Run Benchmark JAR
java --enable-native-access=ALL-UNNAMED \
--enable-preview \
-Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
-jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar \
regexp "(.*)?(binaryDotProductVector|dot8s)"
IGNORE -- These need to be redone
Summary
- Throughput gains relative to the Panama API based Java implementation of `int8` dot-product:
  - Graviton2: `10X` (NEON intrinsics)
  - Graviton3: `4.3X` (SVE intrinsics)
  - Graviton4: TBD
  - Apple M3 Pro: `8X` (NEON intrinsics)
- On Graviton2, performance of the manually unrolled loop vectorized using NEON intrinsics is equivalent to auto-vectorized/auto-unrolled code. SVE intrinsics are not available.
- On Graviton3, the manually unrolled loop vectorized using SVE intrinsics provides a `+9.8%` throughput gain (score: 45.635) on top of auto-vectorized/auto-unrolled code (score: 41.729).
- [CAUTION]: `Graviton4` results look suspicious and need to be re-evaluated.
- On Apple M3 Pro, performance of the manually unrolled loop vectorized using NEON intrinsics is `2.38X` FASTER than auto-vectorized/auto-unrolled code. SVE intrinsics are not available.
Test Environment
- GCC/Clang flags: `--shared -O3 -march=native -funroll-loops`

| Machine | Operating System | Compiler | JDK | Gradle |
|---|---|---|---|---|
| Graviton2 (m6g.4xlarge) | Amazon Linux 2 | GCC 10.5.0 | Amazon Corretto 21 | Gradle 8.8 |
| Graviton3 (m7g.4xlarge) | Amazon Linux 2 | GCC 10.5.0 | Amazon Corretto 21 | Gradle 8.8 |
| Graviton4 (r8g.4xlarge) | Amazon Linux 2023 | GCC 11.4.1 | Amazon Corretto 21 | Gradle 8.8 |
| Apple M3 | Mac OS Sonoma (14.5) | Apple clang 15.0.0 | OpenJDK 21.0.3 | Gradle 8.8 |
Results
Graviton 2 (m6g.4xlarge)
| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 2.857 | ± 0.014 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 28.580 | ± 0.189 | ops/us |
| VectorUtilBenchmark.neonVdot8s | 768 | thrpt | 15 | 28.722 | ± 0.454 | ops/us |
| VectorUtilBenchmark.sveVdot8s | 768 | thrpt | 15 | 28.892 | ± 0.340 | ops/us |
Graviton 3 (m7g.4xlarge)
| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.563 | ± 0.004 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 41.729 | ± 1.060 | ops/us |
| VectorUtilBenchmark.neonVdot8s | 768 | thrpt | 15 | 41.032 | ± 0.215 | ops/us |
| VectorUtilBenchmark.sveVdot8s | 768 | thrpt | 15 | 45.635 | ± 0.337 | ops/us |
[TBD] - Graviton 4 (r8g.4xlarge)
Apple M3 Pro
| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.399 | ± 0.025 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 34.288 | ± 0.062 | ops/us |
| VectorUtilBenchmark.neonVdot8s | 768 | thrpt | 15 | 81.013 | ± 0.569 | ops/us |
| VectorUtilBenchmark.sveVdot8s | 768 | thrpt | 15 | 80.657 | ± 0.760 | ops/us |
Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot product instruction, given the correct flags:
https://godbolt.org/z/KG1dPnrqn
https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
I haven't benchmarked, it just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.
My cheap 2021 phone has the asimddp feature in /proc/cpuinfo, so dot product support seems widespread.
You can use it directly via intrinsic, too, no need to use add/multiply intrinsics: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product
But unless it is really faster than what GCC does with simple C, no need.
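For illustration, the code shape in question is just a plain widening multiply-accumulate loop like the sketch below (the name is mine, not from the PR); with flags along the lines of -O3 -march=armv8.2-a+dotprod, GCC can autovectorize it to SDOT as in the godbolt link above.

```c
#include <stddef.h>
#include <stdint.h>

/* No intrinsics: signed bytes widened to int and accumulated. This is the
 * shape the autovectorizer recognizes and maps to SDOT on ARM (or to AVX
 * sequences on x86) given the right -march flags. Illustrative only. */
int32_t dot8s_plain(const int8_t *a, const int8_t *b, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```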
> Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot product instruction, given the correct flags: https://godbolt.org/z/KG1dPnrqn [...]
> But unless it is really faster than what GCC does with simple C, no need.
With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).
> With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).
This seems correct to me. The Java Vector API is not performant for the integer case. HotSpot doesn't understand ARM much at all and definitely doesn't even have instructions such as SDOT in its vocabulary.
> > With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). [...]
>
> This seems correct to me.
Same. The performance benefit is large here. We do something similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.
https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c
> The Java Vector API is not performant for the integer case. HotSpot doesn't understand ARM much at all and definitely doesn't even have instructions such as SDOT in its vocabulary.
++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.
Could Lucene ever have this directly in one of its modules? We currently use the FlatVectorsScorer to plug in the "native code optimized" alternative, when scoring Scalar Quantized vectors. But Lucene misses out on this by default, which is a pity.
Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (https://github.com/apache/lucene/pull/13336)
> I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.
It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all, see https://llvm.org/docs/Vectorizers.html and also there are some hints in the ARM docs.
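As a concrete (hedged) example of those clang pragmas, the same plain kernel could be annotated like this; the function name is mine and this is only a sketch, not something from this PR:

```c
#include <stddef.h>
#include <stdint.h>

int32_t dot8s_plain_clang(const int8_t *a, const int8_t *b, size_t n) {
  int32_t sum = 0;
  /* Hints for the LLVM loop vectorizer; GCC generally does not need them. */
#pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```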
One issue would be which to compile for/detect:
IMO for arm there would be two good ones:
- armv8.2-a+dotprod (NEON)
- armv9-a (SVE)
For x86 probably at most three?:
- x86-64-v3 (AVX2)
- x86-64-v4 (AVX512)
- VNNI (should be tested but seems fit for the task? but not widespread? cascade-lake+?)
Here is my proposal visually: https://godbolt.org/z/6fcjPWojf
As you can see, by passing -march=cascadelake it generates VNNI instructions.
IMO, no need for any intrinsics anywhere, for x86 nor ARM. Just a dead-simple C function and being smart about which targets we compile for.
(these variants should all be benchmarked of course first)
edit: I modified it to use the same GCC compiler version across all variants. This newer version unrolls the SVE, too.
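On the compile-for/detect question, here is a rough sketch of the kind of runtime check that could pick between precompiled variants on Linux/aarch64; the helper name and the idea of exactly three variants are my assumptions, not part of this PR or patch.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>

/* Returns which precompiled kernel to dispatch to: 2 = SVE, 1 = NEON+dotprod,
 * 0 = generic fallback. Hypothetical helper, for illustration only. */
int pick_dot_product_variant(void) {
  unsigned long hwcap = getauxval(AT_HWCAP);
  if (hwcap & HWCAP_SVE) {
    return 2;               /* armv9-a / SVE build */
  }
  if (hwcap & HWCAP_ASIMDDP) {
    return 1;               /* armv8.2-a+dotprod (NEON SDOT) build */
  }
  return 0;                 /* generic build */
}
#endif
```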
And I see, from playing around with compiler versions, the advantage of the intrinsics approach (although I worry how many variants we'd maintain): it would give stability across Lucene releases without worrying about changes to the underlying C compiler, no crazy aggressive compiler flags needed, etc.
But we should at least START with autovectorizer and look/benchmark various approaches it uses, such as VNNI one in the example above.
I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.
There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh
I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.
Just to expand a little on a previous comment I made above.
> Could Lucene ever have this directly in one of its modules?
An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.
> An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.
There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build becomes more different on different platforms (can we support windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally we'd want to build the java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"?
> ++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.
Thanks Chris. Please share the link to the OpenJDK issue when you get a chance.
> I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.
> There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh
> I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.
Let me try to squeeze some cycles out next week and see if I can make progress on this front.
go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house.
Attached is a patch to get x86 support working.
It makes some changes to the build: specifically the java code statically picks the best MethodHandle (SVE, Neon, Generic), and it's able to compile Generic on any architecture/compiler (e.g. x86). I added necessary guards to the arm SVE/Neon code for this to work.
Changes were necessary to dodge performance traps with MethodHandle or wrong compiler flags. But I think it makes the build more straightforward: it builds native as you expect; if you want to use a different compiler, set CC env vars etc. differently.
Just with generic C code I see a good lift on avx-256 skylake:
| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.019 | ± 0.080 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 17.492 | ± 0.359 | ops/us |
Next step is to add x86 intrinsics (e.g. VNNI) and try to max out the hardware.
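As a hedged sketch of what such a VNNI kernel could look like (not part of this patch; the function name is hypothetical), it leans on the earlier note that Lucene's scalar quantization stays in 0..127: vpdpbusd multiplies unsigned bytes by signed bytes, which only matches a signed int8 dot product when one operand stays non-negative.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Requires AVX-512F + AVX512VNNI (e.g. -march=cascadelake). Assumes the
 * values in 'a' are in 0..127 so the unsigned interpretation used by
 * vpdpbusd matches the signed one. Illustrative sketch only. */
int32_t dot7u_vnni(const int8_t *a, const int8_t *b, size_t n) {
  __m512i acc = _mm512_setzero_si512();
  size_t i = 0;
  for (; i + 64 <= n; i += 64) {
    __m512i va = _mm512_loadu_si512((const void *)(a + i));
    __m512i vb = _mm512_loadu_si512((const void *)(b + i));
    acc = _mm512_dpbusd_epi32(acc, va, vb); /* four byte products folded per int32 lane */
  }
  int32_t sum = _mm512_reduce_add_epi32(acc);
  for (; i < n; i++) {
    sum += a[i] * b[i]; /* scalar tail */
  }
  return sum;
}
```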
TODO: need to examine the avx-256 difference of auto-vectorized C vs the Java vector API for the integers here. This isn't nearly as bad as the ARM case (where we understand many of the underlying problems already), but I'm surprised to see such a lift; it would be good to understand it.
Also, it would be good to compare apples-to-apples here. Currently, from what I see, the benchmark compares dot8s(MemorySegment..) vs BinaryDotProduct(byte[]). To me this mixes up concerns about MemorySegment vs java heap with C vs vector API, so it would be good to resolve for understanding purposes.
Hey Robert — Apologies for the delay and thanks for iterating with me on this one. I will incorporate your feedback and update this PR by Aug 1st, 2024.
> But I think it makes the build more straightforward: it builds native as you expect; if you want to use a different compiler, set CC env vars etc. differently.
I still get compilation problems on Graviton 3 with the toolchain-specific changes to lucene/core/build.gradle. I haven't tested on Graviton 2 but am pretty sure I would face the same problem.
### Steps to reproduce on Graviton 3 box
$ wget https://github.com/user-attachments/files/16449392/x86.patch.txt .
$ git apply --include=lucene/core/build.gradle x86.patch.txt
$ git diff
.... Shows the diff has been correctly applied to lucene/core/build.gradle
Compiling with ./gradlew :lucene:core:build throws the error below
> Task :lucene:core:compileDotProductSharedLibraryDotProductC FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':lucene:core:compileDotProductSharedLibraryDotProductC'.
> Error while evaluating property 'compilerVersion' of task ':lucene:core:compileDotProductSharedLibraryDotProductC'.
> No tool chain is available to build for platform 'linux_aarch64':
- Tool chain 'visualCpp' (Visual Studio):
- Visual Studio is not available on this operating system.
- Tool chain 'gcc' (GNU GCC):
- Don't know how to build for platform 'linux_aarch64'.
- Tool chain 'clang' (Clang):
- Don't know how to build for platform 'linux_aarch64'.
* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org.
That is the fault of your graviton setup though, you can see it right in your previous `platforms` logic where it is invoking `cCompiler.executable 'gcc10-cc'`.
Fix GCC to be in path correctly and things will work.
java --enable-native-access=ALL-UNNAMED
--enable-preview
-Djava.library.path="./lucene/core/build/libs/dotProduct/shared"
-jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar
regexp "(.*)?(binaryDotProductVector|dot8s)"
> That is the fault of your graviton setup though, you can see it right in your previous `platforms` logic where it is invoking `cCompiler.executable 'gcc10-cc'`. Fix GCC to be in path correctly and things will work.
OK, so I made sure /usr/bin (containing gcc10 and gcc) was on the PATH, exported the CC=gcc10 environment variable, and cleared Gradle's config cache by executing ./gradlew clean, but I still get the same compilation error:
...
Tool chain 'gcc' (GNU GCC):
- Don't know how to build for platform 'linux_aarch64'.
...
Seems like CC environment variable is not being honored.
From Gradle's documentation:
> So for GCC running on linux, the supported target platforms are 'linux/x86' and 'linux/x86_64'.
> For GCC running on Windows via Cygwin, platforms 'windows/x86' and 'windows/x86_64' are supported.
> (The Cygwin POSIX runtime is not yet modelled as part of the platform, but will be in the future.)
> If no target platforms are defined for a project, then all binaries are built to target a default platform
> named 'current'. This default platform does not specify any architecture or operatingSystem value,
> hence using the default values of the first available tool chain.
I could be misinterpreting things, but judging by the error, Gradle's default toolchain pick (gcc) does not know how to build for the target platform linux_aarch64. To fix this problem, the best I could do (also reflected in the latest commit) is below:
toolChains {
gcc(Gcc) {
target("linux_aarch64") {
cCompiler.executable = System.getenv("CC")
}
}
clang(Clang) {
target("osx_aarch64"){
cCompiler.executable = System.getenv("CC")
}
}
}
Anyways, it is still possible that I am making a mistake somewhere and I am gonna revisit this once I am back after a week or so.
@rmuir - I am going to be out for the next week so please feel free to play with it and further refine the code.
- Once we get some performance numbers from adding x86 intrinsics, I think the next step would be to test the impact on HNSW searching/indexing with int8 quantized vectors.
- Assuming things look promising, we will need community feedback on native code acceptance.
- If there is appetite for having it within Lucene (core/native/misc), then we will need tests + documentation + cleanup before this can be merged.
Sorry @goankur, I'm super-behind, vacations and end of summer. I definitely want to try out the VNNI path but just haven't found the cycles yet.
> Also, it would be good to compare apples-to-apples here. Currently, from what I see, the benchmark compares dot8s(MemorySegment..) vs BinaryDotProduct(byte[]). To me this mixes up concerns about MemorySegment vs java heap with C vs vector API, so it would be good to resolve for understanding purposes.
As I understand, here is the call stack for BinaryDotProduct(byte[])
PanamaVectorUtilSupport.dotProduct(MemorySegment, MemorySegment)
PanamaVectorUtilSupport.dotProduct(byte[], byte[])
VectorUtil.dotProduct(byte[], byte[])
VectorUtilBenchmark.binaryDotProductVector()
So heap allocated byte[] gets wrapped into MemorySegment before the dotProduct computation that uses Panama APIs.
For strict apples-to-apples comparison, one would wrap the heap-allocated byte[] into a MemorySegment and pass it to the native code -- dot8s(MemorySegment..) -- but this does not work and throws the following exception
Caused by: java.lang.IllegalArgumentException: Heap segment not allowed: MemorySegment{ heapBase: Optional[[B@53a133c5] address:0 limit: 768 }
at java.base/jdk.internal.foreign.abi.SharedUtils.unboxSegment(SharedUtils.java:331)
at org.apache.lucene.util.VectorUtil.dot8s(VectorUtil.java:217)
... 13 more
So I can't think of a good way to make the comparison fairer. Do you have any recommendations?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
> Could Lucene ever have this directly in one of its modules? We currently use the FlatVectorsScorer to plug in the "native code optimized" alternative, when scoring Scalar Quantized vectors. But Lucene misses out on this by default, which is a pity.
+1, it's horrible that Lucene cannot offer such performance gains. We are still bottlenecked by the java vector API limitations ...
I like the idea of adding native code options under the lucene/misc module? And if the FlatVectorsScorer is easily swapped to the native optimized version, via a custom Codec with examples in misc, that seems like a viable path forward? It would still not be Lucene's default Codec, but it would at least be an option for users?
> It makes some changes to the build: specifically the java code statically picks the best MethodHandle (SVE, Neon, Generic), and it's able to compile Generic on any architecture/compiler (e.g. x86). I added necessary guards to the arm SVE/Neon code for this to work.
Whoa, this is cool! So with @rmuir's patch we have the necessary switching at run-time (one time static check) to pick the right native (compiled C code relying on good gcc auto-vectorization) implementation of dotProduct, falling back to a generic one if none of the specialized ones apply.
> There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh
Holy smokes this is a cool benchy tool! Imagine if Lucene's nightly benchy could use this to run on all hardware... we'd be able to catch hardware/environment specific odd Lucene regressions.
> Once we get some performance numbers from adding x86 intrinsics, I think the next step would be to test the impact on HNSW searching/indexing with int8 quantized vectors.
Can we test on int7?
int8 is unfortunately currently broken, and likely we should remove it as an option for 10.0.0, and later add it back if/when we can get the complicated math (see discussion on that issue) ironed out.