
New JMH benchmark method - vdot8s that implements int8 dotProduct in C…

Open goankur opened this issue 1 year ago • 36 comments

Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed

Description

Implements vectorized dot product for byte[] vectors in native C code using SVE and Neon intrinsics.
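For reference, a minimal NEON sketch of what such a kernel can look like, using the SDOT intrinsic (vdotq_s32) and a scalar tail. This is illustrative only (the SVE path is omitted and the function name is made up), so the PR's actual dotProduct.c may differ; it assumes a dot-product capable target such as -O3 -march=armv8.2-a+dotprod.

// Sketch only -- not the PR's dotProduct.c.
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

int32_t dot8s_neon(const int8_t *a, const int8_t *b, size_t n) {
  int32x4_t acc = vdupq_n_s32(0);
  size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    int8x16_t va = vld1q_s8(a + i);  // load 16 signed bytes from each operand
    int8x16_t vb = vld1q_s8(b + i);
    acc = vdotq_s32(acc, va, vb);    // SDOT: 4 lanes of 4 byte-products, accumulated
  }
  int32_t sum = vaddvq_s32(acc);     // horizontal reduction of the 4 lanes
  for (; i < n; i++) {
    sum += a[i] * b[i];              // scalar tail for n not a multiple of 16
  }
  return sum;
}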

TODOs

  • Some more plumbing is pending; marking as draft as the work is incomplete.
  • The native 'C' code is now moved to a separate native module and tests in core now depend on it. Ideally I'd prefer to enable this dependency only on supported platforms, but I'm not sure how to do this in a clean way.
  • Native dot product is only enabled on the aarch64 architecture and is applicable only to byte[] vectors. Additional conditions to enable/disable it need to be added.
  • Rudimentary correctness testing of the native code was done in int main(...) in dotProduct.c. This code should be converted to unit tests exercised using CUnit support in Gradle.
  • JMH benchmarks on Graviton2 and Apple M3 need to be redone, as some bugs in the native code were fixed.
  • KNN benchmarks on scalar quantized vectors need to be done to assess indexing/querying impact.

NOTE

I had to export the environment variable CC in my shell environment for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc

Build Lucene

$ pwd
/home/goankur/open-source/lucene

$ ./gradlew build

Generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so and the JMH benchmarking JAR.

Run Benchmark JAR

java --enable-native-access=ALL-UNNAMED \
       --enable-preview \
       -Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
       -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar \
      regexp "(.*)?(binaryDotProductVector|dot8s)"

IGNORE -- These need to be redone

Summary

  • Throughput gains relative to the Panama API based Java implementation of int8 dot-product.
    • Graviton2: 10X (NEON intrinsics)
    • Graviton3: 4.3X (SVE intrinsics)
    • Graviton4: TBD
    • Apple M3 Pro: 8X (NEON intrinsics)
  • On Graviton 2, performance of manually unrolled loop vectorized using NEON intrinsics is equivalent to auto-vectorized/auto-unrolled code. SVE intrinsics are not available.
  • On Graviton3, manually unrolled loop vectorized using SVE intrinsics provides +9.8% (score: 45.635) throughput gain on top of auto-vectorized/auto-unrolled code (score: 41.729).
  • [CAUTION]: Graviton4 results look suspicious and need to be re-evaluated.
  • On Apple M3 Pro, performance of manually unrolled loop vectorized using NEON intrinsics is 2.38X FASTER than auto-vectorized/auto-unrolled code. SVE intrinsics are not available.

Test Environment

  • GCC/Clang flags: --shared -O3 -march=native -funroll-loops
AWS Instance Type        Operating System      Compiler            JDK                 Gradle
Graviton2 (m6g.4xlarge)  Amazon Linux 2        GCC 10.5.0          Amazon Corretto 21  Gradle 8.8
Graviton3 (m7g.4xlarge)  Amazon Linux 2        GCC 10.5.0          Amazon Corretto 21  Gradle 8.8
Graviton4 (r8g.4xlarge)  Amazon Linux 2023     GCC 11.4.1          Amazon Corretto 21  Gradle 8.8
Apple M3                 Mac OS Sonoma (14.5)  Apple clang 15.0.0  OpenJDK 21.0.3      Gradle 8.8

Results

Graviton 2 (m6g.4xlarge)

Benchmark                                   (size)   Mode  Cnt   Score    Error  Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15   2.857 ±  0.014  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  28.580 ±  0.189  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  28.722 ±  0.454  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  28.892 ±  0.340  ops/us

Graviton 3 (m7g.4xlarge)

Benchmark                                   (size)   Mode  Cnt   Score    Error  Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.563 ±  0.004  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  41.729 ±  1.060  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  41.032 ±  0.215  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  45.635 ±  0.337  ops/us

[TBD] - Graviton 4 (r8g.4xlarge)


Apple M3 Pro

Benchmark                                   (size)   Mode  Cnt   Score    Error  Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.399 ±  0.025  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  34.288 ±  0.062  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  81.013 ±  0.569  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  80.657 ±  0.760  ops/us

goankur avatar Jul 16 '24 01:07 goankur

Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot product instruction, given the correct flags:

https://godbolt.org/z/KG1dPnrqn

https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-

rmuir avatar Jul 16 '24 05:07 rmuir

I haven't benchmarked, just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.

my cheap 2021 phone has asimddp feature in /proc/cpuinfo, dot product support seems widespread.

You can use it directly via intrinsic, too, no need to use add/multiply intrinsic: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product

But unless it is really faster than what GCC does with simple C, no need.
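For illustration, the "simple C" shape under discussion is roughly the following (a sketch; the exact code behind the godbolt link may differ). With flags such as -O3 -march=armv8.2-a+dotprod, GCC can auto-vectorize this loop to SDOT.

// Sketch of a plain scalar int8 dot product that the compiler can
// auto-vectorize to SDOT given a dot-product capable -march target.
#include <stddef.h>
#include <stdint.h>

int32_t dot8s(const int8_t *a, const int8_t *b, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += a[i] * b[i];  // widening multiply-accumulate maps onto SDOT
  }
  return sum;
}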

rmuir avatar Jul 16 '24 05:07 rmuir

Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot product instruction, given the correct flags:

https://godbolt.org/z/KG1dPnrqn

https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-

I haven't benchmarked, just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.

my cheap 2021 phone has asimddp feature in /proc/cpuinfo, dot product support seems widespread.

You can use it directly via intrinsic, too, no need to use add/multiply intrinsic: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product

But unless it is really faster than what GCC does with simple C, no need.

With the updated compile flags, the performance of the auto-vectorized code is slightly better than the explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

goankur avatar Jul 18 '24 02:07 goankur

With the updated compile flags, the performance of the auto-vectorized code is slightly better than the explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

This seems correct to me. The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM much at all and definitely doesn't even have instructions such as SDOT in its vocabulary.

rmuir avatar Jul 18 '24 02:07 rmuir

With the updated compile flags, the performance of the auto-vectorized code is slightly better than the explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

This seems correct to me.

Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c

The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM much at all and definitely doesn't even have instructions such as SDOT in its vocabulary.

++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.

ChrisHegarty avatar Jul 18 '24 13:07 ChrisHegarty

Could Lucene ever have this directly in one of its modules? We currently use the FlatVectorsScorer to plug in the "native code optimized" alternative when scoring Scalar Quantized vectors. But Lucene misses out on this by default, which is a pity.

Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (https://github.com/apache/lucene/pull/13336)

ChrisHegarty avatar Jul 18 '24 13:07 ChrisHegarty

I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all, see https://llvm.org/docs/Vectorizers.html and also there are some hints in the ARM docs.
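As a sketch of the kind of pragma hint the LLVM vectorizer docs describe (whether clang actually needs it here is an assumption that depends on the version and flags):

// Illustrative only: clang loop-vectorizer hints from the LLVM docs.
#include <stddef.h>
#include <stdint.h>

int32_t dot8s(const int8_t *a, const int8_t *b, size_t n) {
  int32_t sum = 0;
#pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}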

One issue would be which to compile for/detect:

IMO for arm there would be two good ones:

  • armv8.2-a+dotprod (NEON)
  • armv9-a (SVE)

For x86 probably at most three?:

  • x86-64-v3 (AVX2)
  • x86-64-v4 (AVX512)
  • VNNI (should be tested but seems fit for the task? but not widespread? cascade-lake+?)
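Purely as a sketch, per-target variants like the ones listed above could be gated at compile time with standard predefined macros (the macro names are real compiler predefines, but which set to actually build and ship is exactly the open question here):

/* Sketch only: compile-time gating of per-target kernels. */
#if defined(__ARM_FEATURE_SVE)
#  define DOT8S_IMPL "sve"            /* armv9-a */
#elif defined(__ARM_FEATURE_DOTPROD)
#  define DOT8S_IMPL "neon-dotprod"   /* armv8.2-a+dotprod */
#elif defined(__AVX512VNNI__)
#  define DOT8S_IMPL "avx512-vnni"    /* x86-64-v4 + VNNI */
#elif defined(__AVX2__)
#  define DOT8S_IMPL "avx2"           /* x86-64-v3 */
#else
#  define DOT8S_IMPL "generic"
#endif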

rmuir avatar Jul 18 '24 14:07 rmuir

Here is my proposal visually: https://godbolt.org/z/6fcjPWojf

As you can see, by passing -march=cascadelake it generates VNNI instructions. IMO, there's no need for any intrinsics anywhere, for x86 or ARM. Just a dead-simple C function and being smart about which targets we compile for.

(these variants should all be benchmarked of course first)

edit: I modified it to use same GCC compiler version across all variants. This newer version unrolls the SVE, too.

rmuir avatar Jul 18 '24 14:07 rmuir

And I see, from playing around with compiler versions, the advantage of the intrinsics approach (although I worry how many variants we'd maintain): it would give stability across Lucene releases without worrying about changes to the underlying C compiler etc., no crazy aggressive compiler flags needed, etc.

But we should at least START with the autovectorizer and look at/benchmark the various approaches it uses, such as the VNNI one in the example above.

rmuir avatar Jul 18 '24 15:07 rmuir

I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.

There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh

I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.

rmuir avatar Jul 18 '24 15:07 rmuir

Just to expand a little on a previous comment I made above.

Could Lucene ever have this directly in one of its modules?

An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.

ChrisHegarty avatar Jul 19 '24 09:07 ChrisHegarty

An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.

There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build differs more across platforms (can we support Windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than the one you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on Linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally, we'd want to build the Java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"?

msokolov avatar Jul 19 '24 12:07 msokolov

With the updated compile flags, the performance of the auto-vectorized code is slightly better than the explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

This seems correct to me.

Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c

The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM much at all and definitely doesn't even have instructions such as SDOT in its vocabulary.

++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.

Thanks Chris. Please share the link to OpenJDK issue when you get a chance.

goankur avatar Jul 20 '24 02:07 goankur

I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.

There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh

I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.

Let me try to squeeze some cycles out next week and see if I can make progress on this front.

goankur avatar Jul 20 '24 02:07 goankur

go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house.

rmuir avatar Aug 01 '24 01:08 rmuir

Attached is a patch to get x86 support working.

It makes some changes to the build: specifically, the Java code statically picks the best MethodHandle (SVE, Neon, Generic), and it's able to compile the Generic one on any architecture/compiler (e.g. x86). I added the necessary guards to the arm SVE/Neon code for this to work.

Changes were necessary to dodge performance traps with MethodHandle and to avoid using the wrong compiler flags. But I think it makes the build more straightforward: it builds the native code as you'd expect; if you want to use a different compiler, set the CC env var etc. accordingly.

Just with generic C code I see a good lift on avx-256 skylake:

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.019 ± 0.080  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  17.492 ± 0.359  ops/us

Next step is to add x86 intrinsics (e.g. VNNI) and try to max out the hardware.

x86.patch.txt

rmuir avatar Aug 01 '24 03:08 rmuir

TODO: need to examine the avx256 difference of auto-vectorized C vs the Java vector API for the integers here. This isn't nearly as bad as the ARM case (where we understand many of the underlying problems already), but I'm surprised to see such a lift; it would be good to understand it.

rmuir avatar Aug 01 '24 03:08 rmuir

Also, it would be good to compare apples-to-apples here. Currently, from what I see, the benchmark compares dot8s(MemorySegment..) vs BinaryDotProduct(byte[]). To me this mixes up concerns about MemorySegment vs Java heap with C vs the Vector API, so it would be good to resolve for understanding purposes.

rmuir avatar Aug 01 '24 03:08 rmuir

Hey Robert — Apologies for the delay and thanks for iterating with me on this one. I will incorporate your feedback and update this PR by Aug 1st, 2024.

goankur avatar Aug 01 '24 05:08 goankur

But I think it makes the build more straightforward: it builds the native code as you'd expect; if you want to use a different compiler, set the CC env var etc. accordingly.

I still get compilation problems on Graviton 3 with the toolchain-specific changes to lucene/core/build.gradle. I haven't tested on Graviton 2 but I'm pretty sure I would face the same problem.

Steps to reproduce on Graviton 3 box

$ wget https://github.com/user-attachments/files/16449392/x86.patch.txt .
$ git apply --include=lucene/core/build.gradle x86.patch.txt
$ git diff

 .... Shows the diff has been correctly applied to lucene/core/build.gradle

Compiling with ./gradlew :lucene:core:build throws the error below

> Task :lucene:core:compileDotProductSharedLibraryDotProductC FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':lucene:core:compileDotProductSharedLibraryDotProductC'.
> Error while evaluating property 'compilerVersion' of task ':lucene:core:compileDotProductSharedLibraryDotProductC'.
   > No tool chain is available to build for platform 'linux_aarch64':
       - Tool chain 'visualCpp' (Visual Studio):
           - Visual Studio is not available on this operating system.
       - Tool chain 'gcc' (GNU GCC):
           - Don't know how to build for platform 'linux_aarch64'.
       - Tool chain 'clang' (Clang):
           - Don't know how to build for platform 'linux_aarch64'.

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org.

goankur avatar Aug 02 '24 03:08 goankur

That is the fault of your Graviton setup though; you can see it right in your previous platforms logic, where it is invoking cCompiler.executable 'gcc10-cc'.

Fix GCC to be in path correctly and things will work.

rmuir avatar Aug 02 '24 11:08 rmuir

java --enable-native-access=ALL-UNNAMED \
       --enable-preview \
       -Djava.library.path="./lucene/core/build/libs/dotProduct/shared" \
       -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-10.0.0-SNAPSHOT.jar \
       regexp "(.*)?(binaryDotProductVector|dot8s)"

That is the fault of your Graviton setup though; you can see it right in your previous platforms logic, where it is invoking cCompiler.executable 'gcc10-cc'.

Fix GCC to be in path correctly and things will work.

OK, so I made sure '/usr/bin' containing gcc10 and gcc was on the PATH, exported the CC=gcc10 environment variable, and cleared Gradle's config cache by executing ./gradlew clean, but I still get the same compilation error:

...
Tool chain 'gcc' (GNU GCC):
           - Don't know how to build for platform 'linux_aarch64'.
 ...

Seems like the CC environment variable is not being honored.

From Gradle's documentation:

So for GCC running on linux, the supported target platforms are 'linux/x86' and 'linux/x86_64'. 
For GCC running on Windows via Cygwin, platforms 'windows/x86' and 'windows/x86_64' are supported. 
(The Cygwin POSIX runtime is not yet modelled as part of the platform, but will be in the future.)

If no target platforms are defined for a project, then all binaries are built to target a default platform
named 'current'. This default platform does not specify any architecture or operatingSystem value, 
hence using the default values of the first available tool chain.

I could be misinterpreting things, but judging by the error, Gradle's default toolchain pick (gcc) does not know how to build for the target platform linux_aarch64. To fix this problem, the best I could do (also reflected in the latest commit) is below:

toolChains {
    gcc(Gcc) {
        target("linux_aarch64") {
            cCompiler.executable = System.getenv("CC")
        }
    }
    clang(Clang) {
        target("osx_aarch64") {
            cCompiler.executable = System.getenv("CC")
        }
    }
}

Anyways, it is still possible that I am making a mistake somewhere and I am gonna revisit this once I am back after a week or so.

goankur avatar Aug 03 '24 02:08 goankur

@rmuir - I am going to be out for the next week so please feel free to play with it and further refine the code.

  • Once we get some performance numbers from adding x86 intrinsics, I think the next step would be to test the impact on HNSW searching/indexing with int8 quantized vectors.
  • Assuming things look promising, we will need community feedback on native code acceptance.
  • If there is appetite for having it within Lucene (core/native/misc) then we will need tests + documentation + cleanup before this can be merged.

goankur avatar Aug 03 '24 03:08 goankur

sorry @goankur i'm super-behind, vacations and end of summer. i definitely want to try out the VNNI path but just haven't found the cycles yet.

rmuir avatar Aug 14 '24 19:08 rmuir

Also, it would be good to compare apples-to-apples here. Currently, from what I see, the benchmark compares dot8s(MemorySegment..) vs BinaryDotProduct(byte[]). To me this mixes up concerns about MemorySegment vs Java heap with C vs the Vector API, so it would be good to resolve for understanding purposes.

As I understand, here is the call stack for BinaryDotProduct(byte[])

PanamaVectorUtilSupport.dotProduct(MemorySegment, MemorySegment)
PanamaVectorUtilSupport.dotProduct(byte[], byte[])
VectorUtil.dotProduct(byte[], byte[])
VectorUtilBenchmark.binaryDotProductVector()

So the heap-allocated byte[] gets wrapped into a MemorySegment before the dotProduct computation that uses the Panama APIs.

For a strict apples-to-apples comparison, one would wrap the heap-allocated byte[] into a MemorySegment and pass it to the native code -- dot8s(MemorySegment..) -- but this does not work and throws the following exception:

Caused by: java.lang.IllegalArgumentException: Heap segment not allowed: MemorySegment{ heapBase: Optional[[B@53a133c5] address:0 limit: 768 }
        at java.base/jdk.internal.foreign.abi.SharedUtils.unboxSegment(SharedUtils.java:331)
        at org.apache.lucene.util.VectorUtil.dot8s(VectorUtil.java:217)
        ... 13 more

So I can't think of a good way to make the comparison fairer. Do you have any recommendations?

goankur avatar Aug 17 '24 00:08 goankur

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Sep 01 '24 00:09 github-actions[bot]

Could Lucene ever have this directly in one of its modules? We currently use the FlatVectorsScorer to plug in the "native code optimized" alternative when scoring Scalar Quantized vectors. But Lucene misses out on this by default, which is a pity.

+1, it's horrible that Lucene cannot offer such performance gains. We are still bottlenecked by the java vector API limitations ...

I like the idea of adding native code options under the lucene/misc module? And if the FlatVectorsScorer is easily swapped to the native optimized version, via a custom Codec with examples in misc, that seems like a viable path forward? It would still not be Lucene's default Codec, but it would at least be an option for users?

mikemccand avatar Sep 05 '24 15:09 mikemccand

It makes some changes to the build: specifically the java code statically picks the best MethodHandle (SVE, Neon, Generic), and its able to compile Generic on any architecture/compiler (e.g. x86). I added necessary guards to the arm SVE/Neon code for this to work.

Whoa, this is cool! So with @rmuir's patch we have the necessary switching at run-time (one time static check) to pick the right native (compiled C code relying on good gcc auto-vectorization) implementation of dotProduct, falling back to a generic one if none of the specialized ones apply.

mikemccand avatar Sep 05 '24 15:09 mikemccand

There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh

Holy smokes this is a cool benchy tool! Imagine if Lucene's nightly benchy could use this to run on all hardware... we'd be able to catch hardware/environment specific odd Lucene regressions.

mikemccand avatar Sep 05 '24 15:09 mikemccand

Once we get some performance numbers from adding x86 intrinsics, I think, the next step would be to test the impact on HNSW searching/indexing with int8 quantized vectors.

Can we test on int7?

int8 is unfortunately currently broken, and likely we should remove it as an option for 10.0.0, and later add it back if/when we can get the complicated math (see discussion on that issue) ironed out.

mikemccand avatar Sep 05 '24 15:09 mikemccand