8341003: [lworld+fp16] Benchmarks for various Float16 operations
- Adding micro-benchmarks for various Float16 operations (a minimal sketch of the benchmark shape is shown below).
- Adding macro-benchmarks targeting similarity search.
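For reference, each micro presumably follows the standard JMH shape; below is a minimal sketch of what an add micro could look like (hypothetical class and field names; Float16.valueOf and Float16.add are assumed from the lworld+fp16 java.lang.Float16 API):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class Float16AddSketch {
    @Param({"1024"})
    int vectorDim;

    Float16[] src1, src2, dst;

    @Setup
    public void setup() {
        src1 = new Float16[vectorDim];
        src2 = new Float16[vectorDim];
        dst  = new Float16[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            src1[i] = Float16.valueOf((float) i);          // assumed factory
            src2[i] = Float16.valueOf((float) (i + 1));
        }
    }

    @Benchmark
    public void addBenchmark() {
        // Element-wise Float16 add over the whole vector; writing into a
        // field-held array keeps the work observable to JMH.
        for (int i = 0; i < vectorDim; i++) {
            dst[i] = Float16.add(src1[i], src2[i]);        // assumed static API
        }
    }
}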
Please find below the results of performance testing on an Intel Xeon 6 (Granite Rapids) system:
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OpsBenchmark.absBenchmark 1024 thrpt 2 25605.990 ops/ms
Float16OpsBenchmark.addBenchmark 1024 thrpt 2 19222.468 ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 528.738 ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 660.018 ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 659.799 ops/ms
Float16OpsBenchmark.divBenchmark 1024 thrpt 2 1974.039 ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 743.071 ops/ms
Float16OpsBenchmark.euclideanDistanceFP16 1024 thrpt 2 682.440 ops/ms
Float16OpsBenchmark.fmaBenchmark 1024 thrpt 2 14052.422 ops/ms
Float16OpsBenchmark.isFiniteBenchmark 1024 thrpt 2 3851.234 ops/ms
Float16OpsBenchmark.isInfiniteBenchmark 1024 thrpt 2 1496.207 ops/ms
Float16OpsBenchmark.isNaNBenchmark 1024 thrpt 2 2778.822 ops/ms
Float16OpsBenchmark.maxBenchmark 1024 thrpt 2 19231.326 ops/ms
Float16OpsBenchmark.minBenchmark 1024 thrpt 2 19257.589 ops/ms
Float16OpsBenchmark.mulBenchmark 1024 thrpt 2 19236.498 ops/ms
Float16OpsBenchmark.negateBenchmark 1024 thrpt 2 25938.789 ops/ms
Float16OpsBenchmark.sqrtBenchmark 1024 thrpt 2 1759.051 ops/ms
Float16OpsBenchmark.subBenchmark 1024 thrpt 2 19242.967 ops/ms
Best Regards, Jatin
Progress
- [x] Change must not contain extraneous whitespace
Issue
- JDK-8341003: [lworld+fp16] Benchmarks for various Float16 operations (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/1254/head:pull/1254
$ git checkout pull/1254
Update a local copy of the PR:
$ git checkout pull/1254
$ git pull https://git.openjdk.org/valhalla.git pull/1254/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 1254
View PR using the GUI difftool:
$ git pr show -t 1254
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/1254.diff
:wave: Welcome back jbhateja! A progress list of the required criteria for merging this PR into lworld+fp16 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@jatin-bhateja This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8341003: [lworld+fp16] Benchmarks for various Float16 operations
Reviewed-by: bkilambi
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 1 new commit pushed to the lworld+fp16 branch:
- fb5b1b16c57266508afd96f6a78f90996d28017f: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target
Please see this link for an up-to-date comparison between the source branch of this pull request and the lworld+fp16 branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the lworld+fp16 branch, type /integrate in a new comment.
Hi @Bhavana-Kilambi , I see vector IR in almost all the micros apart from three, i.e. isNaN, isFinite and isInfinite, with the following command:
numactl --cpunodebind=1 -l java -jar target/benchmarks.jar -jvmArgs "-XX:+TraceNewVectors" -p vectorDim=512 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark.<BM_NAME>
This indicates that the Java implementation in those cases is not getting auto-vectorized. We didn't have benchmarks earlier; after tuning, we can verify with these new ones.
Kindly let me know if the micros look good, so I can integrate them.
Hi @jatin-bhateja , thanks for doing the micros. Can I please ask why you are benchmarking/testing the cosine similarity kernels specifically? Are there any real-world use cases similar to these for FP16 for which you have written these smaller benchmark kernels?
Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?
Hi @Bhavana-Kilambi , this patch adds micro-benchmarks for all Float16 APIs optimized so far. The macro-benchmarks demonstrate a use case for low-precision semantic search primitives.
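To make the macro kernels concrete, a single-rounding FP16 cosine similarity could look roughly like the sketch below (hypothetical code, not necessarily the patch's exact kernel; Float16.fma, Float16.valueOf, and floatValue() are assumed from the lworld+fp16 API). The double-rounding variant would use a separate multiply and add (two roundings per step), and the dequantized variant would convert each element to float up front and accumulate in FP32:

// Hypothetical sketch: cosine similarity with Float16 accumulation.
// "Single rounding" means each multiply-add rounds once, via fma.
static float cosineSimilaritySingleRounding(Float16[] a, Float16[] b) {
    Float16 dot = Float16.valueOf(0.0f);
    Float16 na  = Float16.valueOf(0.0f);
    Float16 nb  = Float16.valueOf(0.0f);
    for (int i = 0; i < a.length; i++) {
        dot = Float16.fma(a[i], b[i], dot);   // dot product
        na  = Float16.fma(a[i], a[i], na);    // squared norm of a
        nb  = Float16.fma(b[i], b[i], nb);    // squared norm of b
    }
    return dot.floatValue() / (float) Math.sqrt(na.floatValue() * nb.floatValue());
}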
@jatin-bhateja , thanks! While we are on the topic, can I ask if there are any real-world use cases or workloads that you are targeting the FP16 work for, and maybe plan to do performance testing on in the future?
Hey, for the baseline we should not pass --enable-preview, since it will prohibit the following:
- Flat layout of Float16 arrays.
- Creating Valhalla-specific IR needed for intrinsification.
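For example, the baseline (no --enable-preview) and full runs could be launched along these lines (hypothetical invocations mirroring the command shown earlier; exact flags depend on the build):
$ java -jar target/benchmarks.jar -p vectorDim=1024 -f 1 -i 2 -wi 1 org.openjdk.bench.java.lang.Float16OpsBenchmark
$ java -jar target/benchmarks.jar -jvmArgs "--enable-preview" -p vectorDim=1024 -f 1 -i 2 -wi 1 org.openjdk.bench.java.lang.Float16OpsBenchmark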
Here are the first baseline numbers without --enable-preview.
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OpsBenchmark.absBenchmark 1024 thrpt 2 99.424 ops/ms
Float16OpsBenchmark.addBenchmark 1024 thrpt 2 97.498 ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 525.360 ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 51.132 ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 46.921 ops/ms
Float16OpsBenchmark.divBenchmark 1024 thrpt 2 97.186 ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 583.051 ops/ms
Float16OpsBenchmark.euclideanDistanceFP16 1024 thrpt 2 56.133 ops/ms
Float16OpsBenchmark.fmaBenchmark 1024 thrpt 2 81.386 ops/ms
Float16OpsBenchmark.getExponentBenchmark 1024 thrpt 2 2257.619 ops/ms
Float16OpsBenchmark.isFiniteBenchmark 1024 thrpt 2 3086.476 ops/ms
Float16OpsBenchmark.isInfiniteBenchmark 1024 thrpt 2 1718.411 ops/ms
Float16OpsBenchmark.isNaNBenchmark 1024 thrpt 2 1685.557 ops/ms
Float16OpsBenchmark.maxBenchmark 1024 thrpt 2 92.078 ops/ms
Float16OpsBenchmark.minBenchmark 1024 thrpt 2 63.377 ops/ms
Float16OpsBenchmark.mulBenchmark 1024 thrpt 2 98.202 ops/ms
Float16OpsBenchmark.negateBenchmark 1024 thrpt 2 98.158 ops/ms
Float16OpsBenchmark.sqrtBenchmark 1024 thrpt 2 83.760 ops/ms
Float16OpsBenchmark.subBenchmark 1024 thrpt 2 98.200 ops/ms
Following are the numbers where we do allow the flat array layout but disable only the intrinsics (-XX:DisableIntrinsic=<INTRIN_ID>).
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OpsBenchmark.absBenchmark 1024 thrpt 2 25978.876 ops/ms
Float16OpsBenchmark.addBenchmark 1024 thrpt 2 6406.685 ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 528.877 ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 76.680 ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 53.692 ops/ms
Float16OpsBenchmark.divBenchmark 1024 thrpt 2 3227.037 ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 740.490 ops/ms
Float16OpsBenchmark.euclideanDistanceFP16 1024 thrpt 2 83.747 ops/ms
Float16OpsBenchmark.fmaBenchmark 1024 thrpt 2 256.399 ops/ms
Float16OpsBenchmark.getExponentBenchmark 1024 thrpt 2 2135.678 ops/ms
Float16OpsBenchmark.isFiniteBenchmark 1024 thrpt 2 3916.860 ops/ms
Float16OpsBenchmark.isInfiniteBenchmark 1024 thrpt 2 1497.417 ops/ms
Float16OpsBenchmark.isNaNBenchmark 1024 thrpt 2 2747.704 ops/ms
Float16OpsBenchmark.maxBenchmark 1024 thrpt 2 3625.708 ops/ms
Float16OpsBenchmark.minBenchmark 1024 thrpt 2 3628.261 ops/ms
Float16OpsBenchmark.mulBenchmark 1024 thrpt 2 6340.403 ops/ms
Float16OpsBenchmark.negateBenchmark 1024 thrpt 2 25727.870 ops/ms
Float16OpsBenchmark.sqrtBenchmark 1024 thrpt 2 157.519 ops/ms
Float16OpsBenchmark.subBenchmark 1024 thrpt 2 6404.047 ops/ms
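For reference, a run of this no-intrinsics configuration could look like the following (hypothetical invocation; DisableIntrinsic is a diagnostic flag, and the intrinsic IDs are left as a placeholder as in the note above):
$ java -jar target/benchmarks.jar -jvmArgs "--enable-preview -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=<INTRIN_ID>" -p vectorDim=1024 -f 1 -i 2 -wi 1 org.openjdk.bench.java.lang.Float16OpsBenchmark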
Hey @Bhavana-Kilambi , on real-world use cases: we have some ideas, but for now my intent is to add micros, plus a few demonstrating macros, for each API we have accelerated.
Thanks for sharing the numbers. So in the first case, without the --enable-preview flag, it would have generated only scalar FP32 operations, and in the second case, where the flat array layout is allowed, it generates vector instructions, but for FP32, right?
Yes.
Let me know if you have other comments on the micros, or kindly approve if it's good to integrate.
I am just running the tests on one of our machines. Can I confirm in a while, please? The tests otherwise look fine to me.
BTW, are you generating the min instruction for max and the max instruction for min in c2_MacroAssembler_x86.cpp?
My bad, good catch, thanks!
/integrate
Going to push as commit 0ce9f0fa94c1cd66aeb8ae4763ef59a1d6841dc9.
Since your change was applied there has been 1 commit pushed to the lworld+fp16 branch:
- fb5b1b16c57266508afd96f6a78f90996d28017f: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target
Your commit was automatically rebased without conflicts.
@jatin-bhateja Pushed as commit 0ce9f0fa94c1cd66aeb8ae4763ef59a1d6841dc9.
:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.