
8341003: [lworld+fp16] Benchmarks for various Float16 operations

Open jatin-bhateja opened this issue 1 year ago • 3 comments

  • Adding micro-benchmarks for various Float16 operations.
  • Adding similarity search targeting micro-benchmarks.

Please find below the results of performance testing on an Intel Xeon 6 Granite Rapids system:

Benchmark                                               (vectorDim)   Mode  Cnt      Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2  25605.990          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2  19222.468          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2    528.738          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2    660.018          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2    659.799          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2   1974.039          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2    743.071          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2    682.440          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2  14052.422          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2   3851.234          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2   1496.207          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2   2778.822          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2  19231.326          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2  19257.589          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2  19236.498          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2  25938.789          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2   1759.051          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2  19242.967          ops/ms
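For readers unfamiliar with the cosineSimilarity* variants above, here is a minimal plain-Java sketch of the two accumulation styles. The class and method names are illustrative only, and it uses the standard Float.floatToFloat16/float16ToFloat conversions (JDK 20+) rather than the Valhalla Float16 class, so it only approximates what the actual benchmark kernels do:

```java
// Illustrative sketch of the "dequantized" vs. "double rounding" accumulation
// styles behind the cosineSimilarity* benchmarks. Not the benchmark code itself.
public class Fp16CosineSketch {
    // "Dequantized": widen each FP16 element to FP32 once, accumulate in FP32.
    static float cosineDequantized(short[] a, short[] b) {
        float dot = 0f, na = 0f, nb = 0f;
        for (int i = 0; i < a.length; i++) {
            float x = Float.float16ToFloat(a[i]);
            float y = Float.float16ToFloat(b[i]);
            dot += x * y;
            na  += x * x;
            nb  += y * y;
        }
        return dot / (float) Math.sqrt((double) na * nb);
    }

    // "Double rounding": every product and partial sum is rounded back to FP16,
    // mimicking arithmetic performed entirely in half precision.
    static float cosineDoubleRounding(short[] a, short[] b) {
        short dot = 0, na = 0, nb = 0; // FP16 bit patterns; 0 encodes +0.0
        for (int i = 0; i < a.length; i++) {
            float x = Float.float16ToFloat(a[i]);
            float y = Float.float16ToFloat(b[i]);
            dot = add16(dot, Float.floatToFloat16(x * y));
            na  = add16(na,  Float.floatToFloat16(x * x));
            nb  = add16(nb,  Float.floatToFloat16(y * y));
        }
        float fdot = Float.float16ToFloat(dot);
        float fna  = Float.float16ToFloat(na);
        float fnb  = Float.float16ToFloat(nb);
        return fdot / (float) Math.sqrt((double) fna * fnb);
    }

    // FP16 addition emulated via FP32 with a final round back to FP16.
    static short add16(short p, short q) {
        return Float.floatToFloat16(Float.float16ToFloat(p) + Float.float16ToFloat(q));
    }
}
```

The dequantized form keeps the reductions in FP32 and so vectorizes like an ordinary float kernel, while the double-rounding form inserts an FP16 round after every operation.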

Best regards, Jatin


Progress

  • [x] Change must not contain extraneous whitespace

Issue

  • JDK-8341003: [lworld+fp16] Benchmarks for various Float16 operations (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/1254/head:pull/1254
$ git checkout pull/1254

Update a local copy of the PR:
$ git checkout pull/1254
$ git pull https://git.openjdk.org/valhalla.git pull/1254/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 1254

View PR using the GUI difftool:
$ git pr show -t 1254

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/1254.diff

Webrev

Link to Webrev Comment

jatin-bhateja avatar Sep 26 '24 08:09 jatin-bhateja

:wave: Welcome back jbhateja! A progress list of the required criteria for merging this PR into lworld+fp16 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Sep 26 '24 08:09 bridgekeeper[bot]

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8341003: [lworld+fp16] Benchmarks for various Float16 operations

Reviewed-by: bkilambi

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the lworld+fp16 branch:

  • fb5b1b16c57266508afd96f6a78f90996d28017f: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target

Please see this link for an up-to-date comparison between the source branch of this pull request and the lworld+fp16 branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the lworld+fp16 branch, type /integrate in a new comment.

openjdk[bot] avatar Sep 26 '24 08:09 openjdk[bot]

Webrevs

mlbridge[bot] avatar Sep 26 '24 08:09 mlbridge[bot]

Hi @Bhavana-Kilambi, I see vector IR in almost all the micros apart from three, i.e. isNaN, isFinite and isInfinite, with the following command:

numactl --cpunodebind=1 -l java -jar target/benchmarks.jar -jvmArgs "-XX:+TraceNewVectors" -p vectorDim=512 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark.<BM_NAME>

This indicates that the Java implementation in those cases is not getting auto-vectorized. We didn't have benchmarks earlier; after tuning, we can verify with these new benchmarks.
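For context, the three non-vectorizing micros boil down to simple bit tests on the FP16 encoding (1 sign bit, 5 exponent bits, 10 mantissa bits). A minimal standalone sketch, with illustrative method names, of what these kernels reduce to:

```java
// Illustrative bit-level sketch of what the isNaN/isInfinite/isFinite FP16
// checks reduce to. FP16 layout: 1 sign bit, 5 exponent bits, 10 mantissa bits.
public class Fp16Classify {
    static boolean isNaN16(short bits) {
        // NaN: exponent all ones, mantissa non-zero.
        return ((bits >> 10) & 0x1F) == 0x1F && (bits & 0x3FF) != 0;
    }
    static boolean isInfinite16(short bits) {
        // Infinity: exponent all ones, mantissa zero.
        return ((bits >> 10) & 0x1F) == 0x1F && (bits & 0x3FF) == 0;
    }
    static boolean isFinite16(short bits) {
        // Finite (including subnormals): exponent not all ones.
        return ((bits >> 10) & 0x1F) != 0x1F;
    }
}
```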

Kindly let me know if the micros look good, so that I can integrate them.

jatin-bhateja avatar Sep 26 '24 10:09 jatin-bhateja

Hi @jatin-bhateja, thanks for doing the micros. Can I ask why you are benchmarking/testing the cosine similarity kernels specifically? Are there any real-world use cases similar to these for FP16 for which you have written these smaller benchmark kernels?

Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?

Bhavana-Kilambi avatar Sep 26 '24 15:09 Bhavana-Kilambi

Hi @jatin-bhateja, thanks for doing the micros. Can I ask why you are benchmarking/testing the cosine similarity kernels specifically? Are there any real-world use cases similar to these for FP16 for which you have written these smaller benchmark kernels?

Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?

Hi @Bhavana-Kilambi, this patch adds micro-benchmarks for all Float16 APIs optimized until now. The macro-benchmarks demonstrate a use case for low-precision semantic search primitives.

jatin-bhateja avatar Sep 27 '24 07:09 jatin-bhateja

@jatin-bhateja, thanks! While we are on the topic, can I ask whether there are any real-world use cases or workloads that you are targeting the FP16 work at, and whether you plan to do performance testing on them in the future?

Bhavana-Kilambi avatar Sep 27 '24 07:09 Bhavana-Kilambi

Hi @jatin-bhateja, thanks for doing the micros. Can I ask why you are benchmarking/testing the cosine similarity kernels specifically? Are there any real-world use cases similar to these for FP16 for which you have written these smaller benchmark kernels? Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?

Hi @Bhavana-Kilambi, this patch adds micro-benchmarks for all Float16 APIs optimized until now. The macro-benchmarks demonstrate a use case for low-precision semantic search primitives.

Hey, for the baseline we should not pass --enable-preview, since it would prohibit the following:

  • Flat layout of Float16 arrays.
  • Creating the Valhalla-specific IR needed for intrinsification.

Here are the first baseline numbers without --enable-preview.


Benchmark                                               (vectorDim)   Mode  Cnt     Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2    99.424          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2    97.498          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2   525.360          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2    51.132          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2    46.921          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2    97.186          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2   583.051          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2    56.133          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2    81.386          ops/ms
Float16OpsBenchmark.getExponentBenchmark                       1024  thrpt    2  2257.619          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2  3086.476          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2  1718.411          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2  1685.557          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2    92.078          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2    63.377          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2    98.202          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2    98.158          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2    83.760          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2    98.200          ops/ms

The following are the numbers where we do allow the flat array layout but disable only the intrinsics (-XX:DisableIntrinsic=<INTRINSIC_ID>).


Benchmark                                               (vectorDim)   Mode  Cnt      Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2  25978.876          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2   6406.685          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2    528.877          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2     76.680          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2     53.692          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2   3227.037          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2    740.490          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2     83.747          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2    256.399          ops/ms
Float16OpsBenchmark.getExponentBenchmark                       1024  thrpt    2   2135.678          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2   3916.860          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2   1497.417          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2   2747.704          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2   3625.708          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2   3628.261          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2   6340.403          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2  25727.870          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2    157.519          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2   6404.047          ops/ms

jatin-bhateja avatar Sep 27 '24 07:09 jatin-bhateja

@jatin-bhateja, thanks! While we are on the topic, can I ask whether there are any real-world use cases or workloads that you are targeting the FP16 work at, and whether you plan to do performance testing on them in the future?

Hey, we have some ideas, but for now my intent is to add micros, plus a few demonstrating macros, for each API we have accelerated.

jatin-bhateja avatar Sep 27 '24 07:09 jatin-bhateja

Thanks for sharing the numbers. So in the first case, without the --enable-preview flag, it would have generated only scalar FP32 operations; and in the second case, where the flat array layout is allowed, it generates vector instructions, but for FP32, right?

Bhavana-Kilambi avatar Sep 27 '24 08:09 Bhavana-Kilambi

Thanks for sharing the numbers. So in the first case, without the --enable-preview flag, it would have generated only scalar FP32 operations; and in the second case, where the flat array layout is allowed, it generates vector instructions, but for FP32, right?

Yes.

Let me know if you have other comments on the micros, or kindly approve if it's good to integrate.

jatin-bhateja avatar Sep 27 '24 08:09 jatin-bhateja

I am just running the tests on one of our machines. Can I confirm in a while, please? The tests otherwise look fine to me.

Bhavana-Kilambi avatar Sep 27 '24 08:09 Bhavana-Kilambi

BTW, are you generating the min instruction for max and the max instruction for min in c2_MacroAssembler_x86.cpp?

Bhavana-Kilambi avatar Sep 27 '24 09:09 Bhavana-Kilambi

BTW, are you generating the min instruction for max and the max instruction for min in c2_MacroAssembler_x86.cpp?

My bad, good catch. Thanks!

jatin-bhateja avatar Sep 27 '24 19:09 jatin-bhateja

/integrate

jatin-bhateja avatar Sep 27 '24 19:09 jatin-bhateja

Going to push as commit 0ce9f0fa94c1cd66aeb8ae4763ef59a1d6841dc9. Since your change was applied there has been 1 commit pushed to the lworld+fp16 branch:

  • fb5b1b16c57266508afd96f6a78f90996d28017f: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar Sep 27 '24 19:09 openjdk[bot]

@jatin-bhateja Pushed as commit 0ce9f0fa94c1cd66aeb8ae4763ef59a1d6841dc9.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar Sep 27 '24 19:09 openjdk[bot]