8341003: [lworld+fp16] Benchmarks for various Float16 operations
- Adding micro-benchmarks for various Float16 operations (a minimal sketch of the benchmark shape is shown below).
- Adding macro-benchmarks targeting similarity search.
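For reference, each micro presumably follows the standard JMH shape; below is a minimal sketch of what an add micro could look like (hypothetical class and field names; Float16.valueOf and Float16.add are assumed from the lworld+fp16 java.lang.Float16 API):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class Float16AddSketch {
    @Param({"1024"})
    int vectorDim;

    Float16[] src1, src2, dst;

    @Setup
    public void setup() {
        src1 = new Float16[vectorDim];
        src2 = new Float16[vectorDim];
        dst  = new Float16[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            src1[i] = Float16.valueOf((float) i);          // assumed factory
            src2[i] = Float16.valueOf((float) (i + 1));
        }
    }

    @Benchmark
    public void addBenchmark() {
        // Element-wise Float16 add over the whole vector; writing into a
        // field-held array keeps the work observable to JMH.
        for (int i = 0; i < vectorDim; i++) {
            dst[i] = Float16.add(src1[i], src2[i]);        // assumed static API
        }
    }
}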
Please find below the results of performance testing on an Intel Xeon 6 (Granite Rapids) system:
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OpsBenchmark.absBenchmark 1024 thrpt 2 25605.990 ops/ms
Float16OpsBenchmark.addBenchmark 1024 thrpt 2 19222.468 ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 528.738 ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 660.018 ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 659.799 ops/ms
Float16OpsBenchmark.divBenchmark 1024 thrpt 2 1974.039 ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 743.071 ops/ms
Float16OpsBenchmark.euclideanDistanceFP16 1024 thrpt 2 682.440 ops/ms
Float16OpsBenchmark.fmaBenchmark 1024 thrpt 2 14052.422 ops/ms
Float16OpsBenchmark.isFiniteBenchmark 1024 thrpt 2 3851.234 ops/ms
Float16OpsBenchmark.isInfiniteBenchmark 1024 thrpt 2 1496.207 ops/ms
Float16OpsBenchmark.isNaNBenchmark 1024 thrpt 2 2778.822 ops/ms
Float16OpsBenchmark.maxBenchmark 1024 thrpt 2 19231.326 ops/ms
Float16OpsBenchmark.minBenchmark 1024 thrpt 2 19257.589 ops/ms
Float16OpsBenchmark.mulBenchmark 1024 thrpt 2 19236.498 ops/ms
Float16OpsBenchmark.negateBenchmark 1024 thrpt 2 25938.789 ops/ms
Float16OpsBenchmark.sqrtBenchmark 1024 thrpt 2 1759.051 ops/ms
Float16OpsBenchmark.subBenchmark 1024 thrpt 2 19242.967 ops/ms
Best Regards, Jatin
Progress
- [x] Change must not contain extraneous whitespace
Issue
- JDK-8341003: [lworld+fp16] Benchmarks for various Float16 operations (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/1254/head:pull/1254
$ git checkout pull/1254
Update a local copy of the PR:
$ git checkout pull/1254
$ git pull https://git.openjdk.org/valhalla.git pull/1254/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 1254
View PR using the GUI difftool:
$ git pr show -t 1254
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/1254.diff
:wave: Welcome back jbhateja! A progress list of the required criteria for merging this PR into lworld+fp16 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@jatin-bhateja This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8341003: [lworld+fp16] Benchmarks for various Float16 operations
Reviewed-by: bkilambi
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 1 new commit pushed to the lworld+fp16 branch:
- fb5b1b16c57266508afd96f6a78f90996d28017f: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target
Please see this link for an up-to-date comparison between the source branch of this pull request and the lworld+fp16 branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the lworld+fp16 branch, type /integrate in a new comment.
Hi @Bhavana-Kilambi , I see vector IR in almost all the micros apart from three, i.e. isNaN, isFinite and isInfinite, with the following command:
numactl --cpunodebind=1 -l java -jar target/benchmarks.jar -jvmArgs "-XX:+TraceNewVectors" -p vectorDim=512 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark.<BM_NAME>
This indicates that the Java implementation in those cases is not getting auto-vectorized. We didn't have benchmarks earlier; after tuning, we can verify with these new ones.
Kindly let me know if the micros look good, so I can integrate them.
Hi @jatin-bhateja , thanks for doing the micros. Can I please ask why you are benchmarking/testing the cosine similarity kernels specifically? Are there any real-world use cases similar to these for FP16 for which you have written these smaller benchmark kernels?
Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?
Hi @Bhavana-Kilambi , this patch adds micro-benchmarks for all Float16 APIs optimized so far. The macro-benchmarks demonstrate a use case for low-precision semantic search primitives.
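To make the macro kernels concrete, a single-rounding FP16 cosine similarity could look roughly like the sketch below (hypothetical code, not necessarily the patch's exact kernel; Float16.fma, Float16.valueOf, and floatValue() are assumed from the lworld+fp16 API). The double-rounding variant would use a separate multiply and add (two roundings per step), and the dequantized variant would convert each element to float up front and accumulate in FP32:

// Hypothetical sketch: cosine similarity with Float16 accumulation.
// "Single rounding" means each multiply-add rounds once, via fma.
static float cosineSimilaritySingleRounding(Float16[] a, Float16[] b) {
    Float16 dot = Float16.valueOf(0.0f);
    Float16 na  = Float16.valueOf(0.0f);
    Float16 nb  = Float16.valueOf(0.0f);
    for (int i = 0; i < a.length; i++) {
        dot = Float16.fma(a[i], b[i], dot);   // dot product
        na  = Float16.fma(a[i], a[i], na);    // squared norm of a
        nb  = Float16.fma(b[i], b[i], nb);    // squared norm of b
    }
    return dot.floatValue() / (float) Math.sqrt(na.floatValue() * nb.floatValue());
}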
@jatin-bhateja , thanks! While we are on the topic, can I ask if there are any real-world use cases or workloads that you are targeting the FP16 work for, and maybe plan to do performance testing on in the future?
Hey, for the baseline we should not pass --enable-preview, since it will prohibit the following:
- Flat layout of Float16 arrays.
- Creating Valhalla-specific IR needed for intrinsification.
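For example, the baseline (no --enable-preview) and full runs could be launched along these lines (hypothetical invocations mirroring the command shown earlier; exact flags depend on the build):
$ java -jar target/benchmarks.jar -p vectorDim=1024 -f 1 -i 2 -wi 1 org.openjdk.bench.java.lang.Float16OpsBenchmark
$ java -jar target/benchmarks.jar -jvmArgs "--enable-preview" -p vectorDim=1024 -f 1 -i 2 -wi 1 org.openjdk.bench.java.lang.Float16OpsBenchmark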
Here are the first baseline numbers without --enable-preview.
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OpsBenchmark.absBenchmark 1024 thrpt 2 99.424 ops/ms
Float16OpsBenchmark.addBenchmark 1024 thrpt 2 97.498 ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 525.360 ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 51.132 ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 46.921 ops/ms
Float16OpsBenchmark.divBenchmark 1024 thrpt 2 97.186 ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 583.051 ops/ms
Float16OpsBenchmark.euclideanDistanceFP16 1024 thrpt 2 56.133 ops/ms
Float16OpsBenchmark.fmaBenchmark 1024 thrpt 2 81.386 ops/ms
Float16OpsBenchmark.getExponentBenchmark 1024 thrpt 2 2257.619 ops/ms
Float16OpsBenchmark.isFiniteBenchmark 1024 thrpt 2 3086.476 ops/ms
Float16OpsBenchmark.isInfiniteBenchmark 1024 thrpt 2 1718.411 ops/ms
Float16OpsBenchmark.isNaNBenchmark 1024 thrpt 2 1685.557 ops/ms
Float16OpsBenchmark.maxBenchmark 1024 thrpt 2 92.078 ops/ms
Float16OpsBenchmark.minBenchmark 1024 thrpt 2 63.377 ops/ms
Float16OpsBenchmark.mulBenchmark 1024 thrpt 2 98.202 ops/ms
Float16OpsBenchmark.negateBenchmark 1024 thrpt 2 98.158 ops/ms
Float16OpsBenchmark.sqrtBenchmark 1024 thrpt 2 83.760 ops/ms
Float16OpsBenchmark.subBenchmark 1024 thrpt 2 98.200 ops/ms
Following are the numbers where we do allow the flat array layout but disable only the intrinsics (-XX:DisableIntrinsic=<INTRIN_ID>).
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OpsBenchmark.absBenchmark 1024 thrpt 2 25978.876 ops/ms
Float16OpsBenchmark.addBenchmark 1024 thrpt 2 6406.685 ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 528.877 ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 76.680 ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 53.692 ops/ms
Float16OpsBenchmark.divBenchmark 1024 thrpt 2 3227.037 ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 740.490 ops/ms
Float16OpsBenchmark.euclideanDistanceFP16 1024 thrpt 2 83.747 ops/ms
Float16OpsBenchmark.fmaBenchmark 1024 thrpt 2 256.399 ops/ms
Float16OpsBenchmark.getExponentBenchmark 1024 thrpt 2 2135.678 ops/ms
Float16OpsBenchmark.isFiniteBenchmark 1024 thrpt 2 3916.860 ops/ms
Float16OpsBenchmark.isInfiniteBenchmark 1024 thrpt 2 1497.417 ops/ms
Float16OpsBenchmark.isNaNBenchmark 1024 thrpt 2 2747.704 ops/ms
Float16OpsBenchmark.maxBenchmark 1024 thrpt 2 3625.708 ops/ms
Float16OpsBenchmark.minBenchmark 1024 thrpt 2 3628.261 ops/ms
Float16OpsBenchmark.mulBenchmark 1024 thrpt 2 6340.403 ops/ms
Float16OpsBenchmark.negateBenchmark 1024 thrpt 2 25727.870 ops/ms
Float16OpsBenchmark.sqrtBenchmark 1024 thrpt 2 157.519 ops/ms
Float16OpsBenchmark.subBenchmark 1024 thrpt 2 6404.047 ops/ms
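For reference, a run of this no-intrinsics configuration could look like the following (hypothetical invocation; DisableIntrinsic is a diagnostic flag, and the intrinsic IDs are left as a placeholder as in the note above):
$ java -jar target/benchmarks.jar -jvmArgs "--enable-preview -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=<INTRIN_ID>" -p vectorDim=1024 -f 1 -i 2 -wi 1 org.openjdk.bench.java.lang.Float16OpsBenchmark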
Hey @Bhavana-Kilambi , on real-world use cases: we have some ideas, but for now my intent is to add micros, plus a few demonstrating macros, for each API we have accelerated.
Thanks for sharing the numbers. So in the first case, without the --enable-preview flag, it would have generated only scalar FP32 operations, and in the second case, where the flat array layout is allowed, it generates vector instructions, but for FP32, right?
Yes.
Let me know if you have other comments on the micros, or kindly approve if it's good to integrate.
I am just running the tests on one of our machines. Can I confirm in a while, please? The tests otherwise look fine to me.
BTW, are you generating the min instruction for max and the max instruction for min in c2_MacroAssembler_x86.cpp?
My bad, good catch, thanks!
/integrate
Going to push as commit 0ce9f0fa94c1cd66aeb8ae4763ef59a1d6841dc9.
Since your change was applied there has been 1 commit pushed to the lworld+fp16 branch:
- fb5b1b16c57266508afd96f6a78f90996d28017f: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target
Your commit was automatically rebased without conflicts.
@jatin-bhateja Pushed as commit 0ce9f0fa94c1cd66aeb8ae4763ef59a1d6841dc9.
:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.