panama-vector
8365967: C2 compiler support for HalffloatVector operations supported by auto-vectorization flow
Hi All,
This patch extends the Vector API inline expanders to infer Float16 vector IR based on the newly passed operType argument. We intend to leverage the existing IR and backend implementation of auto-vectorized Float16 operations. Various HalffloatVector operators, namely ADD, SUB, MUL, DIV, MAX, MIN, and FMA, now emit FP16 ISA on x86 targets supporting the AVX512-FP16 feature and on AArch64 SVE targets.
Best Regards, Jatin
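For readers who want a reference point for what a lanewise FP16 operation computes, here is a minimal scalar sketch of ADD. It is not code from this patch. It assumes each lane is held as IEEE 754 binary16 bits in a `short` (an assumption about the storage layout), and the class and method names `Fp16LanewiseAdd` and `addFp16` are hypothetical. It relies only on `Float.float16ToFloat` and `Float.floatToFloat16`, which are available in JDK 20 and later.

```java
public class Fp16LanewiseAdd {

    // Scalar reference for a lanewise FP16 ADD.
    // Each short is assumed to hold the raw binary16 bits of one lane.
    static short[] addFp16(short[] a, short[] b) {
        short[] r = new short[a.length];
        for (int i = 0; i < a.length; i++) {
            // Widen both lanes to float, add, then round back to binary16.
            float x = Float.float16ToFloat(a[i]);
            float y = Float.float16ToFloat(b[i]);
            r[i] = Float.floatToFloat16(x + y);
        }
        return r;
    }

    public static void main(String[] args) {
        short[] a = {
            Float.floatToFloat16(1.5f),
            Float.floatToFloat16(2.0f),
            Float.floatToFloat16(2048.0f)
        };
        short[] b = {
            Float.floatToFloat16(0.25f),
            Float.floatToFloat16(-2.0f),
            Float.floatToFloat16(1.0f)
        };
        // Print each result lane after widening it back to float.
        for (short lane : addFp16(a, b)) {
            System.out.println(Float.float16ToFloat(lane));
        }
    }
}
```

The third lane is there to show that the result is rounded to FP16 precision: 2049 is not representable in binary16, so the sum is rounded to a neighbouring representable value. A loop like this is only a functional reference; the point of the patch is that C2 can emit FP16 vector instructions for the Vector API call directly.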
Progress
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
- [ ] Change must be properly reviewed (1 review required, with at least 1 Committer)
Issue
- JDK-8365967: C2 compiler support for HalffloatVector operations supported by auto-vectorization flow (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/panama-vector.git pull/231/head:pull/231
$ git checkout pull/231
Update a local copy of the PR:
$ git checkout pull/231
$ git pull https://git.openjdk.org/panama-vector.git pull/231/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 231
View PR using the GUI difftool:
$ git pr show -t 231
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/panama-vector/pull/231.diff
:wave: Welcome back jbhateja! A progress list of the required criteria for merging this PR into vectorIntrinsics will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
Performance of the FMA benchmark on Intel Xeon Emerald Rapids: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.30GHz
⚠️ @jatin-bhateja This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
What is remaining?
- Functional validation
- Performance validation
- New IR framework-based tests.
- Microbenchmark for FP16-based dotproduct.
@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@jatin-bhateja Unknown command keeplive - for a list of valid commands use /help.
/keepalive
@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset.
Performance of JMH micros
System: Model name: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz
Baseline:
Benchmark (size) Mode Cnt Score Error Units
Halffloat256Vector.ABS 1024 thrpt 2 366.995 ops/ms
Halffloat256Vector.ABSMasked 1024 thrpt 2 345.584 ops/ms
Halffloat256Vector.ACOS 1024 thrpt 2 61.402 ops/ms
Halffloat256Vector.ADD 1024 thrpt 2 259.029 ops/ms
Halffloat256Vector.ADDMasked 1024 thrpt 2 251.257 ops/ms
Halffloat256Vector.ASIN 1024 thrpt 2 61.191 ops/ms
Halffloat256Vector.ATAN 1024 thrpt 2 40.815 ops/ms
Halffloat256Vector.ATAN2 1024 thrpt 2 28.224 ops/ms
Halffloat256Vector.CBRT 1024 thrpt 2 43.547 ops/ms
Halffloat256Vector.COS 1024 thrpt 2 37.414 ops/ms
Halffloat256Vector.COSH 1024 thrpt 2 46.365 ops/ms
Halffloat256Vector.DIV 1024 thrpt 2 221.924 ops/ms
Halffloat256Vector.DIVMasked 1024 thrpt 2 240.560 ops/ms
Halffloat256Vector.EXP 1024 thrpt 2 52.344 ops/ms
Halffloat256Vector.EXPM1 1024 thrpt 2 48.346 ops/ms
Halffloat256Vector.FMA 1024 thrpt 2 206.324 ops/ms
Halffloat256Vector.FMAMasked 1024 thrpt 2 184.678 ops/ms
Halffloat256Vector.HYPOT 1024 thrpt 2 34.096 ops/ms
Halffloat256Vector.LOG 1024 thrpt 2 40.300 ops/ms
Halffloat256Vector.LOG10 1024 thrpt 2 38.886 ops/ms
Halffloat256Vector.LOG1P 1024 thrpt 2 36.438 ops/ms
Halffloat256Vector.MAX 1024 thrpt 2 266.337 ops/ms
Halffloat256Vector.MAXMasked 1024 thrpt 2 245.518 ops/ms
Halffloat256Vector.MIN 1024 thrpt 2 268.963 ops/ms
Halffloat256Vector.MINMasked 1024 thrpt 2 243.136 ops/ms
Halffloat256Vector.MUL 1024 thrpt 2 264.127 ops/ms
Halffloat256Vector.MULMasked 1024 thrpt 2 251.600 ops/ms
Halffloat256Vector.NEG 1024 thrpt 2 365.486 ops/ms
Halffloat256Vector.NEGMasked 1024 thrpt 2 357.070 ops/ms
Halffloat256Vector.POW 1024 thrpt 2 26.809 ops/ms
Halffloat256Vector.SIN 1024 thrpt 2 34.555 ops/ms
Halffloat256Vector.SINH 1024 thrpt 2 53.779 ops/ms
Halffloat256Vector.SQRT 1024 thrpt 2 130.811 ops/ms
Halffloat256Vector.SQRTMasked 1024 thrpt 2 192.628 ops/ms
Halffloat256Vector.SUB 1024 thrpt 2 262.521 ops/ms
Halffloat256Vector.SUBMasked 1024 thrpt 2 254.578 ops/ms
Halffloat256Vector.TAN 1024 thrpt 2 30.002 ops/ms
Halffloat256Vector.TANH 1024 thrpt 2 55.562 ops/ms
Halffloat256Vector.blend 1024 thrpt 2 28002.356 ops/ms
With optimization:
Benchmark (size) Mode Cnt Score Error Units
Halffloat256Vector.ABS 1024 thrpt 2 24048.638 ops/ms
Halffloat256Vector.ABSMasked 1024 thrpt 2 45085.707 ops/ms
Halffloat256Vector.ACOS 1024 thrpt 2 56.116 ops/ms
Halffloat256Vector.ADD 1024 thrpt 2 19623.250 ops/ms
Halffloat256Vector.ADDMasked 1024 thrpt 2 27462.171 ops/ms
Halffloat256Vector.ASIN 1024 thrpt 2 62.081 ops/ms
Halffloat256Vector.ATAN 1024 thrpt 2 41.352 ops/ms
Halffloat256Vector.ATAN2 1024 thrpt 2 29.173 ops/ms
Halffloat256Vector.CBRT 1024 thrpt 2 39.926 ops/ms
Halffloat256Vector.COS 1024 thrpt 2 37.151 ops/ms
Halffloat256Vector.COSH 1024 thrpt 2 48.309 ops/ms
Halffloat256Vector.DIV 1024 thrpt 2 2805.701 ops/ms
Halffloat256Vector.DIVMasked 1024 thrpt 2 2795.544 ops/ms
Halffloat256Vector.EXP 1024 thrpt 2 55.055 ops/ms
Halffloat256Vector.EXPM1 1024 thrpt 2 50.483 ops/ms
Halffloat256Vector.FMA 1024 thrpt 2 23280.064 ops/ms
Halffloat256Vector.FMAMasked 1024 thrpt 2 21828.932 ops/ms
Halffloat256Vector.HYPOT 1024 thrpt 2 34.266 ops/ms
Halffloat256Vector.LOG 1024 thrpt 2 42.158 ops/ms
Halffloat256Vector.LOG10 1024 thrpt 2 41.335 ops/ms
Halffloat256Vector.LOG1P 1024 thrpt 2 36.291 ops/ms
Halffloat256Vector.MAX 1024 thrpt 2 14960.348 ops/ms
Halffloat256Vector.MAXMasked 1024 thrpt 2 12585.642 ops/ms
Halffloat256Vector.MIN 1024 thrpt 2 14662.769 ops/ms
Halffloat256Vector.MINMasked 1024 thrpt 2 12327.769 ops/ms
Halffloat256Vector.MUL 1024 thrpt 2 27156.965 ops/ms
Halffloat256Vector.MULMasked 1024 thrpt 2 21349.555 ops/ms
Halffloat256Vector.NEG 1024 thrpt 2 24093.711 ops/ms
Halffloat256Vector.NEGMasked 1024 thrpt 2 26889.264 ops/ms
Halffloat256Vector.POW 1024 thrpt 2 27.028 ops/ms
Halffloat256Vector.SIN 1024 thrpt 2 34.280 ops/ms
Halffloat256Vector.SINH 1024 thrpt 2 55.049 ops/ms
Halffloat256Vector.SQRT 1024 thrpt 2 2491.596 ops/ms
Halffloat256Vector.SQRTMasked 1024 thrpt 2 2493.591 ops/ms
Halffloat256Vector.SUB 1024 thrpt 2 29664.499 ops/ms
Halffloat256Vector.SUBMasked 1024 thrpt 2 25384.305 ops/ms
Halffloat256Vector.TAN 1024 thrpt 2 29.754 ops/ms
Halffloat256Vector.TANH 1024 thrpt 2 55.933 ops/ms
Halffloat256Vector.blend 1024 thrpt 2 22681.727 ops/ms
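To make the two tables easier to compare, the ratio of optimized to baseline throughput can be computed per benchmark. The sketch below does that for a handful of rows copied from the tables above; the class name `Fp16Speedup` and the helper `speedup` are hypothetical, and the selection of rows is only illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Fp16Speedup {

    // Ratio of optimized throughput to baseline throughput (both in ops/ms).
    static double speedup(double baseline, double optimized) {
        return optimized / baseline;
    }

    public static void main(String[] args) {
        // Scores copied from the Baseline and optimized tables: {baseline, optimized}.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("ADD",  new double[] {259.029, 19623.250});
        scores.put("MUL",  new double[] {264.127, 27156.965});
        scores.put("DIV",  new double[] {221.924, 2805.701});
        scores.put("FMA",  new double[] {206.324, 23280.064});
        scores.put("SQRT", new double[] {130.811, 2491.596});
        scores.put("ACOS", new double[] {61.402, 56.116});

        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%-5s %7.1fx%n", e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

Both runs report Cnt 2 with no error column, so these ratios are indicative only. Rows whose scores are close in the two tables, such as the transcendental functions, should not be read as either gains or regressions without more iterations.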
What is remaining?
- Functional validation
- Thorough performance validation
- New IR framework-based tests.
- Microbenchmark for FP16-based dotproduct.
I am integrating this PR; the remaining work will be part of the JDK mainline PR pull/28002.
@jatin-bhateja This pull request has not yet been marked as ready for integration.