jdk
jdk copied to clipboard
8314125: RISC-V: implement Base64 intrinsic - encoding
Hi, Can you help to review the patch?
I'm also working a base64 decode instrinsic, but there is some performance regression in some cases, and decode and encode are totally independent with each other, so I will send out review of decode in another pr when I fix the performance regression in it.
Thanks.
Test
benchmarks run on CanVM-K230
I've tried several implementations, respectively with vector group
- m2+m1+scalar
- m2+scalar
- m1+scalar
- pure scalar The best one is combination of m2+m1, it have best performance in all source size.
this implementation (m2+m1)
| Benchmark | (maxNumBytes) | Mode | Cnt | Score -intrinsic | Score + instrinsic, m1+m2 | Error | Units | -intrinsic/+intrinsic |
|---|---|---|---|---|---|---|---|---|
| Base64Encode.testBase64Encode | 1 | avgt | 10 | 86.784 | 86.996 | 0.459 | ns/op | 0.9975631063 |
| Base64Encode.testBase64Encode | 2 | avgt | 10 | 93.603 | 94.026 | 1.081 | ns/op | 0.9955012443 |
| Base64Encode.testBase64Encode | 3 | avgt | 10 | 121.927 | 123.227 | 0.342 | ns/op | 0.989450364 |
| Base64Encode.testBase64Encode | 6 | avgt | 10 | 139.554 | 137.4 | 1.221 | ns/op | 1.015676856 |
| Base64Encode.testBase64Encode | 7 | avgt | 10 | 160.698 | 162.25 | 2.36 | ns/op | 0.9904345146 |
| Base64Encode.testBase64Encode | 9 | avgt | 10 | 161.085 | 153.772 | 1.505 | ns/op | 1.047557423 |
| Base64Encode.testBase64Encode | 10 | avgt | 10 | 187.963 | 174.763 | 1.204 | ns/op | 1.075530862 |
| Base64Encode.testBase64Encode | 48 | avgt | 10 | 405.212 | 199.4 | 6.374 | ns/op | 2.032156469 |
| Base64Encode.testBase64Encode | 512 | avgt | 10 | 3652.555 | 1111.009 | 3.462 | ns/op | 3.287601631 |
| Base64Encode.testBase64Encode | 1000 | avgt | 10 | 7217.187 | 2011.943 | 227.784 | ns/op | 3.587172698 |
| Base64Encode.testBase64Encode | 20000 | avgt | 10 | 135165.706 | 33864.592 | 57.557 | ns/op | 3.991357876 |
vector with only m2
| Benchmark | (maxNumBytes) | Mode | Cnt | Score -intrinsics | Score +intrinsic, m2 | Error | Units | -intrinsic/+intrinsic |
|---|---|---|---|---|---|---|---|---|
| Base64Encode.testBase64Encode | 1 | avgt | 10 | 86.797 | 86.872 | 0.374 | ns/op | 0.9991366608 |
| Base64Encode.testBase64Encode | 2 | avgt | 10 | 93.971 | 94.203 | 1.918 | ns/op | 0.9975372334 |
| Base64Encode.testBase64Encode | 3 | avgt | 10 | 122.074 | 123.978 | 1.009 | ns/op | 0.9846424366 |
| Base64Encode.testBase64Encode | 6 | avgt | 10 | 138.999 | 138.344 | 2.175 | ns/op | 1.004734575 |
| Base64Encode.testBase64Encode | 7 | avgt | 10 | 160.857 | 157.494 | 1.036 | ns/op | 1.021353194 |
| Base64Encode.testBase64Encode | 9 | avgt | 10 | 161.511 | 154.998 | 1.727 | ns/op | 1.042019897 |
| Base64Encode.testBase64Encode | 10 | avgt | 10 | 186.228 | 175.38 | 0.62 | ns/op | 1.061854259 |
| Base64Encode.testBase64Encode | 48 | avgt | 10 | 408.461 | 349.558 | 15.377 | ns/op | 1.168507086 |
| Base64Encode.testBase64Encode | 512 | avgt | 10 | 3679.283 | 1103.717 | 3.911 | ns/op | 3.333538398 |
| Base64Encode.testBase64Encode | 1000 | avgt | 10 | 7206.265 | 1988.927 | 224.732 | ns/op | 3.623192304 |
| Base64Encode.testBase64Encode | 20000 | avgt | 10 | 135695.875 | 33930.292 | 97.85 | ns/op | 3.99925456 |
vector with only m1
| Benchmark | (maxNumBytes) | Mode | Cnt | Score -intrinsic | Score +intrinsic, m1 | Error | Units | -intrinsic/+intrinsic |
|---|---|---|---|---|---|---|---|---|
| Base64Encode.testBase64Encode | 1 | avgt | 10 | 86.837 | 87.137 | 0.527 | ns/op | 0.9965571456 |
| Base64Encode.testBase64Encode | 2 | avgt | 10 | 94.723 | 94.125 | 5.122 | ns/op | 1.006353254 |
| Base64Encode.testBase64Encode | 3 | avgt | 10 | 121.51 | 123.082 | 0.854 | ns/op | 0.9872280268 |
| Base64Encode.testBase64Encode | 6 | avgt | 10 | 139.045 | 137.175 | 0.201 | ns/op | 1.013632222 |
| Base64Encode.testBase64Encode | 7 | avgt | 10 | 161.216 | 159.387 | 2.385 | ns/op | 1.011475214 |
| Base64Encode.testBase64Encode | 9 | avgt | 10 | 160.541 | 154.19 | 1.665 | ns/op | 1.041189442 |
| Base64Encode.testBase64Encode | 10 | avgt | 10 | 184.874 | 174.766 | 5.569 | ns/op | 1.057837337 |
| Base64Encode.testBase64Encode | 48 | avgt | 10 | 405.124 | 199.333 | 1.584 | ns/op | 2.032398047 |
| Base64Encode.testBase64Encode | 512 | avgt | 10 | 3659.335 | 1185.626 | 24.686 | ns/op | 3.086415952 |
| Base64Encode.testBase64Encode | 1000 | avgt | 10 | 7239.269 | 2164.709 | 1022.367 | ns/op | 3.344222711 |
| Base64Encode.testBase64Encode | 20000 | avgt | 10 | 135048.828 | 38248.645 | 319.978 | ns/op | 3.530813392 |
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8314125: RISC-V: implement Base64 intrinsic - encoding (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19973/head:pull/19973
$ git checkout pull/19973
Update a local copy of the PR:
$ git checkout pull/19973
$ git pull https://git.openjdk.org/jdk.git pull/19973/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19973
View PR using the GUI difftool:
$ git pr show -t 19973
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19973.diff
Webrev
:wave: Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@Hamlin-Li This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8314125: RISC-V: implement Base64 intrinsic - encoding
Reviewed-by: fyang, luhenry
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 59 new commits pushed to the master branch:
- 0c1e9111d226b601236b9826e27ecc67a8b625fb: 8338019: Fix simple -Wzero-as-null-pointer-constant warnings in riscv code
- 55c509708e9b89a7609fd41b6e5a271f250bbacd: 8337331: crash: pinned virtual thread will lead to jvm crash when running with the javaagent option
- 9f08a01cb6ebb08f67749aabdff4efaedfaf3228: 8338010: WB_IsFrameDeoptimized miss ResourceMark
- e7a0b5b09bcfcd8b09667e51ec588e206f0652ff: 8334780: Crash: assert(h_array_list.not_null()) failed: invariant
- bfb75b96266f4c5840e2edea13f9aa2c801518cf: 8336926: jdk/internal/util/ReferencedKeyTest.java can fail with ConcurrentModificationException
- 53c9f037fb2805aab5ab9729166ce7d5bc7a77f9: 8336849: Remove .llvm_addrsig section from JDK/VM static libraries (.a files)
- 9695f09581bac856ad97943cca15c65dc21d2adf: 8337683: Fix -Wconversion problem with arrayOop.hpp
- 78773b1769922611318243b6680d59518ed4e015: 8338015: Fix "Java Java" typo in package info file of java.lang.classfile
- 6a9a867d645b8fe86f4ca2b04a43bf5aa8f9d487: 8337994: [REDO] Native memory leak when not recording any events
- 4b740d87ee50ba49305add4aae6490bebb6963ba: 8225209: jdk/jfr/event/compiler/TestCodeSweeper.java fails
- ... and 49 more: https://git.openjdk.org/jdk/compare/be34730fb4e6818ac13c46b34b735c967351e5cd...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@Hamlin-Li The following label will be automatically applied to this pull request:
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 08: Full - Incremental (7bc6b8e4)
- 07: Full - Incremental (5f1288ae)
- 06: Full (111cd370)
- 05: Full (e49494a2)
- 04: Full - Incremental (736f5f8b)
- 03: Full - Incremental (8645a6a1)
- 02: Full - Incremental (264b354b)
- 01: Full - Incremental (cf732984)
- 00: Full (fc32d9fa)
with pure scalar impelmentation, it also bring some performance imrpovement in all source size, so also enable the intrinsic when rvv is not supported.
performance data
| Benchmark | (maxNumBytes) | Mode | Cnt | Score -intrinsic | Score +instrinsic, scalar | Error | Units | Perf opt |
|---|---|---|---|---|---|---|---|---|
| Base64Encode.testBase64Encode | 1 | avgt | 10 | 86.784 | 86.75 | 0.38 | ns/op | 1 |
| Base64Encode.testBase64Encode | 2 | avgt | 10 | 93.71 | 93.824 | 1.954 | ns/op | 0.999 |
| Base64Encode.testBase64Encode | 3 | avgt | 10 | 121.824 | 123.487 | 0.559 | ns/op | 0.987 |
| Base64Encode.testBase64Encode | 6 | avgt | 10 | 138.984 | 137.697 | 0.273 | ns/op | 1.009 |
| Base64Encode.testBase64Encode | 7 | avgt | 10 | 161.243 | 157.696 | 0.875 | ns/op | 1.022 |
| Base64Encode.testBase64Encode | 9 | avgt | 10 | 169.724 | 155.223 | 1.908 | ns/op | 1.093 |
| Base64Encode.testBase64Encode | 10 | avgt | 10 | 185.92 | 176.339 | 5.875 | ns/op | 1.054 |
| Base64Encode.testBase64Encode | 48 | avgt | 10 | 408.467 | 347.269 | 1.799 | ns/op | 1.176 |
| Base64Encode.testBase64Encode | 512 | avgt | 10 | 3665.34 | 2718.442 | 26.954 | ns/op | 1.348 |
| Base64Encode.testBase64Encode | 1000 | avgt | 10 | 7022.025 | 5290.003 | 33.216 | ns/op | 1.327 |
| Base64Encode.testBase64Encode | 20000 | avgt | 10 | 135819.7 | 101988.94 | 2209.887 | ns/op | 1.332 |
@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: https://github.com/WojciechMula/base64simd/issues/9 (see issue for benchmark, and repo for code)
I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.
I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" (encode + lookup_pshufb_improved), as this only uses instructions with predictable performance, except for LMUL=1 vrgather.vv, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)
For decode, I'm not really happy with any implementation. Yours uses multiple vluxei8 + vlsege4 + vssege3, the others from base64simd use LMUL=8 vrgather.vv, which will take LMUL^2=8^2=64 times the amount of cycles a LMUL=1 vrgather.vv takes (on sane implementations, see my reasoning). As I said, I'm fairly certain LMUL=1 vrgather.vv will have to be relatively fast, so if I had to choose, I'd prefer my implementation that uses LMUL=1 vrgather.vvs + vlsege4 + vssege3, but using vsseg* is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now)
Thanks a lot for sharing the information.
@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: WojciechMula/base64simd#9 (see issue for benchmark, and repo for code)
Agree, I think your observation is right.
I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.
I don't know how that (it's very slow on the ooo) happens and currently I don't have these types of machine. And it's bit strange that they are very slow with those instructions, could it be that they are not fully optimized for those instructions on these machines?
I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" (
encode+lookup_pshufb_improved), as this only uses instructions with predictable performance, except for LMUL=1vrgather.vv, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)
My current tests on k230 shows that m2+m1+scalar bring the best performance on all size values, I'd like to see test data on other hardwares if someone can help test and get the data. And, for current implementation it's easy to adjust lmul value in the algorithm. So I'm flexiable to either lmul value based the test data.
For decode, I'm not really happy with any implementation. Yours uses multiple
vluxei8+vlsege4+vssege3, the others from base64simd use LMUL=8vrgather.vv, which will takeLMUL^2=8^2=64times the amount of cycles a LMUL=1vrgather.vvtakes (on sane implementations, see my reasoning). As I said, I'm fairly certain LMUL=1vrgather.vvwill have to be relatively fast, so if I had to choose, I'd prefer my implementation that uses LMUL=1vrgather.vvs +vlsege4+vssege3, but usingvsseg*is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now)
I import your implementation into jdk, but compared to my current decode implementation, it brings much regression. Let's discuss about decode in https://github.com/openjdk/jdk/pull/20026.
Hi, Can I get another review of this pr? Thanks!
@Hamlin-Li this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout baes64-encode-integrated
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Hi, will take a look. BTW: Have you resolved the performance issue of base64 decode instrinsic?
Hi, will take a look.
Thanks!
BTW: Have you resolved the performance issue of base64 decode instrinsic?
It only bring regression in MIME cases, I did some investigation, but not found root cause yet.
Hey, Is anyone available to take another look? Thanks!
Hi, Sorry for the late reply. Severa comments remain after a second look. Thanks.
No worry. :) Thanks for reviewing.
@RealFYang @luhenry Thanks for your reviewing.
/integrate
Going to push as commit c37e8638c98cb4516569304e9a0ab477affb0641.
Since your change was applied there have been 66 commits pushed to the master branch:
- 6ebd5d74d57b334e7cf0b1282d7bb469a56fb3d6: 8338036: Serial: Remove Generation::update_counters
- 53fce38a3ff8700fef640fffc066efc21ff9c25f: 8338062: JFR: Remove TestStartDuration.java and TestStartName.java from ProblemList.txt
- f74109bd178c92a9dff1ca6fce03b25f51a0384f: 8337939: ZGC: Make assertions and checks less convoluted and explicit
- 82d5768c1bdccfaf97a41f32a8bfcfd03a0fb378: 8337840: Remove redundant null check in ObjectOutputStream.writeProxyDesc
- c01f53ac2dab1d4d2cd1e4d45a67f9373d4a9c7e: 8337876: [IR Framework] Add support for IR tests with @Stable
- 00aac4097abd3c5e6144734cfd44228bc31892fb: 8338058: map_or_reserve_memory_aligned Windows enhance remap assertion
- 9ab8c6b9ba90ffd12600a250c8704571e9feb78d: 8335130: The test "javax/swing/plaf/synth/ComponentsOrientationSupport/5033822/bug5033822.java" fails because the background color of the tabs is displayed incorrectly.
- 0c1e9111d226b601236b9826e27ecc67a8b625fb: 8338019: Fix simple -Wzero-as-null-pointer-constant warnings in riscv code
- 55c509708e9b89a7609fd41b6e5a271f250bbacd: 8337331: crash: pinned virtual thread will lead to jvm crash when running with the javaagent option
- 9f08a01cb6ebb08f67749aabdff4efaedfaf3228: 8338010: WB_IsFrameDeoptimized miss ResourceMark
- ... and 56 more: https://git.openjdk.org/jdk/compare/be34730fb4e6818ac13c46b34b735c967351e5cd...master
Your commit was automatically rebased without conflicts.
@Hamlin-Li Pushed as commit c37e8638c98cb4516569304e9a0ab477affb0641.
:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.