jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8314125: RISC-V: implement Base64 intrinsic - encoding

Open Hamlin-Li opened this issue 1 year ago • 5 comments

Hi, Can you help to review the patch?

I'm also working a base64 decode instrinsic, but there is some performance regression in some cases, and decode and encode are totally independent with each other, so I will send out review of decode in another pr when I fix the performance regression in it.

Thanks.

Test

benchmarks run on CanVM-K230

I've tried several implementations, respectively with vector group

  • m2+m1+scalar
  • m2+scalar
  • m1+scalar
  • pure scalar The best one is combination of m2+m1, it have best performance in all source size.

this implementation (m2+m1)

Benchmark (maxNumBytes) Mode Cnt Score -intrinsic Score + instrinsic, m1+m2 Error Units -intrinsic/+intrinsic
Base64Encode.testBase64Encode 1 avgt 10 86.784 86.996 0.459 ns/op 0.9975631063
Base64Encode.testBase64Encode 2 avgt 10 93.603 94.026 1.081 ns/op 0.9955012443
Base64Encode.testBase64Encode 3 avgt 10 121.927 123.227 0.342 ns/op 0.989450364
Base64Encode.testBase64Encode 6 avgt 10 139.554 137.4 1.221 ns/op 1.015676856
Base64Encode.testBase64Encode 7 avgt 10 160.698 162.25 2.36 ns/op 0.9904345146
Base64Encode.testBase64Encode 9 avgt 10 161.085 153.772 1.505 ns/op 1.047557423
Base64Encode.testBase64Encode 10 avgt 10 187.963 174.763 1.204 ns/op 1.075530862
Base64Encode.testBase64Encode 48 avgt 10 405.212 199.4 6.374 ns/op 2.032156469
Base64Encode.testBase64Encode 512 avgt 10 3652.555 1111.009 3.462 ns/op 3.287601631
Base64Encode.testBase64Encode 1000 avgt 10 7217.187 2011.943 227.784 ns/op 3.587172698
Base64Encode.testBase64Encode 20000 avgt 10 135165.706 33864.592 57.557 ns/op 3.991357876

vector with only m2

Benchmark (maxNumBytes) Mode Cnt Score -intrinsics Score +intrinsic, m2 Error Units -intrinsic/+intrinsic
Base64Encode.testBase64Encode 1 avgt 10 86.797 86.872 0.374 ns/op 0.9991366608
Base64Encode.testBase64Encode 2 avgt 10 93.971 94.203 1.918 ns/op 0.9975372334
Base64Encode.testBase64Encode 3 avgt 10 122.074 123.978 1.009 ns/op 0.9846424366
Base64Encode.testBase64Encode 6 avgt 10 138.999 138.344 2.175 ns/op 1.004734575
Base64Encode.testBase64Encode 7 avgt 10 160.857 157.494 1.036 ns/op 1.021353194
Base64Encode.testBase64Encode 9 avgt 10 161.511 154.998 1.727 ns/op 1.042019897
Base64Encode.testBase64Encode 10 avgt 10 186.228 175.38 0.62 ns/op 1.061854259
Base64Encode.testBase64Encode 48 avgt 10 408.461 349.558 15.377 ns/op 1.168507086
Base64Encode.testBase64Encode 512 avgt 10 3679.283 1103.717 3.911 ns/op 3.333538398
Base64Encode.testBase64Encode 1000 avgt 10 7206.265 1988.927 224.732 ns/op 3.623192304
Base64Encode.testBase64Encode 20000 avgt 10 135695.875 33930.292 97.85 ns/op 3.99925456

vector with only m1

Benchmark (maxNumBytes) Mode Cnt Score -intrinsic Score +intrinsic, m1 Error Units -intrinsic/+intrinsic
Base64Encode.testBase64Encode 1 avgt 10 86.837 87.137 0.527 ns/op 0.9965571456
Base64Encode.testBase64Encode 2 avgt 10 94.723 94.125 5.122 ns/op 1.006353254
Base64Encode.testBase64Encode 3 avgt 10 121.51 123.082 0.854 ns/op 0.9872280268
Base64Encode.testBase64Encode 6 avgt 10 139.045 137.175 0.201 ns/op 1.013632222
Base64Encode.testBase64Encode 7 avgt 10 161.216 159.387 2.385 ns/op 1.011475214
Base64Encode.testBase64Encode 9 avgt 10 160.541 154.19 1.665 ns/op 1.041189442
Base64Encode.testBase64Encode 10 avgt 10 184.874 174.766 5.569 ns/op 1.057837337
Base64Encode.testBase64Encode 48 avgt 10 405.124 199.333 1.584 ns/op 2.032398047
Base64Encode.testBase64Encode 512 avgt 10 3659.335 1185.626 24.686 ns/op 3.086415952
Base64Encode.testBase64Encode 1000 avgt 10 7239.269 2164.709 1022.367 ns/op 3.344222711
Base64Encode.testBase64Encode 20000 avgt 10 135048.828 38248.645 319.978 ns/op 3.530813392

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8314125: RISC-V: implement Base64 intrinsic - encoding (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19973/head:pull/19973
$ git checkout pull/19973

Update a local copy of the PR:
$ git checkout pull/19973
$ git pull https://git.openjdk.org/jdk.git pull/19973/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19973

View PR using the GUI difftool:
$ git pr show -t 19973

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19973.diff

Webrev

Link to Webrev Comment

Hamlin-Li avatar Jul 01 '24 14:07 Hamlin-Li

:wave: Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Jul 01 '24 14:07 bridgekeeper[bot]

@Hamlin-Li This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8314125: RISC-V: implement Base64 intrinsic - encoding

Reviewed-by: fyang, luhenry

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 59 new commits pushed to the master branch:

  • 0c1e9111d226b601236b9826e27ecc67a8b625fb: 8338019: Fix simple -Wzero-as-null-pointer-constant warnings in riscv code
  • 55c509708e9b89a7609fd41b6e5a271f250bbacd: 8337331: crash: pinned virtual thread will lead to jvm crash when running with the javaagent option
  • 9f08a01cb6ebb08f67749aabdff4efaedfaf3228: 8338010: WB_IsFrameDeoptimized miss ResourceMark
  • e7a0b5b09bcfcd8b09667e51ec588e206f0652ff: 8334780: Crash: assert(h_array_list.not_null()) failed: invariant
  • bfb75b96266f4c5840e2edea13f9aa2c801518cf: 8336926: jdk/internal/util/ReferencedKeyTest.java can fail with ConcurrentModificationException
  • 53c9f037fb2805aab5ab9729166ce7d5bc7a77f9: 8336849: Remove .llvm_addrsig section from JDK/VM static libraries (.a files)
  • 9695f09581bac856ad97943cca15c65dc21d2adf: 8337683: Fix -Wconversion problem with arrayOop.hpp
  • 78773b1769922611318243b6680d59518ed4e015: 8338015: Fix "Java Java" typo in package info file of java.lang.classfile
  • 6a9a867d645b8fe86f4ca2b04a43bf5aa8f9d487: 8337994: [REDO] Native memory leak when not recording any events
  • 4b740d87ee50ba49305add4aae6490bebb6963ba: 8225209: jdk/jfr/event/compiler/TestCodeSweeper.java fails
  • ... and 49 more: https://git.openjdk.org/jdk/compare/be34730fb4e6818ac13c46b34b735c967351e5cd...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Jul 01 '24 14:07 openjdk[bot]

@Hamlin-Li The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Jul 01 '24 14:07 openjdk[bot]

with pure scalar impelmentation, it also bring some performance imrpovement in all source size, so also enable the intrinsic when rvv is not supported.

performance data

Benchmark (maxNumBytes) Mode Cnt Score -intrinsic Score +instrinsic, scalar Error Units Perf opt
Base64Encode.testBase64Encode 1 avgt 10 86.784 86.75 0.38 ns/op 1
Base64Encode.testBase64Encode 2 avgt 10 93.71 93.824 1.954 ns/op 0.999
Base64Encode.testBase64Encode 3 avgt 10 121.824 123.487 0.559 ns/op 0.987
Base64Encode.testBase64Encode 6 avgt 10 138.984 137.697 0.273 ns/op 1.009
Base64Encode.testBase64Encode 7 avgt 10 161.243 157.696 0.875 ns/op 1.022
Base64Encode.testBase64Encode 9 avgt 10 169.724 155.223 1.908 ns/op 1.093
Base64Encode.testBase64Encode 10 avgt 10 185.92 176.339 5.875 ns/op 1.054
Base64Encode.testBase64Encode 48 avgt 10 408.467 347.269 1.799 ns/op 1.176
Base64Encode.testBase64Encode 512 avgt 10 3665.34 2718.442 26.954 ns/op 1.348
Base64Encode.testBase64Encode 1000 avgt 10 7022.025 5290.003 33.216 ns/op 1.327
Base64Encode.testBase64Encode 20000 avgt 10 135819.7 101988.94 2209.887 ns/op 1.332

Hamlin-Li avatar Jul 01 '24 15:07 Hamlin-Li

@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: https://github.com/WojciechMula/base64simd/issues/9 (see issue for benchmark, and repo for code)

I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.

I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" (encode + lookup_pshufb_improved), as this only uses instructions with predictable performance, except for LMUL=1 vrgather.vv, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)

For decode, I'm not really happy with any implementation. Yours uses multiple vluxei8 + vlsege4 + vssege3, the others from base64simd use LMUL=8 vrgather.vv, which will take LMUL^2=8^2=64 times the amount of cycles a LMUL=1 vrgather.vv takes (on sane implementations, see my reasoning). As I said, I'm fairly certain LMUL=1 vrgather.vv will have to be relatively fast, so if I had to choose, I'd prefer my implementation that uses LMUL=1 vrgather.vvs + vlsege4 + vssege3, but using vsseg* is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now)

camel-cdr avatar Jul 04 '24 17:07 camel-cdr

Thanks a lot for sharing the information.

@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: WojciechMula/base64simd#9 (see issue for benchmark, and repo for code)

Agree, I think your observation is right.

I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.

I don't know how that (it's very slow on the ooo) happens and currently I don't have these types of machine. And it's bit strange that they are very slow with those instructions, could it be that they are not fully optimized for those instructions on these machines?

I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" (encode + lookup_pshufb_improved), as this only uses instructions with predictable performance, except for LMUL=1 vrgather.vv, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)

My current tests on k230 shows that m2+m1+scalar bring the best performance on all size values, I'd like to see test data on other hardwares if someone can help test and get the data. And, for current implementation it's easy to adjust lmul value in the algorithm. So I'm flexiable to either lmul value based the test data.

Hamlin-Li avatar Jul 05 '24 13:07 Hamlin-Li

For decode, I'm not really happy with any implementation. Yours uses multiple vluxei8 + vlsege4 + vssege3, the others from base64simd use LMUL=8 vrgather.vv, which will take LMUL^2=8^2=64 times the amount of cycles a LMUL=1 vrgather.vv takes (on sane implementations, see my reasoning). As I said, I'm fairly certain LMUL=1 vrgather.vv will have to be relatively fast, so if I had to choose, I'd prefer my implementation that uses LMUL=1 vrgather.vvs + vlsege4 + vssege3, but using vsseg* is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now)

I import your implementation into jdk, but compared to my current decode implementation, it brings much regression. Let's discuss about decode in https://github.com/openjdk/jdk/pull/20026.

Hamlin-Li avatar Jul 05 '24 13:07 Hamlin-Li

Hi, Can I get another review of this pr? Thanks!

Hamlin-Li avatar Jul 22 '24 07:07 Hamlin-Li

@Hamlin-Li this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout baes64-encode-integrated
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Jul 26 '24 07:07 openjdk[bot]

Hi, will take a look. BTW: Have you resolved the performance issue of base64 decode instrinsic?

RealFYang avatar Jul 29 '24 05:07 RealFYang

Hi, will take a look.

Thanks!

BTW: Have you resolved the performance issue of base64 decode instrinsic?

It only bring regression in MIME cases, I did some investigation, but not found root cause yet.

Hamlin-Li avatar Jul 29 '24 14:07 Hamlin-Li

Hey, Is anyone available to take another look? Thanks!

Hamlin-Li avatar Aug 08 '24 12:08 Hamlin-Li

Hi, Sorry for the late reply. Severa comments remain after a second look. Thanks.

No worry. :) Thanks for reviewing.

Hamlin-Li avatar Aug 08 '24 20:08 Hamlin-Li

@RealFYang @luhenry Thanks for your reviewing.

/integrate

Hamlin-Li avatar Aug 09 '24 09:08 Hamlin-Li

Going to push as commit c37e8638c98cb4516569304e9a0ab477affb0641. Since your change was applied there have been 66 commits pushed to the master branch:

  • 6ebd5d74d57b334e7cf0b1282d7bb469a56fb3d6: 8338036: Serial: Remove Generation::update_counters
  • 53fce38a3ff8700fef640fffc066efc21ff9c25f: 8338062: JFR: Remove TestStartDuration.java and TestStartName.java from ProblemList.txt
  • f74109bd178c92a9dff1ca6fce03b25f51a0384f: 8337939: ZGC: Make assertions and checks less convoluted and explicit
  • 82d5768c1bdccfaf97a41f32a8bfcfd03a0fb378: 8337840: Remove redundant null check in ObjectOutputStream.writeProxyDesc
  • c01f53ac2dab1d4d2cd1e4d45a67f9373d4a9c7e: 8337876: [IR Framework] Add support for IR tests with @Stable
  • 00aac4097abd3c5e6144734cfd44228bc31892fb: 8338058: map_or_reserve_memory_aligned Windows enhance remap assertion
  • 9ab8c6b9ba90ffd12600a250c8704571e9feb78d: 8335130: The test "javax/swing/plaf/synth/ComponentsOrientationSupport/5033822/bug5033822.java" fails because the background color of the tabs is displayed incorrectly.
  • 0c1e9111d226b601236b9826e27ecc67a8b625fb: 8338019: Fix simple -Wzero-as-null-pointer-constant warnings in riscv code
  • 55c509708e9b89a7609fd41b6e5a271f250bbacd: 8337331: crash: pinned virtual thread will lead to jvm crash when running with the javaagent option
  • 9f08a01cb6ebb08f67749aabdff4efaedfaf3228: 8338010: WB_IsFrameDeoptimized miss ResourceMark
  • ... and 56 more: https://git.openjdk.org/jdk/compare/be34730fb4e6818ac13c46b34b735c967351e5cd...master

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar Aug 09 '24 09:08 openjdk[bot]

@Hamlin-Li Pushed as commit c37e8638c98cb4516569304e9a0ab477affb0641.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar Aug 09 '24 09:08 openjdk[bot]