jdk 8314125: RISC-V: implement Base64 intrinsic

Hi, Can you help to review the patch?

I'm also working a base64 decode instrinsic, but there is some performance regression in some cases, and decode and encode are totally independent with each other, so I will send out review of decode in another pr when I fix the performance regression in it.

Thanks.

Test

benchmarks run on CanVM-K230

I've tried several implementations, respectively with vector group

m2+m1+scalar
m2+scalar
m1+scalar
pure scalar The best one is combination of m2+m1, it have best performance in all source size.

this implementation (m2+m1)

Benchmark	(maxNumBytes)	Mode	Cnt	Score -intrinsic	Score + instrinsic, m1+m2	Error	Units	-intrinsic/+intrinsic
Base64Encode.testBase64Encode	1	avgt	10	86.784	86.996	0.459	ns/op	0.9975631063
Base64Encode.testBase64Encode	2	avgt	10	93.603	94.026	1.081	ns/op	0.9955012443
Base64Encode.testBase64Encode	3	avgt	10	121.927	123.227	0.342	ns/op	0.989450364
Base64Encode.testBase64Encode	6	avgt	10	139.554	137.4	1.221	ns/op	1.015676856
Base64Encode.testBase64Encode	7	avgt	10	160.698	162.25	2.36	ns/op	0.9904345146
Base64Encode.testBase64Encode	9	avgt	10	161.085	153.772	1.505	ns/op	1.047557423
Base64Encode.testBase64Encode	10	avgt	10	187.963	174.763	1.204	ns/op	1.075530862
Base64Encode.testBase64Encode	48	avgt	10	405.212	199.4	6.374	ns/op	2.032156469
Base64Encode.testBase64Encode	512	avgt	10	3652.555	1111.009	3.462	ns/op	3.287601631
Base64Encode.testBase64Encode	1000	avgt	10	7217.187	2011.943	227.784	ns/op	3.587172698
Base64Encode.testBase64Encode	20000	avgt	10	135165.706	33864.592	57.557	ns/op	3.991357876

vector with only m2

Benchmark	(maxNumBytes)	Mode	Cnt	Score -intrinsics	Score +intrinsic, m2	Error	Units	-intrinsic/+intrinsic
Base64Encode.testBase64Encode	1	avgt	10	86.797	86.872	0.374	ns/op	0.9991366608
Base64Encode.testBase64Encode	2	avgt	10	93.971	94.203	1.918	ns/op	0.9975372334
Base64Encode.testBase64Encode	3	avgt	10	122.074	123.978	1.009	ns/op	0.9846424366
Base64Encode.testBase64Encode	6	avgt	10	138.999	138.344	2.175	ns/op	1.004734575
Base64Encode.testBase64Encode	7	avgt	10	160.857	157.494	1.036	ns/op	1.021353194
Base64Encode.testBase64Encode	9	avgt	10	161.511	154.998	1.727	ns/op	1.042019897
Base64Encode.testBase64Encode	10	avgt	10	186.228	175.38	0.62	ns/op	1.061854259
Base64Encode.testBase64Encode	48	avgt	10	408.461	349.558	15.377	ns/op	1.168507086
Base64Encode.testBase64Encode	512	avgt	10	3679.283	1103.717	3.911	ns/op	3.333538398
Base64Encode.testBase64Encode	1000	avgt	10	7206.265	1988.927	224.732	ns/op	3.623192304
Base64Encode.testBase64Encode	20000	avgt	10	135695.875	33930.292	97.85	ns/op	3.99925456

vector with only m1

Benchmark	(maxNumBytes)	Mode	Cnt	Score -intrinsic	Score +intrinsic, m1	Error	Units	-intrinsic/+intrinsic
Base64Encode.testBase64Encode	1	avgt	10	86.837	87.137	0.527	ns/op	0.9965571456
Base64Encode.testBase64Encode	2	avgt	10	94.723	94.125	5.122	ns/op	1.006353254
Base64Encode.testBase64Encode	3	avgt	10	121.51	123.082	0.854	ns/op	0.9872280268
Base64Encode.testBase64Encode	6	avgt	10	139.045	137.175	0.201	ns/op	1.013632222
Base64Encode.testBase64Encode	7	avgt	10	161.216	159.387	2.385	ns/op	1.011475214
Base64Encode.testBase64Encode	9	avgt	10	160.541	154.19	1.665	ns/op	1.041189442
Base64Encode.testBase64Encode	10	avgt	10	184.874	174.766	5.569	ns/op	1.057837337
Base64Encode.testBase64Encode	48	avgt	10	405.124	199.333	1.584	ns/op	2.032398047
Base64Encode.testBase64Encode	512	avgt	10	3659.335	1185.626	24.686	ns/op	3.086415952
Base64Encode.testBase64Encode	1000	avgt	10	7239.269	2164.709	1022.367	ns/op	3.344222711
Base64Encode.testBase64Encode	20000	avgt	10	135048.828	38248.645	319.978	ns/op	3.530813392

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8314125: RISC-V: implement Base64 intrinsic - encoding (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19973/head:pull/19973
$ git checkout pull/19973

Update a local copy of the PR:
$ git checkout pull/19973
$ git pull https://git.openjdk.org/jdk.git pull/19973/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19973

View PR using the GUI difftool:
$ git pr show -t 19973

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19973.diff

Webrev

Link to Webrev Comment

Jul 01 '24 14:07 Hamlin-Li

:wave: Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Jul 01 '24 14:07 bridgekeeper[bot]

@Hamlin-Li This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8314125: RISC-V: implement Base64 intrinsic - encoding

Reviewed-by: fyang, luhenry

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 59 new commits pushed to the master branch:

0c1e9111d226b601236b9826e27ecc67a8b625fb: 8338019: Fix simple -Wzero-as-null-pointer-constant warnings in riscv code
55c509708e9b89a7609fd41b6e5a271f250bbacd: 8337331: crash: pinned virtual thread will lead to jvm crash when running with the javaagent option
9f08a01cb6ebb08f67749aabdff4efaedfaf3228: 8338010: WB_IsFrameDeoptimized miss ResourceMark
e7a0b5b09bcfcd8b09667e51ec588e206f0652ff: 8334780: Crash: assert(h_array_list.not_null()) failed: invariant
bfb75b96266f4c5840e2edea13f9aa2c801518cf: 8336926: jdk/internal/util/ReferencedKeyTest.java can fail with ConcurrentModificationException
53c9f037fb2805aab5ab9729166ce7d5bc7a77f9: 8336849: Remove .llvm_addrsig section from JDK/VM static libraries (.a files)
9695f09581bac856ad97943cca15c65dc21d2adf: 8337683: Fix -Wconversion problem with arrayOop.hpp
78773b1769922611318243b6680d59518ed4e015: 8338015: Fix "Java Java" typo in package info file of java.lang.classfile
6a9a867d645b8fe86f4ca2b04a43bf5aa8f9d487: 8337994: [REDO] Native memory leak when not recording any events
4b740d87ee50ba49305add4aae6490bebb6963ba: 8225209: jdk/jfr/event/compiler/TestCodeSweeper.java fails
... and 49 more: https://git.openjdk.org/jdk/compare/be34730fb4e6818ac13c46b34b735c967351e5cd...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

Jul 01 '24 14:07 openjdk[bot]

@Hamlin-Li The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Jul 01 '24 14:07 openjdk[bot]

Webrevs

Jul 01 '24 14:07 mlbridge[bot]

with pure scalar impelmentation, it also bring some performance imrpovement in all source size, so also enable the intrinsic when rvv is not supported.

performance data

Benchmark	(maxNumBytes)	Mode	Cnt	Score -intrinsic	Score +instrinsic, scalar	Error	Units	Perf opt
Base64Encode.testBase64Encode	1	avgt	10	86.784	86.75	0.38	ns/op	1
Base64Encode.testBase64Encode	2	avgt	10	93.71	93.824	1.954	ns/op	0.999
Base64Encode.testBase64Encode	3	avgt	10	121.824	123.487	0.559	ns/op	0.987
Base64Encode.testBase64Encode	6	avgt	10	138.984	137.697	0.273	ns/op	1.009
Base64Encode.testBase64Encode	7	avgt	10	161.243	157.696	0.875	ns/op	1.022
Base64Encode.testBase64Encode	9	avgt	10	169.724	155.223	1.908	ns/op	1.093
Base64Encode.testBase64Encode	10	avgt	10	185.92	176.339	5.875	ns/op	1.054
Base64Encode.testBase64Encode	48	avgt	10	408.467	347.269	1.799	ns/op	1.176
Base64Encode.testBase64Encode	512	avgt	10	3665.34	2718.442	26.954	ns/op	1.348
Base64Encode.testBase64Encode	1000	avgt	10	7022.025	5290.003	33.216	ns/op	1.327
Base64Encode.testBase64Encode	20000	avgt	10	135819.7	101988.94	2209.887	ns/op	1.332

Jul 01 '24 15:07 Hamlin-Li

@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: https://github.com/WojciechMula/base64simd/issues/9 (see issue for benchmark, and repo for code)

I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.

I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" (encode + lookup_pshufb_improved), as this only uses instructions with predictable performance, except for LMUL=1 vrgather.vv, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)

For decode, I'm not really happy with any implementation. Yours uses multiple vluxei8 + vlsege4 + vssege3, the others from base64simd use LMUL=8 vrgather.vv, which will take LMUL^2=8^2=64 times the amount of cycles a LMUL=1 vrgather.vv takes (on sane implementations, see my reasoning). As I said, I'm fairly certain LMUL=1 vrgather.vv will have to be relatively fast, so if I had to choose, I'd prefer my implementation that uses LMUL=1 vrgather.vvs + vlsege4 + vssege3, but using vsseg* is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now)

Jul 04 '24 17:07 camel-cdr

Thanks a lot for sharing the information.

@Hamlin-Li Hi, we looked at RVV base64 encode/decode for another project before, however there wasn't one implementation that obviously was best across the different hardware: WojciechMula/base64simd#9 (see issue for benchmark, and repo for code)

Agree, I think your observation is right.

I think we currently can't tell how, the complex load/stores will perform on future hardware. Segmented load/stores for example are quite fast on the current in-order RVV 1.0 boards, however it's very slow on the ooo C910, and XiangShan (current master, may change) cores (SiFive P670 LLVM-MCA indicates that it might also be slow on that core). I'm not sure if that is because they are ooo and that gives you additional constraints, but I wouldn't rely on it just yet.

I don't know how that (it's very slow on the ooo) happens and currently I don't have these types of machine. And it's bit strange that they are very slow with those instructions, could it be that they are not fully optimized for those instructions on these machines?

I think the safest bet for encode would be for now "RISC-V RVV (LMUL=1)" (encode + lookup_pshufb_improved), as this only uses instructions with predictable performance, except for LMUL=1 vrgather.vv, which I think will need to be fast on any application class core. (See x86 equivalent vperm*)

My current tests on k230 shows that m2+m1+scalar bring the best performance on all size values, I'd like to see test data on other hardwares if someone can help test and get the data. And, for current implementation it's easy to adjust lmul value in the algorithm. So I'm flexiable to either lmul value based the test data.

Jul 05 '24 13:07 Hamlin-Li

For decode, I'm not really happy with any implementation. Yours uses multiple vluxei8 + vlsege4 + vssege3, the others from base64simd use LMUL=8 vrgather.vv, which will take LMUL^2=8^2=64 times the amount of cycles a LMUL=1 vrgather.vv takes (on sane implementations, see my reasoning). As I said, I'm fairly certain LMUL=1 vrgather.vv will have to be relatively fast, so if I had to choose, I'd prefer my implementation that uses LMUL=1 vrgather.vvs + vlsege4 + vssege3, but using vsseg* is not ideal. (Note that gcc currently chokes on the register allocation, so you should use clang for now)

I import your implementation into jdk, but compared to my current decode implementation, it brings much regression. Let's discuss about decode in https://github.com/openjdk/jdk/pull/20026.

Jul 05 '24 13:07 Hamlin-Li

Hi, Can I get another review of this pr? Thanks!

Jul 22 '24 07:07 Hamlin-Li

@Hamlin-Li this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout baes64-encode-integrated
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

Jul 26 '24 07:07 openjdk[bot]

Hi, will take a look. BTW: Have you resolved the performance issue of base64 decode instrinsic?

Jul 29 '24 05:07 RealFYang

Hi, will take a look.

Thanks!

BTW: Have you resolved the performance issue of base64 decode instrinsic?

It only bring regression in MIME cases, I did some investigation, but not found root cause yet.

Jul 29 '24 14:07 Hamlin-Li

Hey, Is anyone available to take another look? Thanks!

Aug 08 '24 12:08 Hamlin-Li

Hi, Sorry for the late reply. Severa comments remain after a second look. Thanks.

No worry. :) Thanks for reviewing.

Aug 08 '24 20:08 Hamlin-Li

@RealFYang @luhenry Thanks for your reviewing.

/integrate

Aug 09 '24 09:08 Hamlin-Li

Going to push as commit c37e8638c98cb4516569304e9a0ab477affb0641. Since your change was applied there have been 66 commits pushed to the master branch:

6ebd5d74d57b334e7cf0b1282d7bb469a56fb3d6: 8338036: Serial: Remove Generation::update_counters
53fce38a3ff8700fef640fffc066efc21ff9c25f: 8338062: JFR: Remove TestStartDuration.java and TestStartName.java from ProblemList.txt
f74109bd178c92a9dff1ca6fce03b25f51a0384f: 8337939: ZGC: Make assertions and checks less convoluted and explicit
82d5768c1bdccfaf97a41f32a8bfcfd03a0fb378: 8337840: Remove redundant null check in ObjectOutputStream.writeProxyDesc
c01f53ac2dab1d4d2cd1e4d45a67f9373d4a9c7e: 8337876: [IR Framework] Add support for IR tests with @Stable
00aac4097abd3c5e6144734cfd44228bc31892fb: 8338058: map_or_reserve_memory_aligned Windows enhance remap assertion
9ab8c6b9ba90ffd12600a250c8704571e9feb78d: 8335130: The test "javax/swing/plaf/synth/ComponentsOrientationSupport/5033822/bug5033822.java" fails because the background color of the tabs is displayed incorrectly.
0c1e9111d226b601236b9826e27ecc67a8b625fb: 8338019: Fix simple -Wzero-as-null-pointer-constant warnings in riscv code
55c509708e9b89a7609fd41b6e5a271f250bbacd: 8337331: crash: pinned virtual thread will lead to jvm crash when running with the javaagent option
9f08a01cb6ebb08f67749aabdff4efaedfaf3228: 8338010: WB_IsFrameDeoptimized miss ResourceMark
... and 56 more: https://git.openjdk.org/jdk/compare/be34730fb4e6818ac13c46b34b735c967351e5cd...master

Your commit was automatically rebased without conflicts.

Aug 09 '24 09:08 openjdk[bot]

@Hamlin-Li Pushed as commit c37e8638c98cb4516569304e9a0ab477affb0641.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Aug 09 '24 09:08 openjdk[bot]

jdk jdk copied to clipboard

8314125: RISC-V: implement Base64 intrinsic - encoding

Test

Progress

Issue

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard