jdk 8332689: RISC-V: Use load instead of trampolines

trafficstars

Hi all, please consider!

Today we do JAL to dest if dest is in reach (+/- 1 MB). Using a very small application or running very short time we have fast patchable calls. But any normal application running longer will increase the code size and code chrun/fragmentation. So whatever or not you get hot fast calls rely on luck.

To be patchable and get code cache reach we also emit a stub trampoline which we can point the JAL to. This would be the common case for a patchable call.

Code stream:
JAL <trampo>
Stubs:
AUIPC
LD
JALR
<DEST>

On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive. Even if you don't have that problem having a call to a jump is not the fastest way. Loading the address avoids the pitsfalls of cmodx.

This patch suggest to solve the problems with trampolines, we take small penalty in the naive case of JAL to dest, and instead do by default:

Code stream:
AUIPC
LD
JALR
Stubs:
<DEST>

An experimental option for turning trampolines back on exists.

It should be possible to enhanced this with the WIP Zjid by changing the JALR to JAL and nop out the auipc+ld (as the current proposal of Zjid forces the I-fetcher to fetch instruction in order (meaning we will avoid a lot issues which arm has)) when in reach and vice-versa.

Numbers from VF2 (I have done them a few times, they are always overall in favor of this patch):

fop                                        (msec)    2239       |  2128       =  0.950424
h2                                         (msec)    18660      |  16594      =  0.889282
jython                                     (msec)    22022      |  21925      =  0.995595
luindex                                    (msec)    2866       |  2842       =  0.991626
lusearch                                   (msec)    4108       |  4311       =  1.04942
lusearch-fix                               (msec)    4406       |  4116       =  0.934181
pmd                                        (msec)    5976       |  5897       =  0.98678
jython                                     (msec)    22022      |  21925      =  0.995595
Avg:                                       0.974112                              
fop(xcomp)                                 (msec)    2721       |  2714       =  0.997427
h2(xcomp)                                  (msec)    37719      |  38004      =  1.00756
jython(xcomp)                              (msec)    28563      |  29470      =  1.03175
luindex(xcomp)                             (msec)    5303       |  5512       =  1.03941
lusearch(xcomp)                            (msec)    6702       |  6271       =  0.935691
lusearch-fix(xcomp)                        (msec)    6721       |  6217       =  0.925011
pmd(xcomp)                                 (msec)    6835       |  6587       =  0.963716
jython(xcomp)                              (msec)    28563      |  29470      =  1.03175
Avg:                                       0.99154                               
o.r.actors.JmhAkkaUct.run                  (ms/op)   8585.440   |  7548.347   =  0.879203
o.r.actors.JmhReactors.run                 (ms/op)   65004.694  |  64448.824  =  0.991449
o.r.jdk.concurrent.JmhFjKmeans.run         (ms/op)   47751.653  |  45747.490  =  0.958029
o.r.jdk.concurrent.JmhFutureGenetic.run    (ms/op)   12083.628  |  11427.650  =  0.945713
o.r.jdk.streams.JmhMnemonics.run           (ms/op)   32691.025  |  31002.088  =  0.948336
o.r.jdk.streams.JmhParMnemonics.run        (ms/op)   27500.431  |  23747.117  =  0.863518
o.r.jdk.streams.JmhScrabble.run            (ms/op)   3688.182   |  3528.943   =  0.956825
o.r.neo4j.JmhNeo4jAnalytics.run            (ms/op)   20153.371  |  21704.731  =  1.07698
o.r.rx.JmhRxScrabble.run                   (ms/op)   1197.749   |  1160.465   =  0.968872
o.r.scala.dotty.JmhDotty.run               (ms/op)   18385.552  |  18561.341  =  1.00956
o.r.scala.sat.JmhScalaDoku.run             (ms/op)   25243.887  |  22112.289  =  0.875946
o.r.scala.stdlib.JmhScalaKmeans.run        (ms/op)   2610.509   |  2498.539   =  0.957108
o.r.scala.stm.JmhPhilosophers.run          (ms/op)   5875.997   |  6101.689   =  1.03841
o.r.scala.stm.JmhScalaStmBench7.run        (ms/op)   8723.122   |  8760.115   =  1.00424
o.r.twitter.finagle.JmhFinagleChirper.run  (ms/op)   21209.541  |  21732.213  =  1.02464
o.r.twitter.finagle.JmhFinagleHttp.run     (ms/op)   20782.221  |  20390.960  =  0.981173
Avg:                                       0.9675

It's been throught a couple of t1-t3, but I need to re-run test after latest merge.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8332689: RISC-V: Use load instead of trampolines (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19453/head:pull/19453
$ git checkout pull/19453

Update a local copy of the PR:
$ git checkout pull/19453
$ git pull https://git.openjdk.org/jdk.git pull/19453/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19453

View PR using the GUI difftool:
$ git pr show -t 19453

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19453.diff

Webrev

Link to Webrev Comment

May 29 '24 12:05 robehn

:wave: Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

May 29 '24 12:05 bridgekeeper[bot]

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8332689: RISC-V: Use load instead of trampolines

Reviewed-by: fyang, mli, luhenry

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 9 new commits pushed to the master branch:

b32e4a68bca588d908bd81a398eb3171a6876dc5: 8335356: Shenandoah: Improve concurrent cleanup locking
62cbf70346e78ca94ce6ea4ba5a308ea0a2bbfa8: 8336085: Fix simple -Wzero-as-null-pointer-constant warnings in CDS code
2928753bd95356467e4fe42ee391e45d1cb6e89c: 8324966: Allow selecting jtreg test case by ID from make
1772a929af0c31bf22153cc19c5d11b00273453b: 8334457: Test javax/swing/JTabbedPane/bug4666224.java fail on macOS with because pressing the ‘C’ key does not switch the layout to WRAP_TAB_LAYOUT
b7d0eff5ad77e338b237773d2fc047eea3d2ac12: 8207908: JMXStatusTest.java fails assertion intermittently
cf940e139a76e5aabd52379b8a87065d82b2284c: 8335553: [Graal] Compiler thread calls into jdk.internal.vm.VMSupport.decodeAndThrowThrowable and crashes in OOM situation
b363de8c9fbf7d9e4aade41a2e883cc83ced320b: 8335946: DTrace code snippets should be generated when DTrace flags are enabled
d6c6847e32673d36a1958cefd1851ec9f3b1e2ad: 8335743: jhsdb jstack cannot print some information on the waiting thread
cad68e06ecad1e19091d1af9c0f9b8145d6842fb: 8335935: Chained builders not sending transformed models to next transforms

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

May 29 '24 12:05 openjdk[bot]

@robehn The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

May 29 '24 12:05 openjdk[bot]

Webrevs

May 29 '24 14:05 mlbridge[bot]

Interesting. I will take a look and play with it on my machines.

May 31 '24 06:05 RealFYang

Interesting. I will take a look and play with it on my machines.

Thanks. I don't expected this will be done before RDP1, so take your time.

Note that this also somewhat fixes issues with out of order machines spectulation. I.e. the dest in the trampoline may very well be I-fetched and decoded which may trigger pipeline flush or similar if it's decoded as something crazy to the CPU. (obviously very implementation dependent)

(to be guaranteed to have fixed that issue completely we need to put the data on a page without execute)

May 31 '24 06:05 robehn

I see new classes are added in nativeInst, maybe the comments at the top of nativeInst.hpp needs updated accordingly.

Jun 04 '24 11:06 Hamlin-Li

I see new classes are added in nativeInst, maybe the comments at the top of nativeInst.hpp needs updated accordingly.

They are all private, moved them to cpp.

Jun 04 '24 16:06 robehn

It just passed t1(+UseTramp) and t1-3, I'll restart due to the large merge.

Jun 04 '24 16:06 robehn

I sent a patch addressing @RealFYang concern. (https://github.com/openjdk/jdk/pull/19556) And I update the code in preparation for that, so above PR goes in first and when I merge with it the concern should be addressed. (We are assuming a JVM CI will respect the calling convetion flag UseTrampolines)

Jun 05 '24 13:06 robehn

Mailing list message from Andrew Haley on hotspot-dev:

On 5/29/24 15:28, Robbin Ehn wrote:

On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive.

Wouldn't it be a lot easier simply to put the target address loaded by the trampoline into the constant pool?

Even if you don't have that problem having a call to a jump is not the fastest way.

I guess the real problem there is that the jalr #imm range is pretty short, so taking a trampoline is a very common case, and you have to optimize for that. On AArch64 we optimize for the simple branch.

BTW, on AArch64 we don't have all the problems described in Zjid. For example, if you modify code, do the icache invalidate dance, then patch a jump so that it points to the newly-modified code, every observer sees the new code that was modified before it sees the patched jump.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Jun 07 '24 17:06 mlbridge[bot]

Mailing list message from Andrew Haley on hotspot-dev:

On 5/29/24 15:28, Robbin Ehn wrote:

On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive.

Wouldn't it be a lot easier simply to put the target address loaded by the trampoline into the constant pool?

Seem to me that will be more cleaner than the current solution (MacroAssembler::emit_address_stub which uses trampoline_stub_Relocation::spec relocation holder but emits an 'address stub' instead of a real trampline). And I see PPC is putting the entry point as a constant into the constant pool [1] when emitting a call with trampoline stub.

[1] MacroAssembler::emit_address_stub

Jun 10 '24 03:06 RealFYang

Mailing list message from Andrew Haley on hotspot-dev: On 5/29/24 15:28, Robbin Ehn wrote:

On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive.

Wouldn't it be a lot easier simply to put the target address loaded by the trampoline into the constant pool?

Seem to me that will be more cleaner than the current solution (MacroAssembler::emit_address_stub which uses trampoline_stub_Relocation::spec relocation holder but emits an 'address stub' instead of a real trampline). And I see PPC is putting the entry point as a constant into the constant pool [1] when emitting a call with trampoline stub.

[1] MacroAssembler::emit_address_stub

This was just a bit easier as I have both cases. I'll look into cp.

Thanks

Jun 10 '24 05:06 robehn

Mailing list message from Andrew Haley on hotspot-dev: On 5/29/24 15:28, Robbin Ehn wrote:

On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive.

Wouldn't it be a lot easier simply to put the target address loaded by the trampoline into the constant pool?

Seem to me that will be more cleaner than the current solution (MacroAssembler::emit_address_stub which uses trampoline_stub_Relocation::spec relocation holder but emits an 'address stub' instead of a real trampline). And I see PPC is putting the entry point as a constant into the constant pool [1] when emitting a call with trampoline stub. [1] MacroAssembler::emit_address_stub

This was just a bit easier as I have both cases. I'll look into cp.

Thanks

ppc version is not possible as CP offset is stored in instruction stream. We need to use runtime_call_w_cp_type, as s390, to keep track of the offset when CP moves around during growth/relocation. It needs a bit work. What you think?

Jun 11 '24 11:06 robehn

Mailing list message from Andrew Haley on hotspot-dev: On 5/29/24 15:28, Robbin Ehn wrote:

On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive.

Wouldn't it be a lot easier simply to put the target address loaded by the trampoline into the constant pool?

Seem to me that will be more cleaner than the current solution (MacroAssembler::emit_address_stub which uses trampoline_stub_Relocation::spec relocation holder but emits an 'address stub' instead of a real trampline). And I see PPC is putting the entry point as a constant into the constant pool [1] when emitting a call with trampoline stub. [1] MacroAssembler::emit_address_stub

This was just a bit easier as I have both cases. I'll look into cp. Thanks

ppc version is not possible as CP offset is stored in instruction stream. We need to use runtime_call_w_cp_type, as s390, to keep track of the offset when CP moves around during growth/relocation. It needs a bit work. What you think?

Ah, that doesn't seems to be easier than before ~. I am not familar with s390 and I didn't see where it updates the target address in CP. I will take another look at our current approach and see if there are further improvements we can do.

Jun 12 '24 07:06 RealFYang

Ah, that doesn't seems to be easier than before ~. I am not familar with s390 and I didn't see where it updates the target address in CP. I will take another look at our current approach and see if there are further improvements we can do.

I have tested a few different approaches, as we are crossing code sections we need to track this. The two which seems most reasoanble is:

Add offset to the four relocators static/virt/opt/runtime as RV only. When we set the destination on relocation we update the instructions to point to CP + offset (where we get offset from relocator), and finally set the new destination at this location.
Add a second relocator to the call which tracks the constant pool address. When we set the destination on relocation we lookup the second relocator and get the address from it, and finally set the dest at this location.
Maybe there is a smarter way not apparent to me?

As there is a lot of assumptions about relocators both requires random shared code changes. E.g. CodeBuffers with 0 sized CP, assuming only one relocator for PC-range, etc...

I suggest this require a more elaborate fix as I would want to add support for global table (in range of auipc+ld from CC) where we can put shared addresses. I think the second solution with additional relocator where we just get the address, which then could either be local CP or global CP would be best? Hence I'm suggesting this work would be done outside this PR, and suggest we keep using stubs for now.

Cleaned up a patch for using dual relocators (very similar solution to trampoline relocator), I guess it turn out not too bad: https://github.com/robehn/jdk/compare/8332689...robehn:jdk:8332689_cp

Jun 13 '24 07:06 robehn

FYI, if nothing else as reference, rebased incremental patched using constant pool here (sanity tested): https://github.com/robehn/jdk/compare/8332689...robehn:jdk:8332689_cp

Jun 26 '24 12:06 robehn

If there is no major issues, I suggest we should consider ship now. As it is early in the cycle this will get a lot of bake-time, and there will be plenty of time to do additional changes, and even possible to change the default to trampoline calls.

I'll re-start all testing once again :)

Jul 04 '24 06:07 robehn

Thanks @Hamlin-Li @RealFYang !

Jul 04 '24 14:07 robehn

I have not seen (new) issues in testing. I would have prefered one or two more reviewers, but since RV is not the biggest platform I'll settle with just passing the bar. I'll go ahead and integrate if @RealFYang and @Hamlin-Li re-reviews (as the new rules are in-effect which require latest rev to be reviewed).

Jul 10 '24 11:07 robehn

Also performed tier1-3 and hotspot:tier4 on my unmatched boards. Result looks fine. Just witnessed several unnecessary uses of namespace Assembler. Guess you might want to clean it up? Still good otherwise.

diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
index b39ac79be6b..e349eab3177 100644
--- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
+++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
@@ -983,9 +983,9 @@ void MacroAssembler::load_link_jump(const address source, Register temp) {
   assert_cond(source != nullptr);
   int64_t distance = source - pc();
   assert(is_simm32(distance), "Must be");
-  Assembler::auipc(temp, (int32_t)distance + 0x800);
-  Assembler::ld(temp, temp, ((int32_t)distance << 20) >> 20);
-  Assembler::jalr(x1, temp, 0);
+  auipc(temp, (int32_t)distance + 0x800);
+  ld(temp, Address(temp, ((int32_t)distance << 20) >> 20));
+  jalr(temp);
 }

 void MacroAssembler::jump_link(const address dest, Register temp) {
@@ -994,7 +994,7 @@ void MacroAssembler::jump_link(const address dest, Register temp) {
   int64_t distance = dest - pc();
   assert(is_simm21(distance), "Must be");
   assert((distance % 2) == 0, "Must be");
-  Assembler::jal(x1, distance);
+  jal(x1, distance);
 }

 void MacroAssembler::j(const address dest, Register temp) {

Jul 10 '24 14:07 RealFYang

Also performed tier1-3 and hotspot:tier4 on my unmatched boards. Result looks fine. Just witnessed several unnecessary uses of namespace Assembler. Guess you might want to clean it up? Still good otherwise.

Thanks, fixed!

Jul 10 '24 20:07 robehn

I have not seen (new) issues in testing. I would have prefered one or two more reviewers, but since RV is not the biggest platform I'll settle with just passing the bar. I'll go ahead and integrate if @RealFYang and @Hamlin-Li re-reviews (as the new rules are in-effect which require latest rev to be reviewed).

Still good to me. Thanks!

Jul 10 '24 20:07 Hamlin-Li

Thanks! @Hamlin-Li please re-approve, background: https://mail.openjdk.org/pipermail/jdk-dev/2024-July/009199.html

Jul 11 '24 09:07 robehn

Thanks! @Hamlin-Li please re-approve, background: https://mail.openjdk.org/pipermail/jdk-dev/2024-July/009199.html

Done.

Jul 11 '24 10:07 Hamlin-Li

/integrate

Thank you all for sticking with it!

Jul 11 '24 10:07 robehn

Going to push as commit 5c612c230b0a852aed5fd36e58b82ebf2e1838af. Since your change was applied there have been 10 commits pushed to the master branch:

6fcd49f9431cc3507f96ef2acdca43fc6a394a14: 8336239: Fix javadoc markup in java.lang.Process
b32e4a68bca588d908bd81a398eb3171a6876dc5: 8335356: Shenandoah: Improve concurrent cleanup locking
62cbf70346e78ca94ce6ea4ba5a308ea0a2bbfa8: 8336085: Fix simple -Wzero-as-null-pointer-constant warnings in CDS code
2928753bd95356467e4fe42ee391e45d1cb6e89c: 8324966: Allow selecting jtreg test case by ID from make
1772a929af0c31bf22153cc19c5d11b00273453b: 8334457: Test javax/swing/JTabbedPane/bug4666224.java fail on macOS with because pressing the ‘C’ key does not switch the layout to WRAP_TAB_LAYOUT
b7d0eff5ad77e338b237773d2fc047eea3d2ac12: 8207908: JMXStatusTest.java fails assertion intermittently
cf940e139a76e5aabd52379b8a87065d82b2284c: 8335553: [Graal] Compiler thread calls into jdk.internal.vm.VMSupport.decodeAndThrowThrowable and crashes in OOM situation
b363de8c9fbf7d9e4aade41a2e883cc83ced320b: 8335946: DTrace code snippets should be generated when DTrace flags are enabled
d6c6847e32673d36a1958cefd1851ec9f3b1e2ad: 8335743: jhsdb jstack cannot print some information on the waiting thread
cad68e06ecad1e19091d1af9c0f9b8145d6842fb: 8335935: Chained builders not sending transformed models to next transforms

Your commit was automatically rebased without conflicts.

Jul 11 '24 10:07 openjdk[bot]

@robehn Pushed as commit 5c612c230b0a852aed5fd36e58b82ebf2e1838af.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Jul 11 '24 10:07 openjdk[bot]

jdk jdk copied to clipboard

8332689: RISC-V: Use load instead of trampolines

Progress

Issue

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard