jdk 8332265: RISC-V: Materialize pointers faster by using a temp register

Hi, please consider!

Materializing a 48-bit pointer, using an additional register, we can do with: lui + lui + slli + add + addi This 15% faster both on VF2 and in CPU models, compared to movptr().

As we often materialize during calls there is free registers.

I have choose just a few spot to use it, many more can use. E.g. la() with tmp register can use li48 instead of movptr.

Running tests now (so far so good), as if I screwed up IC calls it should be seen fast. And benchmarks when hardware is free.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8332265: RISC-V: Materialize pointers faster by using a temp register (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19246/head:pull/19246
$ git checkout pull/19246

Update a local copy of the PR:
$ git checkout pull/19246
$ git pull https://git.openjdk.org/jdk.git pull/19246/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19246

View PR using the GUI difftool:
$ git pr show -t 19246

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19246.diff

Webrev

Link to Webrev Comment

May 15 '24 09:05 robehn

:wave: Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

May 15 '24 09:05 bridgekeeper[bot]

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8332265: RISC-V: Materialize pointers faster by using a temp register

Reviewed-by: fyang, luhenry, mli

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

May 15 '24 09:05 openjdk[bot]

@robehn The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

May 15 '24 09:05 openjdk[bot]

Webrevs

May 15 '24 09:05 mlbridge[bot]

Yes, but it's a long term job, as you need to free a register in many cases. (in non-call sites places) All callsites should be easy to change as you have plenty of callee saved registers which are already saved when using movptr.

As li48 is faster than li when using more than 32-bits these cases should also use li48. I.e. mv t0, addr But mv is fishy partly because of RegisterOrConstant constructor, so we can't tell in mv if this was an address or not. I have been looking into cleaning that up, so mv with literal and mv with address is two seperate cases. To keep them apart would be to use e.g. "li reg, literal" and "li48 reg, temp_reg, address".

As there is much work, this PR is intended as the first step with the hardest peices implemented already, i.e. li48 is ready to go.

If we also fix mov_metadata la()->li48 we reduce static call stub size down from 12 to 10 instruction, which is significant. That one is on my todo list.

May 16 '24 07:05 robehn

Yes, but it's a long term job, as you need to free a register in many cases. (in non-call sites places) All callsites should be easy to change as you have plenty of callee saved registers which are already saved when using movptr.

OK, I guess this might be a good compromise. Inspired by PPC's Assembler::load_const, MacroAssembler::get_const and MacroAssembler::patch_const [1-3], I think we could have a similar design. Adding one extra tmp register param for movptr like void movptr(Register Rd, address addr, int32_t &offset, Register tmp=noreg);, we can factor out li48 then. The only difference compared with PPC's solution is that that we will have different sizes depending on whether we could find a tmp register for movptr. But I guess that's not a big issue? We can add a reference param (say, size) to existing is_movptr_at/is_movptr, we get the correct size when checking the instruction sequence and return this in size. I prefer this way as I don't think it's a good idea to have more li variants like li48 and it's really bad to have both movptr and li48 which duplicate the functionality. We should also remove the existing li64 which is not used anywhere. Could you please consider this?

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/assembler_ppc.cpp#L323 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp#L327 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp#L349

May 20 '24 02:05 RealFYang

Hey, I did an update, not fully what you are saying. We are missing a lot of 'abstraction' e.g. take a look at CodeInstaller::pd_next_offset in jvmciCodeInstaller. I think this code should look like:

jint CodeInstaller::pd_next_offset(NativeInstruction* inst, jint pc_offset, JVMCI_TRAPS) {
  if(inst->is_call() || inst->is_jump() || inst->is_movptr()) {
    return pc_offset + inst->size();
  }
  JVMCI_ERROR_0("unsupported type of instruction for call site");
}

But I need to add a bunch of stuff to unrelated NativeInst, I think that is better suited in another PR.

May 20 '24 13:05 robehn

Thanks @luhenry ! Thanks for the second review pass @RealFYang !

May 22 '24 08:05 robehn

Yes, thanks, done!

May 22 '24 18:05 robehn

Thanks again!

May 23 '24 11:05 robehn

And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.

May 24 '24 10:05 Hamlin-Li

Updated change looks good. It would be nice to see how much this will benefit performance.

I tried todo some benchmarks but it seems like the error of them are larger than the benefit. I'll try todo some longer runs, and minimize the error.

May 24 '24 11:05 robehn

And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.

I agree, I would prefer having classes for the instruction where all the instruction functionality would be. As it's now the opcodes are reapted everywhere, instead it should just be in in-place, this class. And then have classes for instruction sequence where we keep all functionality gathered.

May 24 '24 12:05 robehn

And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.

I agree, I would prefer having classes for the instruction where all the instruction functionality would be. As it's now the opcodes are reapted everywhere, instead it should just be in in-place, this class. And then have classes for instruction sequence where we keep all functionality gathered.

OK, let me do some further investigation to see if we can make it more readable and maintainable.

May 24 '24 13:05 Hamlin-Li

OK, let's move forward. :) I create some bugs to track the further work. https://bugs.openjdk.org/browse/JDK-8332899 https://bugs.openjdk.org/browse/JDK-8332900

Feel free to take them if you're also interested in them.

Thank you!

May 24 '24 13:05 robehn

Updated change looks good. It would be nice to see how much this will benefit performance.

Here are 'some' number, it still unclear if these actually are significant:

BASELINE                       | movptr2
fop                  3120 msec |  2811 msec =0.900962
h2                  19156 msec | 17600 msec =0.918772
jython              24060 msec | 23343 msec =0.9702
luindex              3222 msec |  3226 msec =1.00124
lusearch             4383 msec |  4380 msec =0.999316
lusearch-fix         4096 msec |  4359 msec =1.06421
pmd                  7417 msec |  7342 msec =0.989888
jython              24060 msec | 23343 msec =0.9702
fop(Xcomp)           3060 msec |  3058 msec =0.999346
h2(Xcomp)           38724 msec | 38717 msec =0.999819
jython(Xcomp)       29999 msec | 29694 msec =0.989833
luindex(Xcomp)       5259 msec |  5195 msec =0.98783
lusearch(Xcomp)      6364 msec |  6269 msec =0.985072
lusearch-fix(Xcomp)  6430 msec |  6534 msec =1.01617
pmd(Xcomp)           7360 msec |  6999 msec =0.950951
jython(Xcomp)       29999 msec | 29694 msec =0.989833
Avg:0.983353

The issue is that reaching steady state AKA varmup takes tremondiusly long time. As it takes long time it causes compilations differences between runs and it's unclear if we actually have finished varmup. Hence I run with Xcomp to get a more stable result.

Integrating later today!

May 28 '24 06:05 robehn

/integrate

May 28 '24 12:05 robehn

Going to push as commit 7b52d0acfc7d6083b407efa0877c139e9837f86b. Since your change was applied there have been 23 commits pushed to the master branch:

aa4c83a5bfe146714a46fb454aafc7393d2d8453: 8332505: JEP 457: ClassRemapper forgets to remap bootstrap method references
cabe337400a0bd61d73bf3ca66e16266267299c7: 8331921: Hotspot assembler files should use common logic to setup exported functions
2edb6d98133d8bd6dc4527c7497c460283fdc53e: 8330386: Replace Opaque4Node of Initialized Assertion Predicate with new OpaqueInitializedAssertionPredicateNode
1850914380655ef3d08614a5656e7cc23478f38f: 8332864: Parallel: Merge ParMarkBitMapClosure into MoveAndUpdateClosure
2f2cf38bb5cecea698e519396574343cfbe4f359: 8332883: Some simple cleanup in vectornode.cpp
b5e1615c0084538f2161fe9b56748d188983e972: 8292955: Collections.checkedMap Map.merge does not properly check key and value
86eb5d9f3be30ff9df1318f18ab73c7129c978f6: 8329958: Windows x86 build fails: downcallLinker.cpp(36) redefinition
be1d374bc54d43aae3b3c1feace22d38fe2156b6: 8332825: ubsan: guardedMemory.cpp:35:11: runtime error: null pointer passed as argument 2, which is declared to never be null
ed81a478e175631f1de69eb4b43f927629fefd74: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic
08face8c4cd2d0b6f48f99bae5a380b7f7e4f2c2: 8332890: Module imports don't work inside the same module
... and 13 more: https://git.openjdk.org/jdk/compare/16dba04e8dfa871f8056480a42a9baeb24a2fb24...master

Your commit was automatically rebased without conflicts.

May 28 '24 12:05 openjdk[bot]

@robehn Pushed as commit 7b52d0acfc7d6083b407efa0877c139e9837f86b.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

May 28 '24 12:05 openjdk[bot]