8332265: RISC-V: Materialize pointers faster by using a temp register
Hi, please consider!
Materializing a 48-bit pointer, using an additional register, we can do with: lui + lui + slli + add + addi This 15% faster both on VF2 and in CPU models, compared to movptr().
As we often materialize during calls there is free registers.
I have choose just a few spot to use it, many more can use. E.g. la() with tmp register can use li48 instead of movptr.
Running tests now (so far so good), as if I screwed up IC calls it should be seen fast. And benchmarks when hardware is free.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8332265: RISC-V: Materialize pointers faster by using a temp register (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19246/head:pull/19246
$ git checkout pull/19246
Update a local copy of the PR:
$ git checkout pull/19246
$ git pull https://git.openjdk.org/jdk.git pull/19246/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19246
View PR using the GUI difftool:
$ git pr show -t 19246
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19246.diff
Webrev
:wave: Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@robehn This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8332265: RISC-V: Materialize pointers faster by using a temp register
Reviewed-by: fyang, luhenry, mli
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@robehn The following label will be automatically applied to this pull request:
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 06: Full (8dff3f12)
- 05: Full - Incremental (cf4c3066)
- 04: Full (9134d4e8)
- 03: Full - Incremental (e017302b)
- 02: Full - Incremental (c406294a)
- 01: Full - Incremental (edfdda28)
- 00: Full (f5efaca8)
Yes, but it's a long term job, as you need to free a register in many cases. (in non-call sites places) All callsites should be easy to change as you have plenty of callee saved registers which are already saved when using movptr.
As li48 is faster than li when using more than 32-bits these cases should also use li48. I.e. mv t0, addr But mv is fishy partly because of RegisterOrConstant constructor, so we can't tell in mv if this was an address or not. I have been looking into cleaning that up, so mv with literal and mv with address is two seperate cases. To keep them apart would be to use e.g. "li reg, literal" and "li48 reg, temp_reg, address".
As there is much work, this PR is intended as the first step with the hardest peices implemented already, i.e. li48 is ready to go.
If we also fix mov_metadata la()->li48 we reduce static call stub size down from 12 to 10 instruction, which is significant. That one is on my todo list.
Yes, but it's a long term job, as you need to free a register in many cases. (in non-call sites places) All callsites should be easy to change as you have plenty of callee saved registers which are already saved when using movptr.
OK, I guess this might be a good compromise. Inspired by PPC's Assembler::load_const, MacroAssembler::get_const and MacroAssembler::patch_const [1-3], I think we could have a similar design. Adding one extra tmp register param for movptr like void movptr(Register Rd, address addr, int32_t &offset, Register tmp=noreg);, we can factor out li48 then.
The only difference compared with PPC's solution is that that we will have different sizes depending on whether we could find a tmp register for movptr. But I guess that's not a big issue? We can add a reference param (say, size) to existing is_movptr_at/is_movptr, we get the correct size when checking the instruction sequence and return this in size.
I prefer this way as I don't think it's a good idea to have more li variants like li48 and it's really bad to have both movptr and li48 which duplicate the functionality. We should also remove the existing li64 which is not used anywhere. Could you please consider this?
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/assembler_ppc.cpp#L323 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp#L327 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp#L349
Hey, I did an update, not fully what you are saying. We are missing a lot of 'abstraction' e.g. take a look at CodeInstaller::pd_next_offset in jvmciCodeInstaller. I think this code should look like:
jint CodeInstaller::pd_next_offset(NativeInstruction* inst, jint pc_offset, JVMCI_TRAPS) {
if(inst->is_call() || inst->is_jump() || inst->is_movptr()) {
return pc_offset + inst->size();
}
JVMCI_ERROR_0("unsupported type of instruction for call site");
}
But I need to add a bunch of stuff to unrelated NativeInst, I think that is better suited in another PR.
Thanks @luhenry ! Thanks for the second review pass @RealFYang !
Yes, thanks, done!
Thanks again!
And a general question about nativeInst_riscv.cpp and macroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr.
I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.
Updated change looks good. It would be nice to see how much this will benefit performance.
I tried todo some benchmarks but it seems like the error of them are larger than the benefit. I'll try todo some longer runs, and minimize the error.
And a general question about
nativeInst_riscv.cppandmacroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.
I agree, I would prefer having classes for the instruction where all the instruction functionality would be. As it's now the opcodes are reapted everywhere, instead it should just be in in-place, this class. And then have classes for instruction sequence where we keep all functionality gathered.
And a general question about
nativeInst_riscv.cppandmacroAssembler_riscv.cpp. I saw the functions in these 2 files call each other, that make the code a bit mess to me. It's not an issue introduced in this pr. I wonder if this could be refactored? If so, I can work on it. But just in case you have easy answer already, so I don't have to do further investigation.I agree, I would prefer having classes for the instruction where all the instruction functionality would be. As it's now the opcodes are reapted everywhere, instead it should just be in in-place, this class. And then have classes for instruction sequence where we keep all functionality gathered.
OK, let me do some further investigation to see if we can make it more readable and maintainable.
OK, let's move forward. :) I create some bugs to track the further work. https://bugs.openjdk.org/browse/JDK-8332899 https://bugs.openjdk.org/browse/JDK-8332900
Feel free to take them if you're also interested in them.
Thank you!
Updated change looks good. It would be nice to see how much this will benefit performance.
Here are 'some' number, it still unclear if these actually are significant:
BASELINE | movptr2
fop 3120 msec | 2811 msec =0.900962
h2 19156 msec | 17600 msec =0.918772
jython 24060 msec | 23343 msec =0.9702
luindex 3222 msec | 3226 msec =1.00124
lusearch 4383 msec | 4380 msec =0.999316
lusearch-fix 4096 msec | 4359 msec =1.06421
pmd 7417 msec | 7342 msec =0.989888
jython 24060 msec | 23343 msec =0.9702
fop(Xcomp) 3060 msec | 3058 msec =0.999346
h2(Xcomp) 38724 msec | 38717 msec =0.999819
jython(Xcomp) 29999 msec | 29694 msec =0.989833
luindex(Xcomp) 5259 msec | 5195 msec =0.98783
lusearch(Xcomp) 6364 msec | 6269 msec =0.985072
lusearch-fix(Xcomp) 6430 msec | 6534 msec =1.01617
pmd(Xcomp) 7360 msec | 6999 msec =0.950951
jython(Xcomp) 29999 msec | 29694 msec =0.989833
Avg:0.983353
The issue is that reaching steady state AKA varmup takes tremondiusly long time. As it takes long time it causes compilations differences between runs and it's unclear if we actually have finished varmup. Hence I run with Xcomp to get a more stable result.
Integrating later today!
/integrate
Going to push as commit 7b52d0acfc7d6083b407efa0877c139e9837f86b.
Since your change was applied there have been 23 commits pushed to the master branch:
- aa4c83a5bfe146714a46fb454aafc7393d2d8453: 8332505: JEP 457: ClassRemapper forgets to remap bootstrap method references
- cabe337400a0bd61d73bf3ca66e16266267299c7: 8331921: Hotspot assembler files should use common logic to setup exported functions
- 2edb6d98133d8bd6dc4527c7497c460283fdc53e: 8330386: Replace Opaque4Node of Initialized Assertion Predicate with new OpaqueInitializedAssertionPredicateNode
- 1850914380655ef3d08614a5656e7cc23478f38f: 8332864: Parallel: Merge ParMarkBitMapClosure into MoveAndUpdateClosure
- 2f2cf38bb5cecea698e519396574343cfbe4f359: 8332883: Some simple cleanup in vectornode.cpp
- b5e1615c0084538f2161fe9b56748d188983e972: 8292955: Collections.checkedMap Map.merge does not properly check key and value
- 86eb5d9f3be30ff9df1318f18ab73c7129c978f6: 8329958: Windows x86 build fails: downcallLinker.cpp(36) redefinition
- be1d374bc54d43aae3b3c1feace22d38fe2156b6: 8332825: ubsan: guardedMemory.cpp:35:11: runtime error: null pointer passed as argument 2, which is declared to never be null
- ed81a478e175631f1de69eb4b43f927629fefd74: 8327964: Simplify BigInteger.implMultiplyToLen intrinsic
- 08face8c4cd2d0b6f48f99bae5a380b7f7e4f2c2: 8332890: Module imports don't work inside the same module
- ... and 13 more: https://git.openjdk.org/jdk/compare/16dba04e8dfa871f8056480a42a9baeb24a2fb24...master
Your commit was automatically rebased without conflicts.
@robehn Pushed as commit 7b52d0acfc7d6083b407efa0877c139e9837f86b.
:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.