jdk
jdk copied to clipboard
8325821: [REDO] use "dmb.ishst+dmb.ishld" for release barrier
The origin patch for https://bugs.openjdk.org/browse/JDK-8324186 has 2 issues: 1 It show regression in some platform, like Apple silicon in mac os 2 Can not handle instruction sequence like "dmb.ishld; dmb.ishst; dmb.ishld; dmb.ishld"
It can be fixed by: 1 Enable AlwaysMergeDMB by default, only disable it in architecture we can see performance improvement (N1 or N2) 2 Check the special pattern and merge the subsequent dmb.
It also fix a bug when code buffer is expanding, st/ld/dmb can not be merged. I added unit tests for these.
This patch still has a unhandled case. Insts like "dmb.ishld; dmb.ishst; dmb.ish", it will merge the last 2 instructions and can not merge all three. Because when emitting dmb.ish, if merge all previous dmbs, the code buffer will shrink the size. I think it may break some resumption and think it's not a common pattern.
- Update: After discussion, I made a new implementation based on finite state machine for merging instruction. The mergeable instruction will be pending in fsm until next unmergeable instruction.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8325821: [REDO] use "dmb.ishst+dmb.ishld" for release barrier (Enhancement - P4)
Contributors
- Andrew Haley
<[email protected]>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18467/head:pull/18467
$ git checkout pull/18467
Update a local copy of the PR:
$ git checkout pull/18467
$ git pull https://git.openjdk.org/jdk.git pull/18467/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18467
View PR using the GUI difftool:
$ git pr show -t 18467
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18467.diff
Webrev
:wave: Welcome back kuaiwei! A progress list of the required criteria for merging this PR into master
will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@kuaiwei The following label will be automatically applied to this pull request:
-
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 08: Full (c5a26a3a)
- 07: Full - Incremental (1e9fb025)
- 06: Full - Incremental (4bd183fb)
- 05: Full (e2d3e1e4)
- 04: Full - Incremental (57311189)
- 03: Full - Incremental (1a49c60c)
- 02: Full - Incremental (fe4f4f20)
- 01: Full - Incremental (8ae4496e)
- 00: Full (29e39bf0)
Are there any possible sequences of dmb ld; dmb st
that will not be minimized?
Are there any possible sequences of
dmb ld; dmb st
that will not be minimized?
The worst case is "dmb ld; dmb st; dmb ish", that will be merged as "dmb ld; dmb ish".
Are there any possible sequences of
dmb ld; dmb st
that will not be minimized?The worst case is "dmb ld; dmb st; dmb ish", that will be merged as "dmb ld; dmb ish".
That's not too bad, I suppose, but to me it feels a little like unfinished business.
If I were you I'd accumulate DMB
s as 3 bits (ld, st, and sy) and emit the Right Thing on the first non-DMB instruction. There's the interesting possibility that a DMB might be emitted as the last instruction before the end of the lifetime of an assembler. In that case I guess I'd do some experiments to find out what worked.
Also consider a queue? We already accumulate bits of an instruction in Instruction_aarch64::bits
. Maybe that could be elaborated to defer DMB
instructions?
Also consider a queue? We already accumulate bits of an instruction in
Instruction_aarch64::bits
. Maybe that could be elaborated to deferDMB
instructions?
Or maybe not if that's too complicated. If there's no way to minimize the sequence without a lot of work, I can live with it.
Thanks!
Thinking some more, I've had a better idea. Consider a finite state machine. The states would perhaps be {pending_none, pending_ld, pending_st, pending_ldst}. Every MacroAssembler::emit()
, along with bind()
, advances the machine, and emits instructions on some state changes. This state machine would be used instead of code()->last_insn()
Hi Andrew, thanks for your suggestions. I will check them and refine the implementation. The difficult part is we need finalize the "dmb" at a non-dmb instruction. It looks Instruction_aarch64 is a good choice because it's embeded in all instructions.
Thanks
update: I added finite state machine for merging instruction. The patch can be found in https://github.com/openjdk/jdk/commit/1b18e8298b1ef8778b494fb7ed9e4467e0a9a6b8 . Because instructions are pending, I need modify Assembler::pc() and offset() to count the pending instruction size. It may impact relocation. I fixed some failure and still some test failure to be figure out. I'm working on them.
That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip
So I had another thought. I think it'd be easier, and more robust, if you kept the contents of the code buffer valid at all times, and transition the state machine to its initial state whenever there is any attempt to get CodeBuffer::offset()
. That would also minimize the impact of this patch on the rest of the code. You can always back up when you see something like dmb st; dmb ld; dmb ish
.
That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip
Thanks for your enhancement. I found you removed MergeableInst. My idea is the finite state machine could support both dmb and ld/st instructions. I created another task to merge more instruction like 'ldrs/ldrd/strs/strd ...' . https://bugs.openjdk.org/browse/JDK-8329901 . The old implementation is not suitable to support new instruction. I need duplicate to support other register type like float register. So I want to use MergeableInst to extract instruction information. Such as target register, base register, offseet ... . It's my draft plan, how do you think about it?
That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip
Thanks for your enhancement. I found you removed MergeableInst. My idea is the finite state machine could support both dmb and ld/st instructions. I created another task to merge more instruction like 'ldrs/ldrd/strs/strd ...' . https://bugs.openjdk.org/browse/JDK-8329901 . The old implementation is not suitable to support new instruction. I need duplicate to support other register type like float register. So I want to use MergeableInst to extract instruction information. Such as target register, base register, offseet ... . It's my draft plan, how do you think about it?
I think you may have scaling problems with this. If you try to handle multiple kinds of instructions in a single state machine, it is likely to explode in size, and be (even more) confusing to the reader.
That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip
Thanks for your enhancement. I found you removed MergeableInst. My idea is the finite state machine could support both dmb and ld/st instructions. I created another task to merge more instruction like 'ldrs/ldrd/strs/strd ...' . https://bugs.openjdk.org/browse/JDK-8329901 . The old implementation is not suitable to support new instruction. I need duplicate to support other register type like float register. So I want to use MergeableInst to extract instruction information. Such as target register, base register, offseet ... . It's my draft plan, how do you think about it?
I think you may have scaling problems with this. If you try to handle multiple kinds of instructions in a single state machine, it is likely to explode in size, and be (even more) confusing to the reader.
I'm fine with it. So the state machine is only used for dmb. We can figure out how to simplify other instructions later. About CodeBuffer::offset(), does you mean Assembler::offset() ? It worth a try, we may remove the 'const' from its definition and its derived methods.
About CodeBuffer::offset(), does you mean Assembler::offset() ? It worth a try, we may remove the 'const' from its definition and its derived methods.
I think so, yes. It's a little bit odd that simply reading an offset has a side effect, but it affects only the state machine, not any of the contents of the buffer.
/contributor add @theRealAph
@kuaiwei
Contributor Andrew Haley <[email protected]>
successfully added.
@kuaiwei Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.
Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.
Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.
Is there any other gap I'm not aware?
Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.
Is there any other gap I'm not aware?
Well, you're asking me to speculate on what you're aware of, but the very first thing I see when I run "java -version" with this patch is this, so I assume you're not finished.
0x0000ffffe8ad2750: str w11, [x0, #0x14] ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.StringLatin1::replace@123 (line 427)
;; membar_release
0x0000ffffe8ad2754: dmb ishld
0x0000ffffe8ad2758: dmb ishst ;*new {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.StringLatin1::replace@116 (line 427)
;; membar_release
0x0000ffffe8ad275c: dmb ishld
0x0000ffffe8ad2760: dmb ishst ;*synchronization entry
; - java.lang.StringLatin1::replace@-1 (line 408)
Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.
Is there any other gap I'm not aware?
Well, you're asking me to speculate on what you're aware of, but the very first thing I see when I run "java -version" with this patch is this, so I assume you're not finished.
0x0000ffffe8ad2750: str w11, [x0, #0x14] ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0} ; - java.lang.StringLatin1::replace@123 (line 427) ;; membar_release 0x0000ffffe8ad2754: dmb ishld 0x0000ffffe8ad2758: dmb ishst ;*new {reexecute=0 rethrow=0 return_oop=0} ; - java.lang.StringLatin1::replace@116 (line 427) ;; membar_release 0x0000ffffe8ad275c: dmb ishld 0x0000ffffe8ad2760: dmb ishst ;*synchronization entry ; - java.lang.StringLatin1::replace@-1 (line 408)
I verified and not reproduced the error. Last week I rebase the patch to resolve a conflict with master branch. I think the error may be caused by apply the patch to old base.
Argh, I found it. It happens because C2 calls masm->offset()
from PhaseOutput::fill_buffer()
after every node is emitted. So that trick isn't going to work.
It was worth a try, but given that C2 expects offset() to be correct after every node, I think we're stuck. Maybe the last idea you had is the best possible without C2 tinkering.
Argh, I found it. It happens because C2 calls
masm->offset()
fromPhaseOutput::fill_buffer()
after every node is emitted. So that trick isn't going to work.It was worth a try, but given that C2 expects offset() to be correct after every node, I think we're stuck. Maybe the last idea you had is the best possible without C2 tinkering.
got it. I will check if we can make offset() work with fill_buffer. Or I will rollback the change of offset().
@kuaiwei this pull request can not be integrated into master
due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout recover_dmb
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Argh, I found it. It happens because C2 calls
masm->offset()
fromPhaseOutput::fill_buffer()
after every node is emitted. So that trick isn't going to work. It was worth a try, but given that C2 expects offset() to be correct after every node, I think we're stuck. Maybe the last idea you had is the best possible without C2 tinkering.got it. I will check if we can make offset() work with fill_buffer. Or I will rollback the change of offset().
I rolled back the offset() change.
@theRealAph Could you help review this PR? Thanks.
@theRealAph Could you help review this PR? Thanks.
I think we should go with your original simple patch for now. Trying to make the Assembler do the optimal thing has not turned out to be very easy, and I'm worried it's too much of a maintenance burden. Simply emitting dmb st; dmb ld
for releasing stores is enough for now.
Thank you for trying to make this work. I still have in my mind that there might be an easy way to do it, but it's looking unlikely.