jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8325821: [REDO] use "dmb.ishst+dmb.ishld" for release barrier

Open kuaiwei opened this issue 11 months ago • 31 comments

The origin patch for https://bugs.openjdk.org/browse/JDK-8324186 has 2 issues: 1 It show regression in some platform, like Apple silicon in mac os 2 Can not handle instruction sequence like "dmb.ishld; dmb.ishst; dmb.ishld; dmb.ishld"

It can be fixed by: 1 Enable AlwaysMergeDMB by default, only disable it in architecture we can see performance improvement (N1 or N2) 2 Check the special pattern and merge the subsequent dmb.

It also fix a bug when code buffer is expanding, st/ld/dmb can not be merged. I added unit tests for these.

This patch still has a unhandled case. Insts like "dmb.ishld; dmb.ishst; dmb.ish", it will merge the last 2 instructions and can not merge all three. Because when emitting dmb.ish, if merge all previous dmbs, the code buffer will shrink the size. I think it may break some resumption and think it's not a common pattern.

  • Update: After discussion, I made a new implementation based on finite state machine for merging instruction. The mergeable instruction will be pending in fsm until next unmergeable instruction.

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8325821: [REDO] use "dmb.ishst+dmb.ishld" for release barrier (Enhancement - P4)

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18467/head:pull/18467
$ git checkout pull/18467

Update a local copy of the PR:
$ git checkout pull/18467
$ git pull https://git.openjdk.org/jdk.git pull/18467/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18467

View PR using the GUI difftool:
$ git pr show -t 18467

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18467.diff

Webrev

Link to Webrev Comment

kuaiwei avatar Mar 25 '24 06:03 kuaiwei

:wave: Welcome back kuaiwei! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Mar 25 '24 06:03 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Mar 25 '24 06:03 openjdk[bot]

@kuaiwei The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Mar 25 '24 06:03 openjdk[bot]

Are there any possible sequences of dmb ld; dmb st that will not be minimized?

theRealAph avatar Mar 25 '24 08:03 theRealAph

Are there any possible sequences of dmb ld; dmb st that will not be minimized?

The worst case is "dmb ld; dmb st; dmb ish", that will be merged as "dmb ld; dmb ish".

kuaiwei avatar Mar 25 '24 09:03 kuaiwei

Are there any possible sequences of dmb ld; dmb st that will not be minimized?

The worst case is "dmb ld; dmb st; dmb ish", that will be merged as "dmb ld; dmb ish".

That's not too bad, I suppose, but to me it feels a little like unfinished business.

If I were you I'd accumulate DMBs as 3 bits (ld, st, and sy) and emit the Right Thing on the first non-DMB instruction. There's the interesting possibility that a DMB might be emitted as the last instruction before the end of the lifetime of an assembler. In that case I guess I'd do some experiments to find out what worked.

Also consider a queue? We already accumulate bits of an instruction in Instruction_aarch64::bits. Maybe that could be elaborated to defer DMB instructions?

theRealAph avatar Mar 25 '24 15:03 theRealAph

Also consider a queue? We already accumulate bits of an instruction in Instruction_aarch64::bits. Maybe that could be elaborated to defer DMB instructions?

Or maybe not if that's too complicated. If there's no way to minimize the sequence without a lot of work, I can live with it.

Thanks!

theRealAph avatar Mar 25 '24 15:03 theRealAph

Thinking some more, I've had a better idea. Consider a finite state machine. The states would perhaps be {pending_none, pending_ld, pending_st, pending_ldst}. Every MacroAssembler::emit(), along with bind(), advances the machine, and emits instructions on some state changes. This state machine would be used instead of code()->last_insn()

theRealAph avatar Mar 25 '24 16:03 theRealAph

Hi Andrew, thanks for your suggestions. I will check them and refine the implementation. The difficult part is we need finalize the "dmb" at a non-dmb instruction. It looks Instruction_aarch64 is a good choice because it's embeded in all instructions.

Thanks

kuaiwei avatar Mar 26 '24 02:03 kuaiwei

update: I added finite state machine for merging instruction. The patch can be found in https://github.com/openjdk/jdk/commit/1b18e8298b1ef8778b494fb7ed9e4467e0a9a6b8 . Because instructions are pending, I need modify Assembler::pc() and offset() to count the pending instruction size. It may impact relocation. I fixed some failure and still some test failure to be figure out. I'm working on them.

kuaiwei avatar Apr 02 '24 02:04 kuaiwei

That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip

theRealAph avatar Apr 08 '24 17:04 theRealAph

So I had another thought. I think it'd be easier, and more robust, if you kept the contents of the code buffer valid at all times, and transition the state machine to its initial state whenever there is any attempt to get CodeBuffer::offset(). That would also minimize the impact of this patch on the rest of the code. You can always back up when you see something like dmb st; dmb ld; dmb ish.

theRealAph avatar Apr 09 '24 06:04 theRealAph

That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip

Thanks for your enhancement. I found you removed MergeableInst. My idea is the finite state machine could support both dmb and ld/st instructions. I created another task to merge more instruction like 'ldrs/ldrd/strs/strd ...' . https://bugs.openjdk.org/browse/JDK-8329901 . The old implementation is not suitable to support new instruction. I need duplicate to support other register type like float register. So I want to use MergeableInst to extract instruction information. Such as target register, base register, offseet ... . It's my draft plan, how do you think about it?

kuaiwei avatar Apr 09 '24 07:04 kuaiwei

That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip

Thanks for your enhancement. I found you removed MergeableInst. My idea is the finite state machine could support both dmb and ld/st instructions. I created another task to merge more instruction like 'ldrs/ldrd/strs/strd ...' . https://bugs.openjdk.org/browse/JDK-8329901 . The old implementation is not suitable to support new instruction. I need duplicate to support other register type like float register. So I want to use MergeableInst to extract instruction information. Such as target register, base register, offseet ... . It's my draft plan, how do you think about it?

I think you may have scaling problems with this. If you try to handle multiple kinds of instructions in a single state machine, it is likely to explode in size, and be (even more) confusing to the reader.

theRealAph avatar Apr 09 '24 07:04 theRealAph

That doesn't look bad. I've made a bunch of simplifications for you to see: have a look. In general, pointer chasing is bad, so it's worth getting rid of indirections that don't help. foo.zip

Thanks for your enhancement. I found you removed MergeableInst. My idea is the finite state machine could support both dmb and ld/st instructions. I created another task to merge more instruction like 'ldrs/ldrd/strs/strd ...' . https://bugs.openjdk.org/browse/JDK-8329901 . The old implementation is not suitable to support new instruction. I need duplicate to support other register type like float register. So I want to use MergeableInst to extract instruction information. Such as target register, base register, offseet ... . It's my draft plan, how do you think about it?

I think you may have scaling problems with this. If you try to handle multiple kinds of instructions in a single state machine, it is likely to explode in size, and be (even more) confusing to the reader.

I'm fine with it. So the state machine is only used for dmb. We can figure out how to simplify other instructions later. About CodeBuffer::offset(), does you mean Assembler::offset() ? It worth a try, we may remove the 'const' from its definition and its derived methods.

kuaiwei avatar Apr 09 '24 07:04 kuaiwei

About CodeBuffer::offset(), does you mean Assembler::offset() ? It worth a try, we may remove the 'const' from its definition and its derived methods.

I think so, yes. It's a little bit odd that simply reading an offset has a side effect, but it affects only the state machine, not any of the contents of the buffer.

theRealAph avatar Apr 09 '24 07:04 theRealAph

/contributor add @theRealAph

kuaiwei avatar Apr 10 '24 08:04 kuaiwei

@kuaiwei Contributor Andrew Haley <[email protected]> successfully added.

openjdk[bot] avatar Apr 10 '24 08:04 openjdk[bot]

@kuaiwei Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

openjdk[bot] avatar Apr 12 '24 05:04 openjdk[bot]

Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.

theRealAph avatar Apr 15 '24 17:04 theRealAph

Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.

Is there any other gap I'm not aware?

kuaiwei avatar Apr 16 '24 02:04 kuaiwei

Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.

Is there any other gap I'm not aware?

Well, you're asking me to speculate on what you're aware of, but the very first thing I see when I run "java -version" with this patch is this, so I assume you're not finished.

  0x0000ffffe8ad2750:   str		w11, [x0, #0x14]    ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.StringLatin1::replace@123 (line 427)
 ;; membar_release
  0x0000ffffe8ad2754:   dmb		ishld
  0x0000ffffe8ad2758:   dmb		ishst               ;*new {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.StringLatin1::replace@116 (line 427)
 ;; membar_release
  0x0000ffffe8ad275c:   dmb		ishld
  0x0000ffffe8ad2760:   dmb		ishst               ;*synchronization entry
                                                            ; - java.lang.StringLatin1::replace@-1 (line 408)

theRealAph avatar Apr 16 '24 09:04 theRealAph

Hi, I guess this isn't quite ready for review let. I'll have another look whan it is.

Is there any other gap I'm not aware?

Well, you're asking me to speculate on what you're aware of, but the very first thing I see when I run "java -version" with this patch is this, so I assume you're not finished.

  0x0000ffffe8ad2750:   str		w11, [x0, #0x14]    ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.StringLatin1::replace@123 (line 427)
 ;; membar_release
  0x0000ffffe8ad2754:   dmb		ishld
  0x0000ffffe8ad2758:   dmb		ishst               ;*new {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.StringLatin1::replace@116 (line 427)
 ;; membar_release
  0x0000ffffe8ad275c:   dmb		ishld
  0x0000ffffe8ad2760:   dmb		ishst               ;*synchronization entry
                                                            ; - java.lang.StringLatin1::replace@-1 (line 408)

I verified and not reproduced the error. Last week I rebase the patch to resolve a conflict with master branch. I think the error may be caused by apply the patch to old base.

kuaiwei avatar Apr 16 '24 12:04 kuaiwei

Argh, I found it. It happens because C2 calls masm->offset() from PhaseOutput::fill_buffer() after every node is emitted. So that trick isn't going to work.

It was worth a try, but given that C2 expects offset() to be correct after every node, I think we're stuck. Maybe the last idea you had is the best possible without C2 tinkering.

theRealAph avatar Apr 16 '24 14:04 theRealAph

Argh, I found it. It happens because C2 calls masm->offset() from PhaseOutput::fill_buffer() after every node is emitted. So that trick isn't going to work.

It was worth a try, but given that C2 expects offset() to be correct after every node, I think we're stuck. Maybe the last idea you had is the best possible without C2 tinkering.

got it. I will check if we can make offset() work with fill_buffer. Or I will rollback the change of offset().

kuaiwei avatar Apr 17 '24 02:04 kuaiwei

@kuaiwei this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout recover_dmb
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Apr 22 '24 08:04 openjdk[bot]

Argh, I found it. It happens because C2 calls masm->offset() from PhaseOutput::fill_buffer() after every node is emitted. So that trick isn't going to work. It was worth a try, but given that C2 expects offset() to be correct after every node, I think we're stuck. Maybe the last idea you had is the best possible without C2 tinkering.

got it. I will check if we can make offset() work with fill_buffer. Or I will rollback the change of offset().

I rolled back the offset() change.

kuaiwei avatar Apr 22 '24 11:04 kuaiwei

@theRealAph Could you help review this PR? Thanks.

kuaiwei avatar Apr 29 '24 02:04 kuaiwei

@theRealAph Could you help review this PR? Thanks.

I think we should go with your original simple patch for now. Trying to make the Assembler do the optimal thing has not turned out to be very easy, and I'm worried it's too much of a maintenance burden. Simply emitting dmb st; dmb ld for releasing stores is enough for now.

Thank you for trying to make this work. I still have in my mind that there might be an easy way to do it, but it's looking unlikely.

theRealAph avatar May 10 '24 08:05 theRealAph