jdk
jdk copied to clipboard
8360654: AArch64: Remove redundant dmb from C1 compareAndSet
AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg:
;; cmpxchg {
0x0000e708d144cf60: mov x8, x2
0x0000e708d144cf64: casal x8, x3, [x0]
0x0000e708d144cf68: cmp x8, x2
;; 0x1F1F1F1F1F1F1F1F
0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f
;; } cmpxchg
0x0000e708d144cf70: cset x8, ne // ne = any
0x0000e708d144cf74: dmb ish
According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2]
Atomically sets the value of a variable to the newValue with the memory semantics of setVolatile if the variable's current value, referred to as the witness value, == the expectedValue, as accessed with the memory semantics of getVolatile.
Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen.
Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant)
This is also reflected by C2 not having a dmb for the same respective method.
[1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...)
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8360654: AArch64: Remove redundant dmb from C1 compareAndSet (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26000/head:pull/26000
$ git checkout pull/26000
Update a local copy of the PR:
$ git checkout pull/26000
$ git pull https://git.openjdk.org/jdk.git pull/26000/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26000
View PR using the GUI difftool:
$ git pr show -t 26000
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26000.diff
Using Webrev
Hi @spchee, welcome to this OpenJDK project and thanks for contributing!
We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.
If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user spchee" as summary for the issue.
If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@spchee The following label will be automatically applied to this pull request:
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/covered
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
Webrevs
- 05: Full - Incremental (6d9030d3)
- 04: Full - Incremental (135123cb)
- 03: Full - Incremental (092c92e9)
- 02: Full - Incremental (8eb9096d)
- 01: Full - Incremental (181ce0b7)
- 00: Full (577e9a20)
I'm not sure this is safe.
In C1, if we had a stlxr that succeeded, followed by a volatile load, then we'd generate:
stlxr status, data, [addr]
cbnz status, retry
ldr r1, [something]
dmb ish
I think there's nothing to prevent the ldr from being reordered with the stlxr, violating sequential consistency. You could argue that subsequent memory operations are control dependent on the cbnz so can't be reordered, but if a microarchitecture predicts that the stlxr can never fail, the control dependency can be folded away.
Thanks for the feedback, could it be sufficient then to emit the dmb when not using LSE? And when using LSE, to not do so?
I have tested it with jcstress, although I have not done so using the -XX:-UseLSE flag.
Thanks for the feedback, could it be sufficient then to emit the dmb when not using LSE? And when using LSE, to not do so?
I don't think that LSE helps here. There's nothing to stop an unordered load moving before a casal. See atomic-ordered-before in The AArch64 Application Level Memory Model.
The best way to check this stuff is with the herd7 memory model simulator, which can find reorderings you wouldn't believe.
I can double check with the herd7 simulator, but since the casal will always produce an acquire, to me it seems impossible that a load can be moved before the casal due the acquire within the casal.
Clause 9 of before-barrier-ordering in the Arm Architecture reference manual also supports this.
Clause 9 of before-barrier-ordering in the Arm Architecture reference manual also supports this.
Which clause is that?
Hi @theRealAph,
The clause can be find here, the last bullet point on this page - https://mozilla.github.io/pdf.js/web/viewer.html?file=https://documentation-service.arm.com/static/6839d7585475b403d943b4dc#page=255&pagemode=none
Also, we have come up with two herd7 tests which should hopefully prove it to be alright.
{
x=0;
y=0;
0:X1=x; 0:X3=y;
1:X1=x; 1:X3=y;
}
P0 | P1 ;
MOV W0,#1 | MOV W0, #1 ;
MOV W2,#2 | MOV W2, #2 ;
CASAL W0, W2, [X1] | ;
LDR W4,[X3] | STR W0, [X3] ;
| DMB ISH ;
| STR W2, [X1] ;
exists
(0:X0=2 /\ 0:X4=0)
Here, the stores by P1 are happening in order: y = 1; x = 2; and the reads in P0 are happening by CASAL first - from x and then by LDR - from y. The condition constraint checks is that CASAL can't read 2 from x if LDR read 0 from y - the constraint should be fulfilled unless the reads are reordered.
And
{
x = 1; y =1;
0: X1=x; 0:X3=y;
1: X3=x; 1:X1=y;
}
P0 | P1 ;
MOV W0, #1 | MOV W0, #1;
MOV W2, #2 | MOV W2, #2;
CASAL W0, W2, [X1] | CASAL W0, W2, [X1];
LDR W4, [X3] | LDR W4, [X3];
exists
(0:X4=1 /\ 1: X4=1)
Here, both X4's being equal to 1 is disallowed, as that would indicate that one of the ldrs was reordered before the CASAL. As the CASAL's will always succeed by default meaning atleast one of the LDR's will load a non-1 value into W4. Hence (0:X4=1 /\ 1: X4=1) can only ever occur if an ldr gets ordered before the CASAL.
Hope this helps :)
On 04/07/2025 17:28, Samuel Chee wrote:
Hope this helps :)
Thanks, this looks convincing.
Please allow some time for me to do some more checking. This is a tricky area, and the the cost if we get it wrong is high.
FYI, I'm still looking at this.
It seems that the definition of barrier-ordered-before has been strengthened since this code was written. A test that I wrote a few years ago now passes on the online Herd7 simulator, where it used to fail. Back then I commented
// At the time of writing we don't know of any AArch64 hardware that
// reorders stores in this way, but the Reference Manual permits it.
... and confirmed my interpretation with the author of the Reference Manual.
I'm guessing that older AArch64 implementations still in use never did such reorderings.
With the help of Will Deacon, one of the authors of the memory model. I've now got to the bottom of this.
It is indeed a change to the MM, dating from 2002.
I agree that the DMB isn't needed here because the CASAL has both acquire and release semantics. However, I don't think that's related to the snippet of the architecture you have above but rather comes from:
// DDI0487L_b // Barrier-ordered-before (B2-255) ...
- All of the following apply:
- E1 is an Explicit Memory Write Effect and is generated by an atomic instruction with both Acquire and Release semantics.
- E1 appears in program order before E2.
- One of the following applies:
- E2 is an Explicit Memory Effect.
- E2 is an Implicit Tag Memory Read Effect.
- E2 is an MMU Fault Effect.
Which says that the release store of the CASAL is ordered before the the subsequent store to y. Note that this wouldn't work if you used CASL instead.
The full details of the MM change are here:
http://github.com/herd/herdtools7/commit/636b7163c0679c691b8cf9a04623cd3aa1cc0ec3
So, this change looks good, and we can remove trailing DMBs from most CASALs.
Mailing list message from Andrew Haley on hotspot-compiler-dev:
On 15/07/2025 12:02, Andrew Haley wrote:
It is indeed a change to the MM, dating from 2002.
2022, obvs.
Great! Glad you're convinced of its correctness.
we can remove trailing DMBs from most CASALs.
Just do bear in mind that CASAL doesn't emit release semantics if the compare fails so I imagine there might be cases where a trailing dmb might still necessary.
Great! Glad you're convinced of its correctness.
we can remove trailing DMBs from most CASALs.
Just do bear in mind that CASAL doesn't emit release semantics if the compare fails so I imagine there might be cases where a trailing dmb might still necessary.
Yes, that's what the change log in the link I posted says too. We don't care about such assumptions in Java code, but in some C++ code which assumes that CAS implies a full barrier. I don't think anyone really knows every bit of C++ code in HotSpot which does assume this, so we tend to assume the worst.
I think we still need a DMB after non-LSE CMPXCHG, which gets failures without this DMB:
AArch64 MP
{
0:X0=x; 0:X2=y;
1:X0=y; 1:X4=x;
}
P0 | P1 ;
LDAR W1,[X0] | MOV W2,#1 ;
| L0: ;
LDR W3,[X2] | LDAXR W1,[X0] ;
| STLXR W8,W2,[X0] ;
| CBNZ W8,L0;
| DMB ISH;
| MOV W3,#1 ;
| STR W3,[X4] ;
exists (0:X1=1 /\ 0:X3=0 /\ 1:X1=0)
Have just updated with change to have it still emit a dmb when LSE is not enabled. Should be good to go now hopefully :)
@spchee This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
/keepalive
@theRealAph The pull request is being re-evaluated and the inactivity timeout has been reset.
/keepalive
@ruben-arm To use the /touch command, you need to be in the OpenJDK census and your GitHub account needs to be linked with your OpenJDK username (how to associate your GitHub account with your OpenJDK username).
/keepalive
@eastig The pull request is being re-evaluated and the inactivity timeout has been reset.
Hi @theRealAph,
I've pushed changes for this PR to a new branch https://github.com/openjdk/jdk/compare/master...ruben-arm:jdk:pr-8360654 as Samuel is currently not available. Once he is back, he can update this PR's branch.
In the meanwhile, I'm planning to run more of the jcstress testing. I'd appreciate your feedback on the version in the new branch.