JDK11 OpenJ9 Gives Non-deterministic Results
Java version
openjdk version "11.0.19-internal" 2023-04-18
OpenJDK Runtime Environment (build 11.0.19-internal+0-adhoc..openj9-openjdk-jdk11)
Eclipse OpenJ9 VM (build master-f14707f, JRE 11 Linux amd64-64-Bit Compressed References 20230303_000000 (JIT enabled, AOT enabled)
OpenJ9 - f14707f
OMR - 342b647
JCL - a8cc681 based on jdk-11.0.19+4)
Javac version
javac 11.0.19-internal
Code and summary of the problem
The following test makes OpenJ9 (with above version) generate different result every time you run it.
import java.net.Socket;
class T {
int a;
long z;
void foo(boolean b, int c) {
c *= --c;
Socket t = new Socket();
String[][] q = {};
String p = "-000-0";
for (int k = 395; k < 5172; k += 1) {
z += c;
try {
if (q[a][1].equals(p)) {}
} catch (Throwable x) {
} finally {
}
}
}
void too(int i, int k) {
int j = 1;
while (++j < 104) foo(true, k);
}
public static void main(String[] g) {
T t = new T();
t.too(t.a, t.a);
System.out.println(t.z);
}
}
Results of OpenJ9: every time you run it, OpenJ9 gives a different result
> ./OpenJ9/jdk11/bin/java T
277066
> ./OpenJ9/jdk11/bin/java T
291397
> ./OpenJ9/jdk11/bin/java T
286620
> ./OpenJ9/jdk11/bin/java T
296174
The correct answer is 0:
> ./OpenJ9/jdk11/bin/java -Xint T
296174
> ./HotSpot/jdk11u/bin/java T
0
Diagnosis
I have reported a similar bug in #15306, but I reported this bug in a new thread because (1) this one looks different from 15306, (2) 15306 is marked as fixed, and (3) this one is generated in a very recent development trunk.
I have tried a little bit, and noticed that the non-determinisn starts from optlevel=hot with method T.foo(). See below
> ./OpenJ9/jdk11/bin/java '-Xjit:exclude={T.foo(ZI)V}' T
0
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=noopt)' T
0
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=warm)' T
0
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot)' T
458592
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=veryhot)' T
453815
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=scorching)' T
463369
@0xdaryl @hzongaro fyi
@connglli, I haven't been able to reproduce the failure. I'll keep trying, but in case I am unable, may I ask you to try producing a log file with the following command invocation?
./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,traceFull,log=issue17033.log)' T
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,traceFull,log=issue17033.log)' T
210188
@hzongaro See issue17033.tar.gz
@connglli, thank you for uploading the log file. May I ask you to try reproducing the failure with this invocation? You can try it with a variety of optLevel settings, if you have the opportunity.
/OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optLevel=hot,disableDynamicLoopTransfer)' T
I took the latest build and I'm able to reproduce it often, although I couldn't with 11.0.18.
I can reproduce it with '-Xjit:{T.foo(ZI)V}(optLevel=scorching,disableDynamicLoopTransfer)' and '-Xjit:{T.foo(ZI)V}(optLevel=hot,disableDynamicLoopTransfer)'. It seems to stop when I add -Xnoaot.
I destroyed my shared cache for jdk-11.0.18, and on the 9th run the problem started to occur.
@hzongaro See the following logs. Hope they can help!
For optlevel=hot: see issue17033.hot.2223504.66143.20230329.130223.2223504.log
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,disableDynamicLoopTransfer,traceFull,log=issue17033.hot)' T
353498
For optlevel=veryhot: see issue17033.veryhot.2223590.66155.20230329.130235.2223590.log
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=veryhot,disableDynamicLoopTransfer,traceFull,log=issue17033.veryhot)' T
176749
For optlevel=scorching: see issue17033.scorching.2223674.66166.20230329.130246.2223674.log
> ./OpenJ9/jdk11/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=scorching,disableDynamicLoopTransfer,traceFull,log=issue17033.scorching)' T
324836
Thanks, Peter @pshipton . I was eventually able to reproduce the failure. I've found that once the failure occurs, I can reproduce it with lastOptIndex=206, but not with lastOptIndex=205:
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=205)' T
0
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=206)' T
458592
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=205)' T
0
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,lastOptIndex=206)' T
458592
That appears to correspond to Store Sinking:
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot,disableStoreSinking)' T
0
$ ./jdk/bin/java '-Xjit:{T.foo(ZI)V}(optlevel=hot)' T
463369
I'll continue to investigate
Hi Henry @hzongaro, could you please explain the lastOptIndex to me?
Hi, @connglli. The lastOptIndex option is short for "last optimization index" - it's used in attempting to narrow down which optimization might be causing a failure. No optimizations that are not absolutely required will run after the optimization that corresponds to that particular optimization number. As each optimization in a compilation is attempted, the optimization index is incremented, so it isn't immediately obvious which optimization 206 might be. I had to check a trace log to determine it was Store Sinking in this case.
You can read more about it in the Problem Determination guide. It's not something that should be relied upon outside of the context of investigating failures.
Looking at the transformations that are happening in Store Sinking, beforehand we see this in the IL:
n26n treetop [0x7f129b005240] bci=[-1,1,8] rc=0 vc=2376 vn=- li=- udi=- nc=1
n21n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f129b0050b0] bci=[-1,0,8] rc=2 vc=2376 vn=- li=50 udi=52 nc=0 flg=0x1000
n25n istore <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] [0x7f129b0051f0] bci=[-1,1,8] rc=0 vc=2376 vn=- li=14 udi=1 nc=1
n24n iadd [0x7f129b0051a0] bci=[-1,1,8] rc=1 vc=2376 vn=- li=- udi=- nc=2
n22n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f129b005100] bci=[-1,1,8] rc=1 vc=2376 vn=- li=51 udi=53 nc=0 flg=0x1000
n23n iconst -1 (X!=0 X<=0 ) [0x7f129b005150] bci=[-1,1,8] rc=1 vc=2376 vn=- li=- udi=- nc=0 flg=0x204
n29n istore <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] [0x7f129b005330] bci=[-1,6,8] rc=0 vc=2376 vn=- li=15 udi=2 nc=1
n28n imul [0x7f129b0052e0] bci=[-1,5,8] rc=1 vc=2376 vn=- li=- udi=- nc=2
n21n ==>iload
n27n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f129b005290] bci=[-1,4,8] rc=1 vc=2376 vn=- li=52 udi=54 nc=0 flg=0x1000
That corresponds to this assignment at the start of T.foo:
c *= --c;
After Store Sinking, we see this - the istore at node n29n has been replaced with a treetop
n26n treetop [0x7f129b005240] bci=[-1,1,8] rc=0 vc=2585 vn=- li=- udi=- nc=1
n21n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f129b0050b0] bci=[-1,0,8] rc=2 vc=2585 vn=- li=- udi=52 nc=0 flg=0x1000
n25n istore <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] [0x7f129b0051f0] bci=[-1,1,8] rc=0 vc=2584 vn=- li=- udi=1 nc=1
n24n iadd [0x7f129b0051a0] bci=[-1,1,8] rc=1 vc=2584 vn=- li=- udi=- nc=2
n22n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f129b005100] bci=[-1,1,8] rc=1 vc=2584 vn=- li=- udi=53 nc=0 flg=0x1000
n23n iconst -1 (X!=0 X<=0 ) [0x7f129b005150] bci=[-1,1,8] rc=1 vc=2584 vn=- li=- udi=- nc=0 flg=0x204
n29n treetop <<< [0x7f129b005330] bci=[-1,6,8] rc=0 vc=2583 vn=- li=- udi=2 nc=1
n28n imul [0x7f129b0052e0] bci=[-1,5,8] rc=1 vc=2583 vn=- li=- udi=- nc=2
n21n ==>iload
n27n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f129b005290] bci=[-1,4,8] rc=1 vc=2583 vn=- li=- udi=54 nc=0 flg=0x1000
Digging into things a bit more, the problem lies with the way Store Sinking has propagated the istore from node n29n to later blocks. Before Store Sinking, the two stores to #419 appear in block_126, which is highlighted in the attached PDF file cfg.prestoresink.issue17033.pdf. After Store Sinking, the second store to #419 has been propagated to blocks 121, 123, 125, 169 and 170, which are highlighted in this second attached PDF file cfg.poststoresink.issue17033.pdf
Blocks 169 and 170 are predecessors of the loop that appears in the method. The problem is that the stores that have been propagated to those later blocks look like this:
n3714n istore <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] [0x7f70175f78a0] bci=[-1,6,8] rc=0 vc=2583 vn=- li=- udi=2 nc=1
n3715n imul [0x7f70175f78f0] bci=[-1,5,8] rc=1 vc=2583 vn=- li=- udi=- nc=2
n3716n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f70175f7940] bci=[-1,0,8] rc=1 vc=2585 vn=- li=- udi=52 nc=0 flg=0x1000
n3717n iload <parm 2 I>[#419 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [0x7f70175f7990] bci=[-1,4,8] rc=1 vc=2583 vn=- li=- udi=54 nc=0 flg=0x1000
That is, they multiply by itself the value of #419 that was stored by node n25n above, whereas the original multiplication at node n28n used a commoned node n21n as one operand and the result of the store from n25n as the other operand.
@hzongaro : should this still be targeted to 0.40? Please advise.
@0xdaryl, I haven't been able to spend time on it. Although Peter mentioned above that he wasn't able to reproduce the failure with 11.0.18, I don't see any changes in Store Sinking that would indicate that this is a recent regression. I think it's safe to defer it beyond 0.40.
Thanks. Moving to 0.41 (I mis-spoke about the release in my earlier comment).
No further progress - moving to the 0.43 release.
Following is a slightly modified version of the test case with which I can more reliably reproduce the failure:
import java.net.Socket;
class Issue17033c {
int a = 1;
long z;
void foo(boolean b, int c, int d, int e, int f, int g, int h) {
c += e + h + (e = c + d) + (h = f + g) + e + h;
Socket t = new Socket();
String[][] q = {};
String p = "-000-0";
if (b) {
for (int k = 395; k < 5172; k += 1) {
z += c;
try {
if (q[a][1].equals(p)) {}
} catch (Throwable thr) {
} finally {
}
}
}
}
void too(int i, int b) {
int j = 1;
while (++j < 104) foo(true, i, i, i, i, i, i);
}
public static void main(String[] g) {
Issue17033c t = new Issue17033c();
t.too(1, 1);
System.out.println(t.z);
}
}
$ java -Xint Issue17033c
5359794
$ java -Xjit:{Issue17033c.foo*}\(optLevel=hot,optDetails,traceGeneralStoreSinking,traceLiveness,log=issue17033c.log\) -XX:-EnableHCR Issue17033c
6095452
Before general store sinking:
n29n treetop [ 0x125d60fc0] bci=[-1,10,8] rc=0 vc=1500 vn=- li=- udi=- nc=1
n24n iadd [ 0x125d60e30] bci=[-1,5,8] rc=2 vc=1500 vn=- li=- udi=- nc=2
n22n iload <parm 4 I>[#353 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d60d90] bci=[-1,1,8] rc=1 vc=1500 vn=- li=19 udi=13 nc=0 flg=0x1000
n23n iload <parm 7 I>[#356 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d60de0] bci=[-1,3,8] rc=1 vc=1500 vn=- li=20 udi=14 nc=0 flg=0x1000
n28n istore <parm 4 I>[#353 Parm] [flags 0x40000103 0x0 ] [ 0x125d60f70] bci=[-1,10,8] rc=0 vc=1500 vn=- li=9 udi=1 nc=1
n27n iadd [ 0x125d60f20] bci=[-1,8,8] rc=2 vc=1500 vn=- li=- udi=- nc=2
n25n iload <parm 2 I>[#351 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d60e80] bci=[-1,6,8] rc=1 vc=1500 vn=- li=21 udi=15 nc=0 flg=0x1000
n26n iload <parm 3 I>[#352 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d60ed0] bci=[-1,7,8] rc=1 vc=1500 vn=- li=- udi=16 nc=0 flg=0x1000
n34n istore <parm 7 I>[#356 Parm] [flags 0x40000103 0x0 ] [ 0x125d61150] bci=[-1,19,8] rc=0 vc=1500 vn=- li=10 udi=2 nc=1
n33n iadd [ 0x125d61100] bci=[-1,17,8] rc=2 vc=1500 vn=- li=- udi=- nc=2
n31n iload <parm 5 I>[#354 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d61060] bci=[-1,13,8] rc=1 vc=1500 vn=- li=- udi=17 nc=0 flg=0x1000
n32n iload <parm 6 I>[#355 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d610b0] bci=[-1,15,8] rc=1 vc=1500 vn=- li=- udi=18 nc=0 flg=0x1000
n42n istore <parm 2 I>[#351 Parm] [flags 0x40000103 0x0 ] [ 0x125d613d0] bci=[-1,29,8] rc=0 vc=1500 vn=- li=11 udi=3 nc=1
n41n iadd [ 0x125d61380] bci=[-1,28,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n40n iadd [ 0x125d61330] bci=[-1,27,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n38n iadd [ 0x125d61290] bci=[-1,24,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n36n iadd [ 0x125d611f0] bci=[-1,21,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n30n iadd [ 0x125d61010] bci=[-1,12,8] rc=1 vc=1500 vn=- li=- udi=- nc=2
n27n ==>iadd
n24n ==>iadd
n33n ==>iadd
n37n iload <parm 4 I>[#353 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d61240] bci=[-1,22,8] rc=1 vc=1500 vn=- li=22 udi=19 nc=0 flg=0x1000
n39n iload <parm 7 I>[#356 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d612e0] bci=[-1,25,8] rc=1 vc=1500 vn=- li=23 udi=20 nc=0 flg=0x1000
n21n iload <parm 2 I>[#351 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x125d60d40] bci=[-1,0,8] rc=1 vc=1500 vn=- li=24 udi=21 nc=0 flg=0x1000
After general store sinking:
n1703n istore <parm 2 I>[#351 Parm] [flags 0x40000103 0x0 ] [ 0x1260fc410] bci=[-1,29,8] rc=0 vc=1593 vn=- li=- udi=3 nc=1
n1704n iadd [ 0x1260fc460] bci=[-1,28,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1705n iadd [ 0x1260fc4b0] bci=[-1,27,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1706n iadd [ 0x1260fc500] bci=[-1,24,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1707n iadd [ 0x1260fc550] bci=[-1,21,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1708n iadd [ 0x1260fc5a0] bci=[-1,12,8] rc=1 vc=1593 vn=- li=- udi=- nc=2
n1709n iadd [ 0x1260fc5f0] bci=[-1,8,8] rc=1 vc=1595 vn=- li=- udi=- nc=2
n1710n iload <parm 2 I>[#351 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc640] bci=[-1,6,8] rc=1 vc=1595 vn=- li=- udi=15 nc=0 flg=0x1000
n1711n iload <parm 3 I>[#352 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc690] bci=[-1,7,8] rc=1 vc=1595 vn=- li=- udi=16 nc=0 flg=0x1000
n1712n iadd [ 0x1260fc6e0] bci=[-1,5,8] rc=1 vc=1596 vn=- li=- udi=- nc=2
n1713n iload <parm 4 I>[#353 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc730] bci=[-1,1,8] rc=1 vc=1596 vn=- li=- udi=13 nc=0 flg=0x1000
n1714n iload <parm 7 I>[#356 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc780] bci=[-1,3,8] rc=1 vc=1596 vn=- li=- udi=14 nc=0 flg=0x1000
n1715n iadd [ 0x1260fc7d0] bci=[-1,17,8] rc=1 vc=1594 vn=- li=- udi=- nc=2
n1716n iload <parm 5 I>[#354 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc820] bci=[-1,13,8] rc=1 vc=1594 vn=- li=- udi=17 nc=0 flg=0x1000
n1717n iload <parm 6 I>[#355 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc870] bci=[-1,15,8] rc=1 vc=1594 vn=- li=- udi=18 nc=0 flg=0x1000
n1718n iload <parm 4 I>[#353 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc8c0] bci=[-1,22,8] rc=1 vc=1593 vn=- li=- udi=19 nc=0 flg=0x1000
n1719n iload <parm 7 I>[#356 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc910] bci=[-1,25,8] rc=1 vc=1593 vn=- li=- udi=20 nc=0 flg=0x1000
n1720n iload <parm 2 I>[#351 Parm] [flags 0x40000103 0x0 ] (cannotOverflow ) [ 0x1260fc960] bci=[-1,0,8] rc=1 vc=1593 vn=- li=- udi=21 nc=0 flg=0x1000
I think what's happening is that general store sinking doesn't take into account the possibility that a store to a local might occur between the point at which a commoned load first appears and the candidate for store sinking where the commoned load is actually used.
In this example, n24n uses #353 and #356. After that, there are stores to those two local symbols at n28n and n34n, respectively. Finally, n24n is used in the tree of the candidate for store sinking at n42n.
After store sinking, n28n and n34n remain in their original block, and all the references to #353 and #356 in n1703n use the values loaded from those symbols after they've been stored to.
I have a fix in progress for this that I hope to complete for the 0.44 release.
I haven't managed to progress my prototype fix for this problem, but I'd like to try to get this fixed. Moving it to the 0.48 release.
I'm going to push this out to the "Future" release for now, but I have marked it as high priority, as I have a fix that I have prototyped.