NullPointerException from non-null local variable in C2 compiled code
Please provide a brief summary of the bug
We are seeing NullPointerExceptions thrown while trying to invoke a method on a local variable that is guaranteed to be non-null. The local variable is only assigned in one place in this method, from a new/dup combo.
I am reasonably confident that it's not a hardware problem as our telemetry shows it coming from > 700 individual users within the last 24 hours. Most of these users have no other telemetry events, including VM crashes and other errors.
I used a compiler directive to exclude the failing methods from C2 compilation and the exceptions went away almost immediately, which is why I believe this to be a C2 problem.
Unfortunately, I have not come up with a functioning reproducer for this and it only happens rarely on users' machines, so what I can reasonably collect is somewhat limited. I have tried repeatedly loading the class in a fresh class loader and exercising it with randomly selected inputs to no avail.
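For readers who want to attempt a similar reproducer, the approach described above might look roughly like the sketch below. It is only an illustration under stated assumptions, not the harness actually used: `ReproHarness`, `loadFresh`, and `countNpes` are hypothetical names, and the lambda in `main` is a stand-in for reflectively invoking the real `([B)` constructor on a class obtained from `loadFresh`.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Random;
import java.util.function.Consumer;

public class ReproHarness {
    // Loads className from jarUrl in a brand-new class loader, so each round
    // starts from a freshly defined class rather than already-compiled methods.
    // The loader is intentionally left open so the class can keep resolving dependencies.
    static Class<?> loadFresh(URL jarUrl, String className) throws ClassNotFoundException {
        URLClassLoader loader =
                new URLClassLoader(new URL[] {jarUrl}, ClassLoader.getPlatformClassLoader());
        return Class.forName(className, true, loader);
    }

    // Calls target with randomly selected inputs and returns how many calls threw an NPE.
    static int countNpes(Consumer<byte[]> target, byte[][] inputs, int iterations, long seed) {
        Random rng = new Random(seed);
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] input = inputs[rng.nextInt(inputs.length)];
            try {
                target.accept(input);
            } catch (NullPointerException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Placeholder inputs; a real run would use captured model data.
        byte[][] inputs = { {1, 2, 3}, {4, 5}, {6} };
        // Stand-in target (assumption): replace with a reflective call to the ([B) constructor.
        int failures = countNpes(b -> { int n = b.length; }, inputs, 200_000, 42L);
        System.out.println("NPEs observed: " + failures);
    }
}
```

The iteration count is chosen only so that the hot path is likely to be JIT-compiled during the run; whether C2 actually compiles it depends on the JVM's tiering thresholds and flags.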
According to our telemetry, this problem is exclusive to Windows 11 with a 64-bit JVM, which is only ~40% of our userbase. The vast majority of our users are on various versions of Adoptium 11, and this is where I see most of our errors from. I have seen a few cases on Java 17 from various vendors as well, but this is a very uncommon configuration.
The relevant classes are in this jar, methods jk.gk(Ljk;[B)V and jk.au([B)V, both of which are only called by the ([B) constructor. I can compile a dataset with all of the possible inputs to this method if useful.
The exceptions look like this. This one is from Java 17 for clarity; most of them are from Java 11 and thus lack the detail message.
java.lang.NullPointerException: Cannot invoke "vy.dj(int)" because "<local4>" is null
at jk.au(jk.java:264)
at jk.<init>(jk.java:16)
at jk.ab(jk.java:60)
...
Did you test with the latest update version?
- [x] Yes
Please provide steps to reproduce where possible
No response
Expected Results
Method to not throw.
Actual Results
Method rarely throws.
What Java Version are you using?
11.0.26
What is your operating system and platform?
Windows 11 amd64
How did you install Java?
No response
Did it work before?
Did you test with other Java versions?
Eclipse Adoptium 11.0.16.1
Eclipse Adoptium 11.0.18
Eclipse Adoptium 11.0.19
Eclipse Adoptium 11.0.20
Eclipse Adoptium 11.0.21
Eclipse Adoptium 11.0.22
Eclipse Adoptium 11.0.24
Eclipse Adoptium 11.0.25
Eclipse Adoptium 11.0.26
Oracle Corporation 17.0.12
OpenLogic 17.0.13
Eclipse Adoptium 17.0.13
Relevant log output
Which version of Java 17 is this crashing in? Does it not crash in the other tested versions you mentioned above?
I have seen crashes in our telemetry for all of the versions I listed as tested. I don't have enough data from Java 17 on Windows to say if any versions are working.
Hmm, I just saw above that it's only on Windows x86-64? It's very odd to get an NPE that's platform specific (usually you'd get a JVM crash with a crash log). Are you able to share the source code and context here?
Yeah, I agree it's very odd. Not only is it Windows-specific, it seems to be Windows 11-specific.
These methods deserialize 3D models out of a byte array into their in-memory representation. They tend to be called in large bursts when new areas of the map are loaded, then in smaller bursts for players/NPCs/interface components. I have seen crashes in both situations. I have attached a deobfuscated, decompiled, and trimmed-down "source" version of the methods. I can't share the actual source, and it is run through a proprietary obfuscator and then through a second bytecode patching pass, so the actual bytecode in the class is different from what javac would produce anyway. Fortunately these methods are relatively light on obfuscation, really only having extra gotos inserted. It is easy to verify with javap or the ASM Textifier that there are no other assignments to the problematic local variables that don't appear in the "source".
au is decodeType3 and gk is decodeType2. The other 2 decoders are effectively dead code. The stack trace in the OP lines up with line 239 in this file. I have seen crashes on other lines as well; I could go through and match them up if that would be useful.
ModelData.java.txt
Perhaps it's also useful to add that the decoders are only called by the constructor, and the constructor is only called by this:
public static ModelData loadModelData(IndexDataBase var0, int var1, int var2) {
    byte[] var3 = var0.loadData(var1, var2);
    return var3 == null ? null : new ModelData(var3);
}
So the constructor and decoders can never be passed a null, which means this is not a real NPE being misattributed to an incorrect location.
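One way to make that distinction visible in telemetry, sketched here with hypothetical stand-in classes (`NullGuardSketch` and `ModelDataSketch` are not the real code), is an explicit `Objects.requireNonNull` at constructor entry. If a null array ever did reach the constructor, the resulting NPE would carry its own message, even on Java 11, and could not be confused with the NPE reported on the local variable.

```java
import java.util.Objects;

public class NullGuardSketch {
    // Hypothetical stand-in for the real ModelData class; only the null handling is modeled.
    static final class ModelDataSketch {
        final int length;

        ModelDataSketch(byte[] data) {
            // Fails fast with a recognizable message if a null ever reaches the constructor.
            Objects.requireNonNull(data, "model bytes were null at constructor entry");
            this.length = data.length;
        }
    }

    // Mirrors the shape of loadModelData: a null input short-circuits before the constructor runs.
    static ModelDataSketch load(byte[] raw) {
        return raw == null ? null : new ModelDataSketch(raw);
    }

    public static void main(String[] args) {
        System.out.println(load(null));               // constructor is never invoked
        System.out.println(load(new byte[3]).length); // constructor runs on a non-null array
    }
}
```

This does not address the miscompilation itself; it only rules a null argument in or out as the source of a given report.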
Is this for Minecraft? Do you have the LWJGL and graphics driver versions associated with these crash reports?
No, this is not Minecraft or Minecraft-adjacent. This is in the closed-source portion of RuneLite. We do use LWJGL as well though; we currently ship 3.3.2. We don't associate graphics driver versions with these reports, since driver bugs generally result in either a full JVM crash or non-fatal failures that wouldn't show up in telemetry anyway. I have not seen an elevated rate of JVM crashes from users that have exhibited this exception.
Hmm, I've got no good explanation for this. If no native code is being invoked but it's only happening on Windows 11, then this is pretty odd.
I'd recommend posting this issue to https://mail.openjdk.org/mailman/listinfo/hotspot-runtime-dev and asking which extra debug flags they would recommend to help narrow this down.
We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.