
SIGSEGV in PhaseIdealLoop::build_loop_late_post_work

Open tivervac opened this issue 3 years ago • 26 comments

Summary

We run an Eclipse-based product obfuscated using ZKM. Running our tests on CI has been causing frequent SIGSEGVs.

Steps to reproduce

The error is "rare" (usually one in 10 builds; on some of our branches it's 3/4, on others 1/20).

See our hs_err_pid, replay_pid and core dump (2.3 GB)

We're determined to help you help us. If there's anything more we can do, please let us know. We're trying to minimize this to a reproducible example, but that will take time, and definitely won't be easy due to the extreme flakiness of the failure.

Expected results

No crash

Actual results

Random SIGSEGVs, likely heavily influenced by code layout and timings.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff7086a270b, pid=27153, tid=27169
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (17.0.3+7, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xac870b]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0x13b
#
# Core dump will be written. Default location: /home/jenkins/agent/workspace/line_8382-speed-up-product-build/com.sigasi.hdt.vhdl.test.projects/core.27153
#
# An error report file with more information is saved as:
# /home/jenkins/agent/workspace/line_8382-speed-up-product-build/com.sigasi.hdt.vhdl.test.projects/hs_err_pid27153.log
#
# Compiler replay data is saved as:
# /home/jenkins/agent/workspace/line_8382-speed-up-product-build/com.sigasi.hdt.vhdl.test.projects/replay_pid27153.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

Triaging info

Java version:

OpenJDK Runtime Environment Temurin-17.0.3+7 (17.0.3+7) (build 17.0.3+7)

What is your operating system and platform?

Amazon Linux release 2 (Karoo) on x86-64

How did you install Java?

Binary archive, tar.gz.

Did it work before?

We've been having this issue for months. Since it depends on timings and code layout, there are periods in which we have many failures, then weeks without any.

Did you test with other Java versions?

We've been having this since before Java 17. We haven't tried other VMs such as Graal or OpenJ9.

We've been faithfully upgrading to the latest Temurin version since Java 11, through 12, 13, 14, 15 and now 17 (we skipped 16).

tivervac avatar Jun 21 '22 14:06 tivervac

could be https://bugs.openjdk.org/browse/JDK-8283386

jerboaa avatar Jun 21 '22 15:06 jerboaa

We use neither Lucene nor JavaFX, but that is the only link I found to PhaseIdealLoop::build_loop_late_post_work as well.

tivervac avatar Jun 21 '22 15:06 tivervac

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

github-actions[bot] avatar Sep 21 '22 01:09 github-actions[bot]

In the meantime, the linked bug has been split in two. The first appears irrelevant to us: we still encounter the issue with JustJ 17.0.4. The second is probably the one we need fixed.

tivervac avatar Sep 23 '22 13:09 tivervac

17.0.4.1 should have https://bugs.openjdk.org/browse/JDK-8275610 fixed (the first)

karianna avatar Sep 23 '22 14:09 karianna

Pinged about this at EclipseCon by @tivervac

tellison avatar Oct 26 '22 20:10 tellison

Maintainers are commenting on the upstream issue, but it's still open for now - https://bugs.openjdk.org/browse/JDK-8285835.

karianna avatar Oct 27 '22 00:10 karianna

It looks like there is some PR now: https://github.com/openjdk/jdk/pull/10894

I analyzed it and figured out where the issue happens with our Lucene code. The test case is hard to understand but it seems to happen if you have a loop over some code dereferencing object instances through multiple layers (A wraps B wraps C).
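
To make the code shape described above concrete, here is a minimal sketch of a hot loop that dereferences through three wrapper layers. The class names `A`, `B`, `C` and the method `sum` are hypothetical and only illustrate the pattern; this is not the Lucene test case and is not expected to reproduce the crash.

```java
public class WrapperChainSketch {
    // Hypothetical three-layer wrapper chain: A wraps B wraps C.
    static final class C {
        final int value;
        C(int value) { this.value = value; }
    }

    static final class B {
        final C c;
        B(C c) { this.c = c; }
    }

    static final class A {
        final B b;
        A(B b) { this.b = b; }
    }

    // Loop body dereferences through every layer on each iteration.
    static long sum(A[] items) {
        long total = 0;
        for (A a : items) {
            total += a.b.c.value;
        }
        return total;
    }

    public static void main(String[] args) {
        A[] items = new A[1_000];
        for (int i = 0; i < items.length; i++) {
            items[i] = new A(new B(new C(i)));
        }
        long total = 0;
        // Call the method many times so it becomes hot enough for the JIT.
        for (int round = 0; round < 20_000; round++) {
            total += sum(items);
        }
        System.out.println(total);
    }
}
```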

The same issue seems to also affect Ben Manes' Caffeine library: https://github.com/ben-manes/caffeine/issues/797

uschindler avatar Oct 28 '22 08:10 uschindler

@uschindler That's quite possible. By now we've been able to (temporarily?) work around this issue by not obfuscating one of our classes. We obfuscate using ZKM. That obfuscator likes to split off code and wrap it in other classes.

tivervac avatar Oct 28 '22 08:10 tivervac

So you don't have the source code of: C2: 25179 18398 4 com.sigasi.hdt.vhdl.effectanalysis.d::visitIdentifierPathElement (97 bytes)

??? Too bad. Maybe a disassembly of bytecode?
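
As a possible diagnostic step that was not proposed in this thread: the method named in the compile task line can be excluded from JIT compilation with the HotSpot option `-XX:CompileCommand=exclude,...`, which accepts the `Class::method` format printed in that line. The helper below is a hypothetical sketch that pulls the method token out of such a line and builds the option string. Whether excluding this one method avoids the crash is an assumption that would need testing.

```java
import java.util.Optional;

public class CompileTaskLine {
    // Finds the "package.Class::method" token in a compile task line and
    // turns it into a -XX:CompileCommand=exclude option string.
    static Optional<String> excludeOption(String compileTaskLine) {
        for (String token : compileTaskLine.trim().split("\\s+")) {
            if (token.contains("::")) {
                return Optional.of("-XX:CompileCommand=exclude," + token);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String line = "C2: 25179 18398 4 "
                + "com.sigasi.hdt.vhdl.effectanalysis.d::visitIdentifierPathElement (97 bytes)";
        System.out.println(excludeOption(line).orElse("no method found"));
    }
}
```

An excluded method runs in the interpreter only, so this trades performance for stability while narrowing down the culprit.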

uschindler avatar Oct 28 '22 10:10 uschindler

Sadly, I had this info at the time of writing, but not anymore at this point.

I'll see whether I can reproduce it again

tivervac avatar Nov 03 '22 09:11 tivervac

Am I correct that the upstream bug should be resolved in 19.0.2+7? I think we hit the same problem using 19.0.2+7.

vans239 avatar Jan 27 '23 11:01 vans239

No, the fix version is Java 20.

karianna avatar Jan 30 '23 23:01 karianna

@karianna How can I check it? I thought that https://bugs.openjdk.org/browse/JDK-8297510 corresponds to the issue above and should be resolved in 19.0.2 build 7.

vans239 avatar Jan 30 '23 23:01 vans239

https://bugs.openjdk.org/browse/JDK-8285835

karianna avatar Jan 30 '23 23:01 karianna

The link shows that the issue was backported. The issue is also listed in the release notes for 19.0.2+7: https://www.oracle.com/java/technologies/javase/19all-relnotes.html

vans239 avatar Jan 30 '23 23:01 vans239

Fair point, should be fixed then. If you have a new crash log I can post it.

karianna avatar Jan 31 '23 01:01 karianna

We are marking this issue as stale because it has not been updated for a while. This is just a way to keep the support issues queue manageable. It will be closed soon unless the stale label is removed by a committer, or a new comment is made.

github-actions[bot] avatar May 02 '23 00:05 github-actions[bot]

@vans239 Are you able to try the latest LTS or Java 20.0.1 and let us know if this is resolved for you?

karianna avatar May 03 '23 02:05 karianna

Sadly even with the backported bug fix mentioned above, we're still encountering the issue with Temurin 17.0.7+7

tivervac avatar Aug 09 '23 14:08 tivervac

> Are you able to try the latest LTS or Java 20.0.1 and let us know if this is resolved for you?

We were still observing issues with 19.0.2+7. I am currently trying 20.0.1+9 and have not been able to reproduce it so far. I will have more info next week when we roll it out fully to prod.

vans239 avatar Aug 09 '23 17:08 vans239

> Sadly even with the backported bug fix mentioned above, we're still encountering the issue with Temurin 17.0.7+7

I would try 17.0.8 just in case.

karianna avatar Aug 09 '23 18:08 karianna

Maybe related: https://github.com/openjdk/jdk/pull/15399 ... "SIGSEGV in PhaseIdealLoop::build_loop_late_post_work" ... but for a different reason. Backporting has not even been discussed for that one yet.

karniemi avatar Aug 24 '23 13:08 karniemi

Definitely possible, thanks for the link!

tivervac avatar Aug 28 '23 08:08 tivervac

Here are 2 crashes from Apache Pulsar, with full hs_err_pid*.log files: https://gist.github.com/lhotari/53b72683ad4f339dfbcfd8b9b97062b9

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f927e8d5113, pid=3924, tid=4012
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.8.1+1 (17.0.8.1+1) (build 17.0.8.1+1)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (17.0.8.1+1, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xad5113]  PhaseIdealLoop::build_loop_late_post_work(Node*, bool)+0xe3
#

This happens in 17.0.8.1. The Apache Pulsar issue is https://github.com/apache/pulsar/issues/19307. Any help is appreciated.

lhotari avatar Oct 31 '23 21:10 lhotari

Looks like the previously posted GH issue links to https://bugs.openjdk.org/browse/JDK-8314024, which will be backported to 17.0.10.
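
For anyone who wants to check at runtime whether a JVM on the 17 line is at or past that version, a minimal sketch using the standard `Runtime.Version` API follows. The class and method names are hypothetical, and it assumes the backport actually ships in 17.0.10 as planned. Other feature releases have their own fix versions and are not covered here.

```java
public class FixVersionCheck {
    // Assumed fix version for the 17.x line, taken from the comment above.
    static final Runtime.Version FIX_17 = Runtime.Version.parse("17.0.10");

    // True only for 17.x runtimes at or beyond the assumed fix version.
    static boolean hasBackport(Runtime.Version version) {
        return version.feature() == 17 && version.compareTo(FIX_17) >= 0;
    }

    public static void main(String[] args) {
        Runtime.Version current = Runtime.version();
        System.out.println(current + " has the 17.0.10 backport: " + hasBackport(current));
    }
}
```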

lhotari avatar Oct 31 '23 21:10 lhotari