JVM crash with SIGSEGV
Describe the bug
What: After updating to amazoncorretto:17 we've seen irregular JVM crashes for a workload, with the log below. The crash usually happens within the first 5 minutes after starting the workload. Up until the crash, the workload works as expected.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.2.8.1 (17.0.2+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C 0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# An error report file with more information is saved as:
# //hs_err_pid1.log
#
# Compiler replay data is saved as:
# //replay_pid1.log
#
# If you would like to submit a bug report, please visit:
# https://github.com/corretto/corretto-17/issues/
#
[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007fd7072cb23b]
How often: twice, with a period of 7 days in between
Where: Workload runs as an ECS Fargate task
Dumps: None, as the dumps were only written to ephemeral storage so far (if that worked as expected)
To Reproduce
No reliable reproduction as this happens very rarely.
Expected behavior
The JVM does not crash. When the JVM does crash, it is able to report the error correctly.
Platform information
OS: Amazon Linux 2
Version: Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS) (see log above)
Base-image: public.ecr.aws/amazoncorretto/amazoncorretto:17
For VM crashes, please attach the error report file. By default the file name is hs_err_pid<pid>.log, where <pid> is the process ID of the process. --> Unfortunately not available currently, as this has only been written to the ephemeral storage of the Fargate task container.
Thank you for considering this report! If there is additional information I can provide to help with resolving this, please do not hesitate to reach out!
This will be tough to troubleshoot without a core file or an hs_err log. Is there no way to get these artifacts from the Fargate container? Perhaps using ECS Exec? Could the container mount some durable storage? Perhaps using EFS with ECS?
Are you able to set the command line flags for the java process? Could you try running with -XX:+ErrorFileToStdout?
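If changing the container command itself is awkward, the flag can usually also be injected via the JAVA_TOOL_OPTIONS environment variable in the task definition (for example JAVA_TOOL_OPTIONS=-XX:+ErrorFileToStdout), though I haven't verified that in your particular setup.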
Thanks for the quick reply, @earthling-amzn!
I'll set up tooling to be prepared for the next crash and report back. Due to the irregularity of the crashes it might take a few days until I have more data. Thank you for your patience and understanding!
@earthling-amzn Good news! We've been able to observe another crash and your proposed option with -XX:+ErrorFileToStdout resulted in an error log (see below). Please note that I have removed some information and marked it with {REDACTED}.
With my limited understanding, it seems to be related to our use of Kotlin Co-Routine Flows, specifically the collect() method (at least in this instance of the issue)?
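To illustrate what I mean, the pattern looks roughly like the following simplified sketch (hypothetical names and values, not our actual service code):

import kotlinx.coroutines.flow.collect
import kotlinx.coroutines.reactive.asFlow
import kotlinx.coroutines.runBlocking
import reactor.core.publisher.Flux

// Simplified sketch of the suspected pattern: a Reactor Publisher bridged into
// a Kotlin Flow and consumed with collect(), which goes through
// kotlinx.coroutines.reactive.PublisherAsFlow under the hood.
fun main() = runBlocking {
    val upstream = Flux.range(1, 1_000)   // hypothetical upstream publisher
    upstream.asFlow()                     // Publisher -> Flow bridge
        .collect { value ->               // suspending terminal operator
            if (value % 100 == 0) println("processed $value")
        }
}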
Please do not hesitate to reach out, if I can support the process!
Thank you for your time and effort!
Thank you for sharing the crash log. To me, it looks like an issue with C2. I'm not very familiar with Kotlin Co-Routine Flows, so it would be helpful if you had a small bit of code to reproduce the crash. Do you know of any public projects that might use Flows? I could look there for benchmarks or tests to reproduce the crash.
It would be helpful to have the replay log from the compiler; could you have the JVM write that file out to persistent storage with -XX:ReplayDataFile=<path>? Are you able to exercise this code outside of a container? If we gave you a fastdebug build of the JVM (i.e., one with assertions enabled), would you be able to run that in your container?
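If you do manage to mount durable storage, pointing the flag at it (for example -XX:ReplayDataFile=/mnt/efs/replay_pid1.log, path purely for illustration) should let the replay file outlive the task.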
The DataDog agent also does a fair amount of bytecode instrumentation, which could also confuse the JIT compiler. You might want to explore options there to disable instrumentation.
@earthling-amzn thanks again for the quick response!
Reproduction: Unfortunately we still have not found a reliable way to reproduce the issue, which makes it very hard to build a limited-scope reproduction code example. We have not been able to reproduce the issue locally either, which might be either bad luck or some difference in environment (OSX + ARM locally vs Linux + x64 in our deployments). As soon as we find a reliable way to reproduce it, I will make sure to build a minimal reproduction example and share it with you.
Public projects: I'm not aware of any projects with usage akin to ours, which in this case consists of spring-reactor + Kotlin co-routines. I'll do some research on this topic over the weekend and share my findings.
Compiler replay log: We've added the requested flag and are waiting for another occurrence of the issue. I will report back as soon as I have more data.
Exercise code outside of a container: Yes, we're able to do this, but as mentioned before we have not been able to reproduce the crash outside of a container running in Fargate.
Fastdebug build: We should be able to run in our staging environment with a fastdebug build of the JVM if you could provide it either as an AL2 Docker image that we can build upon (preferred, as it is closer to our usage) or as binaries from which we could build our own AL2+Corretto base image.
DataDog agent: I will have a look at this, thanks for the advice!
Thanks again for your hard work and support on this issue!
I just want to clarify that we don't want to blame the DataDog agent for being responsible for the crash. It's just that through instrumentation the agent might create unusual bytecode patterns which the JIT compiler might not be prepared for. Excluding (or not excluding) the DD agent as a reason for this crash might help to isolate the problem and potentially create a reproducer.
Thanks for your support, Volker
Thanks for the clarification, Volker!
I'd like to ensure that I'm able to provide individual data for each change I'm making. Since the crashes are highly infrequent, it will probably take some time until I've been able to gather data on the different scenarios.
Scenario 1 (currently waiting for crash): no changes, capture compiler log
Scenario 2: exclude DataDog agent
Scenario 3: include fastdebug JVM build(?)
Here is a link to download a fastdebug build. The link will expire in 7 days (Feb 28th, 2022). Please note that although the fastdebug build is an optimized build, it has asserts enabled, so it will run somewhat slower than the release build. Hopefully an assert will catch the condition leading to the crash before the crash happens and terminate the VM with a helpful message.
@earthling-amzn Thank you for providing the fastdebug build! Unfortunately I get an ExpiredToken error when trying to access the link. Could you please re-generate the link?
Thanks in advance!
Sorry about that. Try this one.
Have you seen this crash in earlier versions of the JDK?
Thank you, the second link worked. I'll probably set it up tomorrow (due to meetings today) and a teammate of mine should be in touch soon.
We've only seen this issue in JDK 17; we've recently been upgrading from Corretto 11 to Corretto 17. We've also only seen this happen in this specific service. The setup for our services is pretty similar (Spring Boot + Kotlin + DataDog Agent on ECS).
Unfortunately we've not been able to capture the compiler replay log with -XX:ReplayDataFile= as the process seems to be terminated before this can happen.
We've integrated the fastdebug build in one of our environments and will report back with more information on the next occurrence of the issue.
Please note that a colleague will continue the communication with you as I will be leaving the team. Thank you for your understanding!
@earthling-amzn It's been some time, but we had to try out a few things... We excluded the Datadog agent and let it run for a while with the fastdebug build. Now we have been able to reproduce the crash once with the fastdebug build you provided.
Find the log file here (some information has been anonymized): jvm-crash-2022-04-11.log
You're probably mainly interested in the following?
# Internal Error (/home/jenkins/node/workspace/Corretto17/generic_linux/x64/build/Corretto17Src/installers/linux/universal/tar/corretto-build/buildRoot/src/hotspot/share/c1/c1_Instruction.cpp:848), pid=1, tid=22
# assert(existing_value == new_state->local_at(index) || (existing_value->as_Phi() != __null && existing_value->as_Phi()->as_Phi()->block() == this)) failed: phi function required
Hope this helps! Let me know in case of further questions, as I'll be taking over the communication from @aablsk.
That's very interesting and helps narrow the search. I don't suppose you have the compilation replay file given by -XX:ReplayDataFile=./replay_pid1.log?
This crash sure looks like: Hotspot C1 compiler crashes on Kotlin suspend fun with loop, which is patched in the 17.0.3 release. 17.0.3 is scheduled for release on April 19th, 2022.
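For reference, the code shape that issue title refers to is roughly a suspend fun containing a loop, along the lines of this illustrative sketch (assumed names, not the actual upstream reproducer):

import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Illustrative only: a suspend fun with a loop and a suspension point in the
// loop body, i.e. the general shape named in the linked issue title.
suspend fun pollUntilReady(maxAttempts: Int, isReady: () -> Boolean): Boolean {
    for (attempt in 1..maxAttempts) {   // loop inside a suspend fun
        if (isReady()) return true
        delay(100)                      // suspension point inside the loop
    }
    return false
}

fun main() = runBlocking {
    println(pollUntilReady(maxAttempts = 3, isReady = { false }))
}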
This is all good news, but I'm a little concerned that the original crash for this issue was in C2. You might want to disable tiered compilation with -XX:-TieredCompilation. This will effectively disable the C1 compiler (where this latest crash occurred) and will have all code compiled by C2 (where the crash in the original report occurred). Maybe just disable tiered compilation where you are running the fastdebug build?
Thanks for the hint. And sorry, no I don't have the replay file.
I disabled tiered compilation when running the fastdebug build and will monitor if the crash occurs again.
Since running the fastdebug build with tiered compilation disabled and without the Datadog agent, the crash has not occurred again on our development system. The system is not under high load though.
However, with JRE 17.0.3, the JVM still crashes on production:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.3.6.1 (17.0.3+6) (build 17.0.3+6-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.3.6.1 (17.0.3+6-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C 0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# If you would like to submit a bug report, please visit:
# https://github.com/corretto/corretto-17/issues/
#
--------------- S U M M A R Y ------------
Command Line: -XX:MaxRAMPercentage=70 -XX:+ErrorFileToStdout -XX:ReplayDataFile=./replay_pid1.log -javaagent:./dd-java-agent.jar cloud.rio.marketplace.productactivation.ProductActivationApplicationKt
Host: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz, 2 cores, 1G, Amazon Linux release 2 (Karoo)
Time: Thu Apr 28 07:50:45 2022 UTC elapsed time: 79.554398 seconds (0d 0h 1m 19s)
--------------- T H R E A D ---------------
Current thread (0x00007f79c806db50): JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=14, stack(0x00007f799beff000,0x00007f799c000000)]
Current CompileTask:
C2: 79554 21820 ! 4 kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)
Stack: [0x00007f799beff000,0x00007f799c000000], sp=0x00007f799bffba68, free space=1010k
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
...
Do you have the rest of that crash report? The replay file would also be very helpful to root cause the issue.
Find the full crash report here.
I don't have the replay file unfortunately, because the service is running on AWS Fargate without a persistent volume.
This is the same error as the one you originally reported: C2 fails to compile 'kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)'.
With -XX:ReplayDataFile=./replay_pid1.log, it's very likely we can reproduce this error. Is it possible for you to write it somewhere with persistent storage?
I implemented persisting the replay data file and will let you know when it is available.
I now have a replay file at hand. Should I upload it here? Is it fine if I anonymize some information (i.e. replace the package names)? Otherwise, how can I safely provide you with this file? Or is there anything else I need to do?
You may anonymize the file and upload it here and we'll see how far we get with it. We'll also look into ways to better exchange confidential files.
Here it is: 2022-05-05_replay_anonymized.log
I hope you can get some value out of it. Let me know if you need anything else.
Hi @fknrio, I'm trying to reproduce the crash from your replay file. One blocker is that your compilation unit contains 2 lambda classes.
6 16 reactor/core/publisher/Mono$$Lambda$2661+0x0000000801a70570 <init>
6 16 reactor/core/publisher/Flux$$Lambda$2755+0x0000000801ab0450 <init>
Those classes are generated dynamically. I don't have the class files, so we can't trigger the compilation on my side. Here is one workaround for this issue: you can pass the following option to java. 'DUMP_CLASS_FILES' is a directory, and you need to create it before executing. This will force java to dump all lambda classes into 'DUMP_CLASS_FILES'.
-Djdk.internal.lambda.dumpProxyClasses=DUMP_CLASS_FILES
Can you try that? Or do you have a simple reproducer (either in source code or a jar file) so we can step into it?
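One note: the property needs to be in effect from JVM startup (i.e., passed as a -D option on the java command line as above), because lambda classes generated before it takes effect will not be dumped.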
Hi @fknrio, it's also possible to recover the missing classes from a core file. If it's difficult to reproduce this problem from source code, how about sharing the coredump file with us?
Hi @navyxliu, I configured the appropriate options and will share the respective files once the crash occurs the next time.
Unfortunately, I don't have a simple reproducer, because for us too it only happens in a single service (although others are built very similarly).
Hi @navyxliu, I dumped the proxy classes (and excluded the com.example package): class_dump.tar.gz, together with this replay.log
I hope this helps!