
JVM crash with SIGSEGV

Open aablsk opened this issue 3 years ago • 61 comments

Describe the bug

What: After updating to amazoncorretto:17 we've seen sporadic JVM crashes for one workload, with the log below. The crash usually happens within the first 5 minutes after the workload starts. Up until the crash, the workload works as expected.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.2.8.1 (17.0.2+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# An error report file with more information is saved as:
# //hs_err_pid1.log
#
# Compiler replay data is saved as:
# //replay_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
#
[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007fd7072cb23b]

How often: twice, with 7 days in between
Where: the workload runs as an ECS Fargate task
Dumps: none, as the dumps were so far only written to ephemeral storage (assuming that worked as expected)

To Reproduce

No reliable reproduction as this happens very rarely.

Expected behavior

The JVM does not crash. If the JVM does crash, it is able to report the error correctly.

Platform information

OS: Amazon Linux 2
Version: Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS) (see log above)
Base-image: public.ecr.aws/amazoncorretto/amazoncorretto:17

For VM crashes, please attach the error report file. By default the file name is hs_err_pid<pid>.log, where <pid> is the process ID of the process. --> Unfortunately not available currently, as this was only written to the ephemeral storage of the Fargate task container.

Thank you for considering this report! If there is additional information I can provide to help with resolving this, please do not hesitate to reach out!

aablsk avatar Feb 16 '22 12:02 aablsk

This will be tough to troubleshoot without a core file or an hs_err log. Is there no way to get these artifacts from the Fargate container? Perhaps using ECS Exec? Could the container mount some durable storage, perhaps using EFS with ECS?

Are you able to set the command line flags for the java process? Could you try running with -XX:+ErrorFileToStdout?
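For example, something like this (your-app.jar is just a placeholder for however the process is started):

java -XX:+ErrorFileToStdout -jar your-app.jar

With that flag the hs_err report goes to stdout, so it should end up in the container's log stream (e.g., CloudWatch Logs) even without durable storage.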

earthling-amzn avatar Feb 16 '22 16:02 earthling-amzn

Thanks for the quick reply, @earthling-amzn!

I'll set up tooling to be prepared for the next crash and report back. Due to the irregularity of the crashes it might take a few days until I have more data. Thank you for your patience and understanding!

aablsk avatar Feb 17 '22 07:02 aablsk

@earthling-amzn Good news! We've been able to observe another crash and your proposed option with -XX:+ErrorFileToStdout resulted in an error log (see below). Please note that I have removed some information and marked it with {REDACTED}.

With my limited understanding, it seems to be related to our use of Kotlin coroutine Flows, specifically the collect() method (at least in this instance of the issue)?
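For illustration, here is a minimal, hypothetical sketch of the pattern in question (not our actual code; it assumes kotlinx-coroutines-reactive and Reactor on the classpath):

import kotlinx.coroutines.flow.*
import kotlinx.coroutines.reactive.asFlow
import kotlinx.coroutines.runBlocking
import reactor.core.publisher.Flux

// Hypothetical example, not our production code.
fun main() = runBlocking {
    // Bridge a Reactor publisher to a coroutine Flow; collect() drives the
    // subscription via kotlinx.coroutines.reactive.PublisherAsFlow.
    val flow = Flux.range(1, 100).asFlow()
    flow.collect { value ->
        println("received $value")
    }
}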

Please do not hesitate to reach out, if I can support the process!

Thank you for your time and effort!

error_log_jvm_crash_corretto_17.log

aablsk avatar Feb 17 '22 11:02 aablsk

Thank you for sharing the crash log. To me, it looks like an issue with C2. I'm not very familiar with Kotlin Co-Routine Flows, so it would be helpful if you had a small bit of code to reproduce the crash. Do you know of any public projects that might use Flows? I could look there for benchmarks or tests to reproduce the crash.

earthling-amzn avatar Feb 17 '22 16:02 earthling-amzn

It would be helpful to have the replay log from the compiler. Could you have the JVM write that file out to persistent storage with -XX:ReplayDataFile=<path>? Are you able to exercise this code outside of a container? If we gave you a fastdebug build of the JVM (i.e., one with assertions enabled), would you be able to run that in your container?
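For example (the /durable path here is just a placeholder for wherever persistent storage is mounted):

java -XX:ReplayDataFile=/durable/replay_pid1.log -XX:+ErrorFileToStdout -jar your-app.jar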

The DataDog agent also does a fair amount of bytecode instrumentation, which could also confuse the JIT compiler. You might want to explore options there to disable instrumentation.

earthling-amzn avatar Feb 17 '22 17:02 earthling-amzn

@earthling-amzn thanks again for the quick response!

Reproduction: Unfortunately we still have not found a reliable way to reproduce the issue, which makes it very hard to build a limited-scope reproduction example. We have not been able to reproduce the issue locally either, which might be bad luck or some difference in environment (macOS + ARM locally vs. Linux + x64 in our deployments). As soon as we find a reliable way to reproduce it, I will build a minimal reproduction example and share it with you.

Public projects: I'm not aware of any projects with usage akin to ours, which in this case consists of spring-reactor + Kotlin coroutines. I'll do some research on this topic over the weekend and share my findings.

Compiler replay log: We've added the requested flag and are waiting for another occurrence of the issue. I will report back as soon as I have more data.

Exercise code outside of a container: Yes, we're able to do this, but as mentioned before we have not been able to reproduce the crash outside of a container running on Fargate.

Fastdebug build: We should be able to run a fastdebug build of the JVM in our staging environment if you could provide it either as an AL2 Docker image that we can build upon (preferred, as it is closer to our usage) or as binaries from which we could build our own AL2 + Corretto base image.

DataDog agent: I will have a look at this, thanks for the advice!

Thanks again for your hard work and support on this issue!

aablsk avatar Feb 18 '22 08:02 aablsk

I just want to clarify that we don't want to blame the DataDog agent for the crash. It's just that, through instrumentation, the agent might create unusual bytecode patterns that the JIT compiler is not prepared for. Excluding (or not excluding) the DD agent as a cause of this crash might help isolate the problem and potentially lead to a reproducer.

Thanks for your support, Volker

simonis avatar Feb 18 '22 09:02 simonis

Thanks for the clarification, Volker!

I'd like to ensure that I can provide individual data for each change I make. Since the crashes are highly infrequent, it will probably take some time until I've gathered data on the different scenarios.

Scenario 1 (currently waiting for crash): no changes, capture compiler log
Scenario 2: exclude Datadog agent
Scenario 3: include fastdebug JVM build(?)

aablsk avatar Feb 18 '22 10:02 aablsk

Here is a link to download a fastdebug build. The link will expire in 7 days (Feb 28th, 2022). Please note that although the fastdebug build is an optimized build, it has asserts enabled, so it will run somewhat slower than the release build. The hope is that an assert will catch the condition leading to the crash before the crash itself occurs and terminate the VM with a helpful message.

earthling-amzn avatar Feb 21 '22 19:02 earthling-amzn

@earthling-amzn Thank you for providing the fastdebug build! Unfortunately I get an ExpiredToken error when trying to access the link. Could you please re-generate the link?

Thanks in advance!

aablsk avatar Feb 22 '22 08:02 aablsk

Sorry about that. Try this one.

earthling-amzn avatar Feb 22 '22 16:02 earthling-amzn

Have you seen this crash in earlier versions of the JDK?

earthling-amzn avatar Feb 22 '22 21:02 earthling-amzn

Thank you, the second link worked. I'll probably set it up tomorrow (due to meetings today), and a teammate of mine should be in touch soon.

We've only seen this issue in JDK 17; we've recently been upgrading from Corretto 11 to Corretto 17. We've also only seen this happen in this specific service, although the setup for our services is pretty similar (Spring Boot + Kotlin + DataDog Agent on ECS).

aablsk avatar Feb 23 '22 07:02 aablsk

Unfortunately we've not been able to capture the compiler replay log with -XX:ReplayDataFile=, as the process seems to be terminated before the file can be written.

We've integrated the fastdebug build in one of our environments and will report back with more information on the next occurrence of the issue.

Please note that a colleague will continue the communication with you as I will be leaving the team. Thank you for your understanding!

aablsk avatar Feb 24 '22 08:02 aablsk

@earthling-amzn It's been a while, but we had to try out a few things... We excluded the Datadog agent and let the service run for a while with the fastdebug build. We have now been able to reproduce the crash once with the fastdebug build you provided.

Find the log file here (some information has been anonymized): jvm-crash-2022-04-11.log

You're probably mainly interested in the following?

#  Internal Error (/home/jenkins/node/workspace/Corretto17/generic_linux/x64/build/Corretto17Src/installers/linux/universal/tar/corretto-build/buildRoot/src/hotspot/share/c1/c1_Instruction.cpp:848), pid=1, tid=22
#  assert(existing_value == new_state->local_at(index) || (existing_value->as_Phi() != __null && existing_value->as_Phi()->block() == this)) failed: phi function required

Hope this helps! Let me know in case of further questions, as I'll be taking over the communication from @aablsk.

fknrio avatar Apr 11 '22 07:04 fknrio

That's very interesting and helps narrow the search. I don't suppose you have the compilation replay file given by -XX:ReplayDataFile=./replay_pid1.log?

earthling-amzn avatar Apr 11 '22 15:04 earthling-amzn

This crash sure looks like Hotspot C1 compiler crashes on Kotlin suspend fun with loop, which is patched in the 17.0.3 release. 17.0.3 is scheduled for release on April 19th, 2022.

This is all good news, but I'm a little concerned that the original crash for this issue was in C2. You might want to disable tiered compilation with -XX:-TieredCompilation. This will effectively disable the C1 compiler (where this latest crash occurred) and will have all code compiled by C2 (where the crash in the original report occurred). Maybe just disable tiered compilation where you are running the fastdebug build?
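For example, something like this in the fastdebug environment (your-app.jar is a placeholder):

java -XX:-TieredCompilation -jar your-app.jar

(Conversely, -XX:TieredStopAtLevel=1 would restrict compilation to C1 only, should you ever want to isolate the C1 crash.)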

earthling-amzn avatar Apr 11 '22 15:04 earthling-amzn

Thanks for the hint. And sorry, no I don't have the replay file.

I disabled tiered compilation when running the fastdebug build and will monitor if the crash occurs again.

fknrio avatar Apr 12 '22 08:04 fknrio

Since we started running the fastdebug build with tiered compilation disabled and without the Datadog agent, the crash has not occurred again on our development system. The system is not under high load, though.

However, with JRE 17.0.3, the JVM still crashes on production:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.3.6.1 (17.0.3+6) (build 17.0.3+6-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.3.6.1 (17.0.3+6-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
#
---------------  S U M M A R Y ------------
Command Line: -XX:MaxRAMPercentage=70 -XX:+ErrorFileToStdout -XX:ReplayDataFile=./replay_pid1.log -javaagent:./dd-java-agent.jar cloud.rio.marketplace.productactivation.ProductActivationApplicationKt
Host: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz, 2 cores, 1G, Amazon Linux release 2 (Karoo)
Time: Thu Apr 28 07:50:45 2022 UTC elapsed time: 79.554398 seconds (0d 0h 1m 19s)
---------------  T H R E A D  ---------------
Current thread (0x00007f79c806db50):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=14, stack(0x00007f799beff000,0x00007f799c000000)]
Current CompileTask:
C2:  79554 21820   !   4       kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)
Stack: [0x00007f799beff000,0x00007f799c000000],  sp=0x00007f799bffba68,  free space=1010k
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
...

fknrio avatar Apr 29 '22 14:04 fknrio

Do you have the rest of that crash report? The replay file would also be very helpful to root cause the issue.

earthling-amzn avatar Apr 29 '22 15:04 earthling-amzn

Find the full crash report here.

I don't have the replay file unfortunately, because the service is running on AWS Fargate without a persistent volume.

fknrio avatar Apr 29 '22 15:04 fknrio

This is the same error as your original report: C2 fails to compile kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes).

With the replay file from -XX:ReplayDataFile=./replay_pid1.log, it's very likely we can reproduce this error. Is it possible for you to write it somewhere with persistent storage?
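For example, one way to get durable storage on Fargate is an EFS volume in the task definition; the file system ID and paths in this sketch are placeholders:

"volumes": [
  {
    "name": "replay-data",
    "efsVolumeConfiguration": { "fileSystemId": "fs-12345678" }
  }
],
"containerDefinitions": [
  {
    "mountPoints": [
      { "sourceVolume": "replay-data", "containerPath": "/mnt/replay" }
    ]
  }
]

The JVM could then run with -XX:ReplayDataFile=/mnt/replay/replay_pid1.log.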

navyxliu avatar Apr 29 '22 17:04 navyxliu

I implemented persisting the replay data file and will let you know when it is available.

fknrio avatar May 03 '22 11:05 fknrio

I now have a replay file at hand. Should I upload it here? Is it fine if I anonymize some information (i.e., replace the package names)? Otherwise, how can I safely provide you with this file? Is there anything else I need to do?

fknrio avatar May 05 '22 08:05 fknrio

You may anonymize the file and upload it here and we'll see how far we get with it. We'll also look into ways to better exchange confidential files.

earthling-amzn avatar May 05 '22 16:05 earthling-amzn

Here it is: 2022-05-05_replay_anonymized.log

I hope you can get some value out of it. Let me know if you need anything else.

fknrio avatar May 06 '22 08:05 fknrio

Hi @fknrio, I tried to reproduce the crash with your replay file. One blocker is that your compilation unit contains 2 lambda classes.

 6 16 reactor/core/publisher/Mono$$Lambda$2661+0x0000000801a70570 <init> 
 6 16 reactor/core/publisher/Flux$$Lambda$2755+0x0000000801ab0450 <init> 

Those classes are generated dynamically, and I don't have the class files, so we can't trigger the compilation on my side. Here is one workaround: you can pass the following option to java. DUMP_CLASS_FILES is a directory, and you need to create it before executing. This will force java to dump all generated lambda classes to DUMP_CLASS_FILES.

-Djdk.internal.lambda.dumpProxyClasses=DUMP_CLASS_FILES
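For example (your-app.jar is a placeholder for your actual start command):

mkdir DUMP_CLASS_FILES
java -Djdk.internal.lambda.dumpProxyClasses=DUMP_CLASS_FILES -jar your-app.jar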

Can you try that? Or do you have a simple reproducer (either source code or a jar file) that we can step into?

navyxliu avatar May 13 '22 01:05 navyxliu

Hi @fknrio, it's also possible to recover the missing classes from a core file. If it's difficult to reproduce this problem from source code, how about sharing the core dump file with us?

navyxliu avatar May 16 '22 18:05 navyxliu

Hi @navyxliu, I configured the appropriate options and will share the respective files the next time the crash occurs.

Unfortunately, I don't have a simple reproducer, because for us, too, it only happens in a single service (although others are built very similarly).

fknrio avatar May 17 '22 08:05 fknrio

Hi @navyxliu, I dumped the proxy classes (and excluded the com.example package): class_dump.tar.gz, together with this replay.log

Hope this helps?

fknrio avatar May 18 '22 13:05 fknrio