apm-agent-java
apm-agent-java copied to clipboard
SIGSEGV Tomcat 10.1 Java 17
Hi,
Elastic Apm agent (version 1.46.0) causes SIGSEGV crash in JVM (17.0.10)
Then crash occures usually in 30min after kubernetes pod starts My application is using tomcat 10.1.18 docker image
In attachment you can find Error Report from JVM
Application works properly with disabled/removed Elastic APM Agent
is there any workaround for that? maybe some fix?
Does it still happen if you use a different garbage collector?
I didn't try it. My application uses G1GC
Should i try SerialGC?
On another enviroment my application uses SerialGC, but the problem still exists
Any chance of producing a test we can reproduce?
Can you post a couple of the other crashlogs too please
Also just to confirm, you haven't enabled inferred spans? You can see the properties that are set in the startup info (agent output with lines including StartupInfo)
I think that i didn't enable inferred spans
2024-02-06 21:00:51,269 [main] INFO co.elastic.apm.agent.util.JmxUtils - Found JVM-specific OperatingSystemMXBean interface: com.sun.management.OperatingSystemMXBean 2024-02-06 21:00:51,372 [main] INFO co.elastic.apm.agent.util.JmxUtils - Found JVM-specific ThreadMXBean interface: com.sun.management.ThreadMXBean 2024-02-06 21:00:51,395 [main] INFO co.elastic.apm.agent.configuration.StartupInfo - Starting Elastic APM 1.46.0 as app on Java 17.0.10 Runtime version: 17.0.10+7 VM version: 17.0.10+7 (Eclipse Adoptium) Linux 5.15.4-1.el7.elrepo.x86_64 2024-02-06 21:00:51,396 [main] INFO co.elastic.apm.agent.configuration.StartupInfo - service_name: 'app' (source: Java System Properties) 2024-02-06 21:00:51,396 [main] INFO co.elastic.apm.agent.configuration.StartupInfo - secret_token: 'XXXX' (source: Environment Variables) 2024-02-06 21:00:51,397 [main] INFO co.elastic.apm.agent.configuration.StartupInfo - server_urls: 'https://server1:8200,https://server1:8200' (source: Java System Properties) 2024-02-06 21:00:51,397 [main] INFO co.elastic.apm.agent.configuration.StartupInfo - application_packages: 'package1,package2,package3' (source: Java System Properties) 2024-02-06 21:00:53,178 [main] INFO co.elastic.apm.agent.impl.ElasticApmTracer - Tracer switched to RUNNING state 2024-02-06 21:00:54,004 [elastic-apm-server-healthcheck] INFO co.elastic.apm.agent.report.ApmServerHealthChecker - Elastic APM server is available: { "build_date": "2022-01-28T08:52:26Z", "build_sha": "a1459eb3f24e7c054794a82f6c19b66975098866", "publish_ready": true, "version": "7.17.0"} 2024-02-06 21:00:54,047 [elastic-apm-server-healthcheck] INFO co.elastic.apm.agent.report.ApmServerHealthChecker - Elastic APM server is available: { "build_date": "2022-01-28T08:52:26Z", "build_sha": "a1459eb3f24e7c054794a82f6c19b66975098866", "publish_ready": true, "version": "7.17.0"}
What is the most important for you in other crashlogs? Problematic frame?
J 20648 c2 co.elastic.apm.agent.loginstr.reformatting.AbstractEcsReformattingHelper.onAppendEnter(Ljava/lang/Object;)Z (124 bytes) @ 0x00007f20ae058090 [0x00007f20ae057ce0+0x00000000000003b0]
Are all the stacks the same in the crashlogs? If any are different we'd like to see any other stacks.
Two suggestions we'd like you to try separately:
- Generate a JITC log eg -XX:+UnlockDiagnosticVMOptions -Xlog:class+load=info -XX:+LogCompilation (more details here ) and specifically identify if there are any JITC actions just before the crash (or include the log here, we can look)
- Exclude the crashing method from JITC, eg adding to the command line -XX:CompileCommand=exclude,co/elastic/apm/agent/collections/CachedKeyWeakConcurrentMap.getLookupKey
Ad.2 I added
-XX:CompileCommand=exclude,co/elastic/apm/agent/collections/CachedKeyWeakConcurrentMap.getLookupKey
Now the problematic frame is : # J 2489 c2 co.elastic.apm.agent.weakconcurrent.CachedLookupKey.get(Ljava/lang/Object;)Lco/elastic/apm/agent/weakconcurrent/CachedLookupKey; (16 bytes) @ 0x00007f1168fcfbc6 [0x00007f1168fcfac0+0x0000000000000106]
ErrorFile in attachment
Did you get any info from the JITC logging?
Can we try more generic JITC disabling, by now adding to the command line
-XX:CompileCommand="exclude,co/elastic/apm/agent/collections/CachedKeyWeakConcurrentMap.*" -XX:CompileCommand="exclude,co/elastic/apm/agent/weakconcurrent/CachedLookupKey.*" -XX:CompileCommand="exclude,co/elastic/apm/agent/loginstr/reformatting/AbstractEcsReformattingHelper.*"
It's not at all clear why this is crashing, the hotspot logging is itself crashing (hence the incomplete crashlog) but it's successfully seeing everything in the Elastic data structure, so it's something outside that, it's a process of narrowing down by excluding other things
If you need an urgent workaround, you can disable the reformatting eg -Delastic.apm.disable_instrumentations=log-reformatting
JITC log will be available soon...
I Added below instruction to command line
-XX:CompileCommand="exclude,co/elastic/apm/agent/collections/CachedKeyWeakConcurrentMap.*" -XX:CompileCommand="exclude,co/elastic/apm/agent/weakconcurrent/CachedLookupKey.*" -XX:CompileCommand="exclude,co/elastic/apm/agent/loginstr/reformatting/AbstractEcsReformattingHelper.*"
and now problematic frame is: J 875 java.lang.System.identityHashCode(Ljava/lang/Object;)I [email protected] (0 bytes) @ 0x00007f5d84f568ce [0x00007f5d84f568a0+0x000000000000002e]
New error file is available in attachment
Yes, this is crashing inside the JVM native code in ObjectSynchronizer::FastHashCode. It's not at all clear why that would happen, we'll continue to look into it but I don't expect any quick solution, it's not even clear this is an agent bug. The best we can suggest is the workaround of disabling the log-reformatting given above.
If you can create a standalone example that you can pass to us to reproduce this crash, that would be awesome!
Hi @fr19k ,
We have another user with the same crash: Both crashes occur due to java.util logging reformatting on the same System.identityHashCode
invocation.
One thing I noticed is that both you and the other user specify the -java.util.logging.manager=org.apache.juli.ClassLoaderLogManager
flag to configure the log manager.
Could you maybe try again without that flag (so using the default logmanager) and check whether the error still occurs?
OK, i will check your suggestion with log manager
I can confirm that workaround with -Delastic.apm.disable_instrumentations=log-reformatting works properly (application is not crashing)
What is the default java.util.logging.manager ?
Tomcat seems to allows to overwrite java.util.logging.manager via LOGGING_MANAGER env, but i have specify proper class
I think it should be java.util.logging.LogManager
, though not 100% sure here. That class has a protected constructor, so we might need to create a subclass to reset it.
Alternatively you might be able to simply remove the part where it configures java.util.logging.manager
from the tomcat startup script.
If i set env variable LOGGING_MANAGER to java.util.logging.LogManager , application crashes at startup with below message
Error: Main method not found in class java.util.logging.LogManager, please define the main method as:
public static void main(String[] args)
or a JavaFX application class must extend javafx.application.Application
Based on this SO thread it looks like you need to provide the full arguments for the LOGGING_MANAGER
env variable:
LOGGING_MANAGER="-Djava.util.logging.manager=java.util.logging.LogManager"
If that doesn't work, I think we can also trick tomcat by setting LOGGING_MANAGER="-Ddummy.prop=foobar"
: I'm assuming here that the tomcat startup script checks if LOGGING_MANAGER
is empty and only in that case applies it's own log manager.
As @jackshirazi mentioned earlier, it would most likely be easier for us to find the root cause if you could provide us an application / docker image to allow us to reproduce this crash ourselves.
Hi @fr19k,
Given that the crash seems quite fast to trigger here, we'd like to have a bit more information on your app/setup to enable us attempting to reproduce it.
could you provide the following:
- agent configuration: some config is passed through JVM variables, but the log reformatting seems enabled through env variables.
- few implementation details on the app: what are the dependencies, does it uses
java.util.logging
for logging directly or is it only for the Tomcat part ? - how much activity is there on the app before it gets triggered ? for example is that an "intensive load test for 30min", or does it just crashes with few hundreds of transactions over 30min ? The number of transactions between the start and the time it crashed in Kibana could provide a good hint here. Also, do you have a rough estimate on the GC activity in terms of overhead or executions per minute ?
Hi,
agent configuration: some config is passed through JVM variables, but the log reformatting seems enabled through env variables.
Via env variables i'm passing ELASTIC_APM_SECRET_TOKEN, any other configs are passed through JVM variables
few implementation details on the app: what are the dependencies, does it uses java.util.logging for logging directly or is it only for the Tomcat part ?
if i set LOGGING_MANAGER to, "-Djava.util.logging.manager=java.util.logging.LogManager" or "-Ddummy.prop=foobar" i receive many errors from tomcat like
java.lang.ClassNotFoundException: 2localhost.org.apache.juli.AsyncFileHandler at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) at java.logging/java.util.logging.LogManager.createLoggerHandlers(LogManager.java:1005) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:975) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:971) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:971) at java.logging/java.util.logging.LogManager.initializeGlobalHandlers(LogManager.java:2424) at java.logging/java.util.logging.LogManager$RootLogger.addHandler(LogManager.java:2514) at java.logging/java.util.logging.LogManager.setLoggerHandlers(LogManager.java:989) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:974) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:971) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:971) at java.logging/java.util.logging.LogManager.addLogger(LogManager.java:1204) at java.logging/java.util.logging.LogManager$2.run(LogManager.java:398) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.ensureLogManagerInitialized(LogManager.java:383) at java.logging/java.util.logging.LogManager.getLogManager(LogManager.java:431) at java.logging/sun.util.logging.internal.LoggingProviderImpl.demandJULLoggerFor(LoggingProviderImpl.java:409) at java.logging/sun.util.logging.internal.LoggingProviderImpl.demandLoggerFor(LoggingProviderImpl.java:438) at java.base/jdk.internal.logger.DefaultLoggerFinder.getLogger(DefaultLoggerFinder.java:159) at java.base/jdk.internal.logger.LazyLoggers.getLoggerFromFinder(LazyLoggers.java:410) at java.base/jdk.internal.logger.LazyLoggers.getLazyLogger(LazyLoggers.java:465) at java.base/sun.util.logging.PlatformLogger.getLogger(PlatformLogger.java:297) at java.base/sun.net.www.protocol.http.HttpURLConnection.<clinit>(HttpURLConnection.java:436) at java.base/sun.net.www.protocol.http.Handler.openConnection(Handler.java:62) at java.base/java.net.URL.openConnection(URL.java:1144) at co.elastic.apm.agent.util.UrlConnectionUtils.openUrlConnectionThreadSafely(UrlConnectionUtils.java:47) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.executeRequest(CloudMetadataProvider.java:418) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.getAwsMetadata(CloudMetadataProvider.java:222) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.access$000(CloudMetadataProvider.java:53) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider$1.call(CloudMetadataProvider.java:155) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider$1.call(CloudMetadataProvider.java:149) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at co.elastic.apm.agent.util.ExecutorUtils$2.run(ExecutorUtils.java:99) at java.base/java.lang.Thread.run(Thread.java:840) Can't load log handler "3manager.org.apache.juli.AsyncFileHandler" java.lang.ClassNotFoundException: 3manager.org.apache.juli.AsyncFileHandler java.lang.ClassNotFoundException: 3manager.org.apache.juli.AsyncFileHandler at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) at java.logging/java.util.logging.LogManager.createLoggerHandlers(LogManager.java:1005) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:975) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:971) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:971) at java.logging/java.util.logging.LogManager.initializeGlobalHandlers(LogManager.java:2424) at java.logging/java.util.logging.LogManager$RootLogger.addHandler(LogManager.java:2514) at java.logging/java.util.logging.LogManager.setLoggerHandlers(LogManager.java:989) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:974) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:971) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:971) at java.logging/java.util.logging.LogManager.addLogger(LogManager.java:1204) at java.logging/java.util.logging.LogManager$2.run(LogManager.java:398) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.ensureLogManagerInitialized(LogManager.java:383) at java.logging/java.util.logging.LogManager.getLogManager(LogManager.java:431) at java.logging/sun.util.logging.internal.LoggingProviderImpl.demandJULLoggerFor(LoggingProviderImpl.java:409) at java.logging/sun.util.logging.internal.LoggingProviderImpl.demandLoggerFor(LoggingProviderImpl.java:438) at java.base/jdk.internal.logger.DefaultLoggerFinder.getLogger(DefaultLoggerFinder.java:159) at java.base/jdk.internal.logger.LazyLoggers.getLoggerFromFinder(LazyLoggers.java:410) at java.base/jdk.internal.logger.LazyLoggers.getLazyLogger(LazyLoggers.java:465) at java.base/sun.util.logging.PlatformLogger.getLogger(PlatformLogger.java:297) at java.base/sun.net.www.protocol.http.HttpURLConnection.<clinit>(HttpURLConnection.java:436) at java.base/sun.net.www.protocol.http.Handler.openConnection(Handler.java:62) at java.base/java.net.URL.openConnection(URL.java:1144) at co.elastic.apm.agent.util.UrlConnectionUtils.openUrlConnectionThreadSafely(UrlConnectionUtils.java:47) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.executeRequest(CloudMetadataProvider.java:418) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.getAwsMetadata(CloudMetadataProvider.java:222) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.access$000(CloudMetadataProvider.java:53) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider$1.call(CloudMetadataProvider.java:155) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider$1.call(CloudMetadataProvider.java:149) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at co.elastic.apm.agent.util.ExecutorUtils$2.run(ExecutorUtils.java:99) at java.base/java.lang.Thread.run(Thread.java:840) Can't load log handler "4host-manager.org.apache.juli.AsyncFileHandler" java.lang.ClassNotFoundException: 4host-manager.org.apache.juli.AsyncFileHandler java.lang.ClassNotFoundException: 4host-manager.org.apache.juli.AsyncFileHandler at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) at java.logging/java.util.logging.LogManager.createLoggerHandlers(LogManager.java:1005) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:975) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:971) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:971) at java.logging/java.util.logging.LogManager.initializeGlobalHandlers(LogManager.java:2424) at java.logging/java.util.logging.LogManager$RootLogger.addHandler(LogManager.java:2514) at java.logging/java.util.logging.LogManager.setLoggerHandlers(LogManager.java:989) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:974) at java.logging/java.util.logging.LogManager$4.run(LogManager.java:971) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.loadLoggerHandlers(LogManager.java:971) at java.logging/java.util.logging.LogManager.addLogger(LogManager.java:1204) at java.logging/java.util.logging.LogManager$2.run(LogManager.java:398) at java.base/java.security.AccessController.doPrivileged(AccessController.java:318) at java.logging/java.util.logging.LogManager.ensureLogManagerInitialized(LogManager.java:383) at java.logging/java.util.logging.LogManager.getLogManager(LogManager.java:431) at java.logging/sun.util.logging.internal.LoggingProviderImpl.demandJULLoggerFor(LoggingProviderImpl.java:409) at java.logging/sun.util.logging.internal.LoggingProviderImpl.demandLoggerFor(LoggingProviderImpl.java:438) at java.base/jdk.internal.logger.DefaultLoggerFinder.getLogger(DefaultLoggerFinder.java:159) at java.base/jdk.internal.logger.LazyLoggers.getLoggerFromFinder(LazyLoggers.java:410) at java.base/jdk.internal.logger.LazyLoggers.getLazyLogger(LazyLoggers.java:465) at java.base/sun.util.logging.PlatformLogger.getLogger(PlatformLogger.java:297) at java.base/sun.net.www.protocol.http.HttpURLConnection.<clinit>(HttpURLConnection.java:436) at java.base/sun.net.www.protocol.http.Handler.openConnection(Handler.java:62) at java.base/java.net.URL.openConnection(URL.java:1144) at co.elastic.apm.agent.util.UrlConnectionUtils.openUrlConnectionThreadSafely(UrlConnectionUtils.java:47) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.executeRequest(CloudMetadataProvider.java:418) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.getAwsMetadata(CloudMetadataProvider.java:222) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider.access$000(CloudMetadataProvider.java:53) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider$1.call(CloudMetadataProvider.java:155) at co.elastic.apm.agent.impl.metadata.CloudMetadataProvider$1.call(CloudMetadataProvider.java:149) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at co.elastic.apm.agent.util.ExecutorUtils$2.run(ExecutorUtils.java:99) at java.base/java.lang.Thread.run(Thread.java:840)
My application uses logback for logging. The class java.util.logging is used only for tomcat part
how much activity is there on the app before it gets triggered ? for example is that an "intensive load test for 30min", or does it just crashes with few hundreds of transactions over 30min ? The number of transactions between the start and the time it crashed in Kibana could provide a good hint here. Also, do you have a rough estimate on the GC activity in terms of overhead or executions per minute ?
My application crashes in every 30-60mins after normal traffic. When I disable log-reformatting, pod is stable (15days without restart)
There was no intensive load tests, it's just normal production traffic.
In other application i have many messages from APM agent like
2024-03-12 11:27:54,279 [Worker-45] WARN co.elastic.apm.agent.impl.transaction.Span - Max spans (200) for transaction 'ACM_JOB' 00-74dd6dc51fe53e21518509858f7f1eca-e59cd06e8969cc62-01 (49052de7) has been reached. For this transaction and possibly others, further spans will be dropped. See config param 'transaction_max_spans'.
But Agent is working stable.
In application(which crash every 30-60mins), there is no logs like above
GcLogs looks good (there are no full gc pauses), throughput ~99.5%
Below you can see a chart with typical(production) number of transactions
Thanks for your feedback, with that we should be able to attempt reproducing it.
One interesting part is that while the application uses logback for logging, the JUL logging API used by Tomcat should have very light usage in production.
Logback is used due there is plugin(logback-logstash-encoder) to sending logs directly to logstash
Hi @fr19k, an Oracle engineer has suggested running the test with -Xint - which although it would be much slower, could indicate whether or not this is a compiler bug. Would you be willing to run that test? It would likely need to run for much longer than previous tests
Hi, no need to try this, we've reproduced the crash and found that disabling the c2 compiler (with -XX:TieredStopAtLevel=1
) stops it crashing, so most likely this is a c2 compiler bug. We'll continue to work on a resolution for this
To be closed after JVM release with https://bugs.openjdk.org/browse/JDK-8322726 fixed (we also expect a backport to 17 and 21, so wait for those for tracking purposes before closing)
Hi @yuvalbar84 this is not a jar option, the c2 compiler is a JIT compiler in the JVM. Your options are any of these
- run the JVM with the command line flag
-XX:TieredStopAtLevel=1
- run the agent with disable_instrumentations=log-reformatting, eg if setting the property on the command line this would be -Delastic.apm.disable_instrumentations=log-reformatting
- wait for the next release of the JVM, the JVM bug fix has been merged into 23 and backported to 21 and 17 so will be included in the next release of each of these JVM versions
Should be fixed now in JDK 17.0.12 according to the release notes.
@JonasKunz thank you !
MInor correction:
The issue is fixed for the oracle bundled patch releases 17.0.11.0.1 and 17.0.12.0.1. For OpenJDK the fix will ship with 17.0.13
. (JVM Bug)