apm-agent-java
NullPointerException that breaks the postgresql driver
We are using the Elastic APM Java agent, versions 1.33 and 1.34, in the auto-attachment deployment model within our environments.
Randomly, we get the exception below on many different Java services that are monitored by the agents, with the following error stack:
java.lang.NullPointerException: null
at org.postgresql.core.PGStream.<init>(PGStream.java:73)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:211)
at org.postgresql.Driver.makeConnection(Driver.java:459)
at org.postgresql.Driver.connect(Driver.java:261)
... 21 common frames omitted
Wrapped by: org.postgresql.util.PSQLException: Something unusual has occurred to cause the driver to fail. Please report this exception.
at org.postgresql.Driver.connect(Driver.java:282)
at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:319)
at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:212)
at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:744)
at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:676)
at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:483)
at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:154)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
at org.apache.tomcat.jdbc.pool.DataSourceProxy.getPool(DataSourceProxy.java:214)
at org.springframework.boot.jdbc.metadata.TomcatDataSourcePoolMetadata.getActive(TomcatDataSourcePoolMetadata.java:36)
at wkda.common.metrics.DataSourceMetricsReader.getActiveConnectionCount(DataSourceMetricsReader.java:71)
at io.micrometer.core.instrument.internal.DefaultGauge.value(DefaultGauge.java:54)
at co.elastic.apm.agent.micrometer.MicrometerMeterRegistrySerializer.serializeMetricSet(MicrometerMeterRegistrySerializer.java:134)
at co.elastic.apm.agent.micrometer.MicrometerMeterRegistrySerializer.serialize(MicrometerMeterRegistrySerializer.java:87)
at co.elastic.apm.agent.micrometer.MicrometerMetricsReporter.run(MicrometerMetricsReporter.java:97)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
The exception doesn't affect any functionality, but it will cause some metrics to fail to be sent. In most cases this won't matter, but it does need better handling, and we'll fix it in an upcoming version. If you are concerned about the missed metrics: cumulative metrics will catch up; interval metrics will be missed if your intervals match the Elastic reporting interval, and otherwise they will be incorrect (https://github.com/elastic/apm-agent-java/pull/2801 is addressing that).
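For illustration, the distinction is between cumulative meters, which keep a running total that is still correct on the next successful report, and sampled meters such as gauges, whose value for a missed interval is simply lost. The following minimal Micrometer sketch (names are made up) shows the two kinds side by side:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.atomic.AtomicInteger;

public class CumulativeVsSampledMetrics {
    public static void main(String[] args) {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();

        // Cumulative: a counter keeps a running total, so a report skipped in one
        // interval is still reflected in the total reported in the next interval.
        Counter requests = Counter.builder("demo.requests").register(registry);
        requests.increment();

        // Sampled: a gauge is read at report time; if that read fails (or the report
        // is skipped), the value for that interval is lost rather than caught up later.
        AtomicInteger activeConnections = new AtomicInteger(3);
        Gauge.builder("demo.active.connections", activeConnections, AtomicInteger::get)
            .register(registry);

        System.out.println(requests.count());                                        // 1.0
        System.out.println(registry.get("demo.active.connections").gauge().value()); // 3.0
    }
}
```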
@jackshirazi Thank you so much for the reply. We look forward to the fix in the near future.
Do you have any rough estimate of when to expect it? It is causing a bit of chaos in our SRE log-monitoring tools and alerts. Can this be suppressed by any means until the fix is done?
You could turn off the metrics for now, i.e. disable the micrometer instrumentation.
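One way to express that configuration in code is the agent's programmatic self-attach API; this is a different deployment model from remote auto-attachment, so treat it as a sketch of the disable_instrumentations setting rather than a drop-in change (the same key can be supplied as a system property or environment variable in an auto-attach setup, and "micrometer" as the group name is an assumption to verify against the agent's configuration docs):

```java
import co.elastic.apm.attach.ElasticApmAttacher;

import java.util.Collections;

public class AgentBootstrap {
    public static void main(String[] args) {
        // Disable the agent's Micrometer instrumentation group so the agent stops
        // polling application gauges. Assumes "micrometer" is a valid value for
        // disable_instrumentations; check the agent configuration reference.
        ElasticApmAttacher.attach(
            Collections.singletonMap("disable_instrumentations", "micrometer"));

        // ... start the application as usual
    }
}
```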
An alternative is to fork the repo and wrap the gauge.value() call in a try-catch (and skip serializing the gauge stat if the exception is triggered). I think it's probably a quick fix; we have people on holiday at the moment, so it hasn't been applied yet.
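A minimal sketch of that guard, using a hypothetical helper rather than the agent's actual serializer code (the real change would live in co.elastic.apm.agent.micrometer.MicrometerMeterRegistrySerializer):

```java
import io.micrometer.core.instrument.Gauge;

import java.util.function.DoubleConsumer;

public final class SafeGaugeSerialization {

    private SafeGaugeSerialization() {
    }

    /**
     * Reads the gauge value and hands it to the writer only if the read succeeds.
     * If the underlying callback throws (as the Tomcat pool / PostgreSQL driver does here),
     * the gauge is skipped for this reporting interval instead of failing the metric set.
     */
    static void writeGaugeIfReadable(Gauge gauge, DoubleConsumer writer) {
        double value;
        try {
            value = gauge.value(); // may call back into application code that can throw
        } catch (RuntimeException e) {
            return; // skip this gauge for the current interval (optionally log at debug)
        }
        if (!Double.isNaN(value)) {
            writer.accept(value);
        }
    }
}
```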
Hey @aydasraf,
I had another look at our code, and the error handling of Micrometer Meters there already seems fine.
We loop over all Meters to collect their values and catch any exception thrown in the loop, i.e. one failing Meter won't prevent the other Meters from being collected. After a Meter has failed once, it won't be collected by the agent anymore until a restart.
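To illustrate the behaviour described above, here is a simplified sketch of that collection pattern (the class and method names are made up; this is not the agent's actual code):

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FailureTolerantMeterCollector {

    // Meters that threw once are remembered and skipped on later collection runs.
    private final Set<Meter.Id> failedMeters = Collections.newSetFromMap(new ConcurrentHashMap<>());

    public void collect(MeterRegistry registry) {
        for (Meter meter : registry.getMeters()) {
            if (failedMeters.contains(meter.getId())) {
                continue; // excluded until the application (and agent) restarts
            }
            try {
                // Reading a meter can call back into application code (e.g. a gauge
                // querying a connection pool), which is where this exception originates.
                meter.measure().forEach(measurement -> record(meter, measurement.getValue()));
            } catch (RuntimeException e) {
                // One failing Meter must not prevent the others from being collected.
                failedMeters.add(meter.getId());
            }
        }
    }

    private void record(Meter meter, double value) {
        // Placeholder for serializing/reporting the value.
    }
}
```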
Did you see the exception in the agent logs or in the application logs? The agent logs should only contain the exception and stack trace when running at debug level.
Hey @JonasKunz Thank you for your reply. It was part of the application logs, not the agent ones.
In that case there unfortunately isn't really anything we can do on the agent side. You could only disable the Micrometer metric that causes it, as @jackshirazi suggested.
Other than that, I would suggest finding out which library (e.g. the Tomcat connection pool or the PostgreSQL driver) actually prints the exception to stdout and perhaps opening an issue there.
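As an illustration of disabling the offending metric on the application side, a Micrometer MeterFilter could be used; the metric name below is a placeholder for whatever name your DataSourceMetricsReader registers, and note that filters only affect meters registered after the filter is added:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;

public class MetricSuppressionConfig {

    /**
     * Denies the data-source gauge before it is registered, so neither the agent nor any
     * other reporter will try to read it. "datasource.connections.active" is a placeholder
     * for the actual name registered by the application's DataSourceMetricsReader.
     */
    public static void suppressFailingGauge(MeterRegistry registry) {
        registry.config().meterFilter(
            MeterFilter.denyNameStartsWith("datasource.connections.active"));
    }
}
```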
I'm closing this issue for now, as I don't see anything we can do here. Feel free to reopen if required.