apm-agent-java icon indicating copy to clipboard operation
apm-agent-java copied to clipboard

NullPointerException that breaks the postgresql driver

Open aydasraf opened this issue 2 years ago • 4 comments

We are using the Elastic Java APM versions 1.33 and 1.34 in Auto Attachement deployment model within our environments.

Randomly, we get the below exception on many different java services that are monitored by the agents with the following Error stack:

java.lang.NullPointerException: null
	at org.postgresql.core.PGStream.<init>(PGStream.java:73)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:211)
	at org.postgresql.Driver.makeConnection(Driver.java:459)
	at org.postgresql.Driver.connect(Driver.java:261)
	... 21 common frames omitted
Wrapped by: org.postgresql.util.PSQLException: Something unusual has occurred to cause the driver to fail. Please report this exception.
	at org.postgresql.Driver.connect(Driver.java:282)
	at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:319)
	at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:212)
	at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:744)
	at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:676)
	at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:483)
	at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:154)
	at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
	at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
	at org.apache.tomcat.jdbc.pool.DataSourceProxy.getPool(DataSourceProxy.java:214)
	at org.springframework.boot.jdbc.metadata.TomcatDataSourcePoolMetadata.getActive(TomcatDataSourcePoolMetadata.java:36)
	at wkda.common.metrics.DataSourceMetricsReader.getActiveConnectionCount(DataSourceMetricsReader.java:71)
	at io.micrometer.core.instrument.internal.DefaultGauge.value(DefaultGauge.java:54)
	at co.elastic.apm.agent.micrometer.MicrometerMeterRegistrySerializer.serializeMetricSet(MicrometerMeterRegistrySerializer.java:134)
	at co.elastic.apm.agent.micrometer.MicrometerMeterRegistrySerializer.serialize(MicrometerMeterRegistrySerializer.java:87)
	at co.elastic.apm.agent.micrometer.MicrometerMetricsReporter.run(MicrometerMetricsReporter.java:97)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

aydasraf avatar Sep 29 '22 08:09 aydasraf

The exception doesn't affect any functionality but will cause some metrics to fail to be sent. Mostly these won't matter, but needs better handling, we'll fix it for an upcoming version. If you are concerned about the metrics being missed, the cumulative ones will catch up, interval ones will be missed if you have intervals matching the elastic reporting interval, otherwise they are incorrect (https://github.com/elastic/apm-agent-java/pull/2801 is addressing that)

jackshirazi avatar Sep 29 '22 16:09 jackshirazi

@jackshirazi Thank you so much for the reply. We will be looking forward to the fix in the near future.

Do you have any ruff estimates on when to expect it? As it is causing a bit of chaos in our SRE log monitoring tools and alerts. Can this be suppressed by any means until the fix is done?

aydasraf avatar Oct 04 '22 05:10 aydasraf

You could turn off the metrics for now, ie disable the micrometer instrumentation.

Alternative is to fork the repo and wrap the gauge.value() call in a try-catch (and don't serialize the gauge stat if the exception is triggered). I think it's probably a quick fix, we have people on holiday at the moment so it hasn't yet been applied

jackshirazi avatar Oct 04 '22 13:10 jackshirazi

Hey @aydasraf,

I had another look at our code, the error handling of Micrometer Meters there already seems fine. We loop over all Meters to collect their values. We catch any thrown exceptions in the loop, e.g. one failing Meter won't cause the other Meters to not be collected. After a Meter has failed once, it won't be collected by the agent anymore until a restart.

Did you see the exception within the agent-logs or within the application logs? The agent-logs should only contain the exception and stack trace when on debug level.

JonasKunz avatar Oct 11 '22 08:10 JonasKunz

Hey @JonasKunz Thank you for your reply. It was part of the application logs, not the agent ones.

aydasraf avatar Oct 14 '22 09:10 aydasraf

In that case there unfortunately isn't really anything we can do on the agent side. You could only disable the causing micrometer metric as @jackshirazi suggested.

Other than that I would suggest finding out which library (e.g. tomcat connection pool or postgres driver) actually prints the exception to stdout and maybe open an issue there.

JonasKunz avatar Oct 14 '22 10:10 JonasKunz

I'm closing this issue for now, as I don't see anything we can do here. Feel free to reopen if required.

JonasKunz avatar Nov 02 '22 08:11 JonasKunz