Unable to expose Metrics..

Open EdinHodzic1105 opened this issue 4 years ago • 9 comments

Dear All,

Recently, after deploying some new ZooKeeper nodes (3.5.9), we noticed problems getting metrics exposed. Running curl on the server (curl localhost:10080) should return the metrics, but instead it reports an empty reply from the server (curl: (52) Empty reply from server). I looked at the JMX beans using jmxterm and that works, so it looks like the data is not being processed correctly somewhere between JMX and the JMX exporter. The exporter does work, however, when ZooKeeper is running in standalone mode.

Does anyone have an idea?

Thanks!

EdinHodzic1105 avatar Apr 21 '21 11:04 EdinHodzic1105

Issue found,

After some further testing we noticed that the ZooKeeper JMX metrics won't work when myid=0, so zk0 won't produce any metrics in this case.

Could you guys look at this please?

EdinHodzic1105 avatar Apr 22 '21 09:04 EdinHodzic1105

Do you mean the JMX beans don't provide metrics? In that case you should open an issue on https://github.com/apache/zookeeper. The jmx_exporter exposes metrics from JMX beans to Prometheus. If the JMX beans have no metrics, there's nothing the jmx_exporter can do.

fstab avatar May 16 '21 20:05 fstab

After several days scratching our heads, we've found that we're having the same issue in our zookeeper ensemble being monitored by Prometheus Exporter. In all environments, the zookeeper node with myid=0 does not return any metrics (Empty reply from server). Connecting through jconsole, we can see the metrics there.

alexandrejuma avatar Nov 12 '21 18:11 alexandrejuma

Seeing the same problem, works fine up to version 0.12. Most likely the same issue mentioned here: https://github.com/prometheus/jmx_exporter/issues/509

Zookeeper 3.6.3 on Java 11 if it's of any help.

tobgu avatar Feb 24 '22 10:02 tobgu

@tobgu is this for a Zookeeper cluster? Do you have configuration you can share?

I just tested using a single Zookeeper (version 3.6.3), jmx-exporter (version 0.16.1), and zookeeper.yml (https://github.com/confluentinc/jmx-monitoring-stacks/blob/6.1.0-post/shared-assets/jmx-exporter/zookeeper.yml) and don't see any issues.

root@X> cat /var/lib/zookeeper/myid
0
root@X> cat /opt/zookeeper/etc/zookeeper.properties
admin.enableServer=false
autopurge.purgeInterval=1
autopurge.snapRetainCount=10
clientPort=2181
dataDir=/var/lib/zookeeper
initLimit=5
maxClientCnxns=0
server.1=X:2888:3888
syncLimit=2
EXTRA_ARGS=-javaagent:/opt/jmx-exporter/jmx_exporter.jar=7070:/opt/jmx-exporter/zookeeper.yml

dhoard avatar Mar 04 '22 12:03 dhoard

Yes, it's a three node cluster running as a statefulset in Kubernetes. Only node 0 is affected, the others are fine.

This is the config of node 0 (/apache-zookeeper-3.6.3-bin/conf/zoo.cfg):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data
clientPort=2181
autopurge.snapRetainCount=5
autopurge.purgeInterval=1
server.0=0.0.0.0:2888:3888
server.1=zookeeper-1.zookeeper:2888:3888
server.2=zookeeper-2.zookeeper:2888:3888

This is the command line:

java -Dzookeeper.cnxTimeout=1000 -Dzookeeper.admin.enableServer=false -Dzookeeper.4lw.commands.whitelist=* -javaagent:/jmx_prometheus_agent.jar=8888:/apache-zookeeper-3.6.3-bin/conf/prometheus-config.yaml -Dzookeeper.root.logger=WARN,CONSOLE -cp /apache-zookeeper-3.6.3-bin/lib/*:/apache-zookeeper-3.6.3-bin/conf/ org.apache.zookeeper.server.quorum.QuorumPeerMain /apache-zookeeper-3.6.3-bin/conf/zoo.cfg

tobgu avatar Mar 15 '22 09:03 tobgu

Summary, TL;DR

The core issue is in the client_java/simpleclient_hotspot package / possibly a core JVM or JMX bug.

Due to the lack of error handling in client_java/simpleclient_httpserver, the underlying Exception is not being caught / logged.

Details

ThreadExports is used to get extended JVM metrics. This was added as part of the 0.5.0 release. As part of this code path, a ThreadMXBean is used to get all thread ids as well as associated ThreadInfo information.

ThreadInfo[] allThreads = threadBean.getThreadInfo(threadBean.getAllThreadIds(), 0);

https://github.com/prometheus/client_java/blob/4ce41e17c90022b2c307a3c51df5b3a917b80bed/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/ThreadExports.java#L125

Sometimes the array of thread ids contains a long value of 0. When it does, getThreadInfo correctly throws an IllegalArgumentException, since thread ids must be positive.
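
To make the failure mode concrete, here is a minimal sketch that passes a 0 id to getThreadInfo directly. The class and method names (ZeroThreadIdDemo, rejects) are made up for illustration; only the ThreadMXBean calls come from the JDK.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

// Hypothetical demo class, not part of client_java or jmx_exporter.
final class ZeroThreadIdDemo {

    // Returns true if getThreadInfo rejects the given ids with an
    // IllegalArgumentException, false if the call goes through.
    static boolean rejects(long[] ids) {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        try {
            threadBean.getThreadInfo(ids, 0);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] live = threadBean.getAllThreadIds();

        // Same ids with a single 0 appended, mimicking the bad entry.
        long[] withZero = Arrays.copyOf(live, live.length + 1);
        withZero[live.length] = 0L;

        System.out.println("live ids rejected:      " + rejects(live));
        System.out.println("ids with a 0 rejected:  " + rejects(withZero));
    }
}
```

A single bad id is enough to fail the whole call, which would explain why no metrics at all come back rather than just the thread metrics going missing.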

Because the calling code doesn't catch or log RuntimeException-derived exceptions, the exception is effectively "eaten".

https://github.com/prometheus/client_java/blob/4ce41e17c90022b2c307a3c51df5b3a917b80bed/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L86

The fix is to get the array of thread ids, remove any <= 0, and then get the array of ThreadInfo objects. This will have to be changed in client_java/simpleclient_hotspot.
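
A rough sketch of that filtering step, assuming nothing beyond the standard ThreadMXBean API. The class and method names (SafeThreadInfo, validThreadIds, allThreadInfo) are hypothetical and this is not the code that was eventually merged into client_java.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class SafeThreadInfo {

    // Drop any id that getThreadInfo would reject (anything <= 0).
    static long[] validThreadIds(long[] ids) {
        return Arrays.stream(ids).filter(id -> id > 0).toArray();
    }

    // Look up ThreadInfo only for the ids that survived the filter.
    // Entries can still be null for threads that ended in the meantime.
    static ThreadInfo[] allThreadInfo(ThreadMXBean threadBean) {
        long[] ids = validThreadIds(threadBean.getAllThreadIds());
        return threadBean.getThreadInfo(ids, 0);
    }

    public static void main(String[] args) {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        System.out.println(allThreadInfo(threadBean).length + " threads inspected");
    }
}
```

The trade-off is that a thread reporting an id of 0 is simply left out of the thread-state metrics, which seems preferable to losing the whole scrape.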

Additionally, I feel that we should change the client_java/simpleclient_httpserver HTTPMetricHandler to catch / log / rethrow any Exceptions. This type of handling would have made these issues easier to understand, etc. since it appears that the Sun HttpServer doesn't log Exceptions that occur in an HttpHandler.
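
As a sketch of the catch / log / rethrow idea, a wrapper around any HttpHandler could look roughly like this. LoggingHandler is a hypothetical name and java.util.logging is an arbitrary choice here; this is not the actual HTTPMetricHandler code.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper, for illustration only.
final class LoggingHandler implements HttpHandler {
    private static final Logger LOG = Logger.getLogger(LoggingHandler.class.getName());

    private final HttpHandler delegate;

    LoggingHandler(HttpHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        try {
            delegate.handle(exchange);
        } catch (IOException | RuntimeException e) {
            // Log before rethrowing so the failure shows up somewhere,
            // then let the server deal with the exchange as it did before.
            LOG.log(Level.SEVERE, "Unhandled exception in HTTP handler", e);
            throw e;
        }
    }
}
```

It would be registered the usual way, e.g. server.createContext("/metrics", new LoggingHandler(realHandler)), where realHandler stands for whatever handler is in use today.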

Next Steps

I plan on implementing a fix in the relevant client_java projects this week and test on a 3 node ZooKeeper cluster.

Not sure how we can unit test this, given that it is highly dependent on JVM specifics and on a specific configuration that seems to exacerbate the issue.

dhoard avatar May 17 '22 01:05 dhoard

More Research

It appears that ZooKeeper has code that overrides the thread getId() based on the value of myid which results in the issue.

https://issues.apache.org/jira/browse/ZOOKEEPER-4460

https://github.com/apache/zookeeper/blob/c74658d398cdc1d207aa296cb6e20de00faec03e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java#L116

https://github.com/apache/zookeeper/blob/c74658d398cdc1d207aa296cb6e20de00faec03e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java#L630
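
To illustrate the pattern described above, here is a simplified stand-in: a Thread subclass whose getId() returns a configured value instead of the JVM-assigned one. PeerThread is a made-up class, not ZooKeeper's QuorumPeer, and whether the overridden value is what the thread MXBean ends up seeing is assumed to depend on the JVM version.

```java
// Hypothetical stand-in for a server thread that reports its configured id.
// This is not ZooKeeper's actual code.
final class PeerThread extends Thread {
    private final long myid;

    PeerThread(long myid) {
        super("peer-" + myid);
        this.myid = myid;
    }

    // Thread.getId() is deprecated on newer JDKs, hence the suppression.
    @SuppressWarnings("deprecation")
    @Override
    public long getId() {
        // With myid=0 this reports an id the JVM itself never assigns,
        // since JVM-assigned thread ids are always positive.
        return myid;
    }

    public static void main(String[] args) {
        PeerThread peer = new PeerThread(0);
        System.out.println("reported id: " + peer.getId());
    }
}
```

Any code that trusts getId() to be positive, such as a lookup through getThreadInfo, can then be handed a 0.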


Per the ZooKeeper Administrator's Guide for the latest release...

"The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255. IMPORTANT: if you enable extended features such as TTL Nodes (see below) the id must be between 1 and 254 due to internal limitations."

(emphasis is mine)


Per this ZooKeeper issue, it appears that the value is not necessarily constrained and the range isn't consistent.

https://issues.apache.org/jira/browse/ZOOKEEPER-2503


This code (latest at the time of this update) reaffirms that the value of myid is not constrained.

https://github.com/apache/zookeeper/blob/c74658d398cdc1d207aa296cb6e20de00faec03e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java#L738

dhoard avatar May 17 '22 16:05 dhoard

Code to work around Zookeeper's choice of overriding Thread.getId() has been merged into prometheus/client_java v0.16.0

Resolution will require a new release of jmx-exporter using prometheus/client_java v0.16.0 or newer.

dhoard avatar Jun 16 '22 12:06 dhoard

@EdinHodzic1105 have you tested with the JMX Exporter version v0.16.0 or newer?

Have you resolved this issue?

If there are no updates within 1 week, this will be closed as inactive.

dhoard avatar Apr 14 '23 12:04 dhoard