collectd-fast-jmx: One server can block polling for others

First, thanks for this. We are using it happily in production and it has solved an issue we had with the GenericJMX plugin.

We have one FastJMX plugin definition with multiple MBean entries (memory, threading, GC, ...) and multiple Connection blocks. Our monitoring shows FastJMX occasionally dropping out for all defined connections at once, which seems to coincide with one of the monitored machines running into an OutOfMemory error. During the dropout, collectd keeps collecting data for unrelated plugins. Restarting the affected machine revives FastJMX and brings back the graphs for all other connections as well. Until the restart, the collectd logs contain:

FastJMX Plugin: Failed to collect 98 of 98 samples within read interval with 2 threads

All machines are running JRuby on either JVM7 or 8. We are using 1.0.0 from collectd-fast-jmx.

Expected behaviour would be for all unrelated connections to continue.

Jan 05 '17 03:01 cburgmer

Apologies for the slow response on this, but thank you for the report!

This is definitely an issue. Most likely the instrumented server dealing with the OOME hasn't terminated its JMX connections and is still trying to service requests. In that situation it's not unreasonable for the requests to either never complete or take longer than the collect interval.

The way FastJMX prioritizes collection tasks, it tries to collect the 'slow' (high latency) metrics first. The theory was that you'd have less time-slew between all the metrics if the slow ones are collected first, then the faster ones. The issue you're running into is that you've got a hung server, so all those 'slow' metrics going 'first' occupy the available threads and block the ones that could complete.

The 'dirty' way to fix this would be to introduce a 'minThreads' setting, so that you could force at least (num beans collecting per server + 1) threads in the pool, which would give you some metrics.
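To make the blocking effect concrete, here is a minimal sketch of the scenario described above. It is a simplified model and not FastJMX's actual code: the class `HungServerSketch`, the method `collect`, and all of its parameters are hypothetical names chosen for illustration. It assumes the 'slow' tasks are submitted first to a fixed-size pool and that the hung server's calls never return within the interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical model of slow-first scheduling with a hung server; not FastJMX code.
class HungServerSketch {

    /**
     * Queues hungTasks tasks that never finish, followed by fastTasks tasks that
     * return immediately, on a fixed pool of poolSize threads. Returns how many
     * tasks completed before intervalMillis expired.
     */
    static int collect(int poolSize, int hungTasks, int fastTasks, long intervalMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Never counted down: stands in for a JMX call against the hung server.
        CountDownLatch never = new CountDownLatch(1);

        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (int i = 0; i < hungTasks; i++) {
            tasks.add(() -> { never.await(); return true; }); // "slow" tasks go first
        }
        for (int i = 0; i < fastTasks; i++) {
            tasks.add(() -> true); // healthy servers answer immediately
        }

        try {
            // invokeAll cancels every task that has not completed when the timeout expires.
            List<Future<Boolean>> results =
                    pool.invokeAll(tasks, intervalMillis, TimeUnit.MILLISECONDS);
            int completed = 0;
            for (Future<Boolean> f : results) {
                if (!f.isCancelled()) {
                    completed++;
                }
            }
            return completed;
        } finally {
            pool.shutdownNow(); // interrupt any thread still stuck on a hung task
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two threads, two hung tasks: the fast tasks never get a thread.
        System.out.println(collect(2, 2, 4, 300)); // prints 0
        // One thread more than the number of hung tasks: the fast tasks complete.
        System.out.println(collect(3, 2, 4, 300)); // prints 4
    }
}
```

With as many threads as hung tasks, nothing completes, which matches the "Failed to collect 98 of 98 samples ... with 2 threads" symptom. A single spare thread is enough for the healthy tasks to drain, which is the idea behind forcing at least (num beans collecting per server + 1) threads.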

In your log messages, did the number of threads being used ever rise above 2? I'm sorta miffed that it would fail to read any samples and keep the thread count there unless you have a MaxThreads set.

Mar 09 '17 14:03 bvarner

Sorry for the late response. I'm Chris's teammate.

> In your log messages, did the number of threads being used ever rise above 2? I'm sorta miffed that it would fail to read any samples and keep the thread count there unless you have a MaxThreads set.

Most of the time it's either 1 or 2 threads, but a few times it rose above 30, with a maximum of 233:

FastJMX Plugin: Failed to collect 90 of 90 samples within read interval with 233 threads

> The 'dirty' way to fix this would be to introduce a 'minThreads' setting, so that you could force at least (num beans collecting per server + 1) threads in the pool, which would give you some metrics.

Will try this approach.

Mar 21 '17 14:03 Manikandan-K

Thanks for the feedback, @Manikandan-K. Any luck with the 'minThreads' approach?

Apr 05 '17 14:04 bvarner
