Broader support for custom instrumentation/metrics
We have https://github.com/elastic/kibana/pull/123241 open from @chrisronline which proposes adding some new machinery to kibana to collect per-plugin metrics (to then be collected by metricbeat and viewed in stack monitoring).
I think we could publish similar metrics using apm.registerMetric which would avoid the need to modify metricbeat and elasticsearch mappings every time instrumentation changes.
One downside though is that the current implementation only really handles gauge values. It'd be good if it handled at least counters, similar to the micrometer integration in the java agent (cc @felixbarny ), or the prometheus integration in the golang agent.
There's probably a wider discussion to be had here between teams, but leveraging apm agent as a cross-runtime metric delivery mechanism was a great move for our internal cloud services (java, golang, python).
I think we could do the same for kibana if the node agent had similar support.
One issue that has come up with @chrisronline using counters (by cheating and using the internal/private API):
apm._metrics.getOrCreateCounter(name[, dimensions], callback)
is an unfortunate interaction with changes made in https://github.com/elastic/apm-agent-nodejs/pull/2163 that results in the counter metrics getting removed from the metrics registry if there is a metricsInterval where it doesn't increment. The issue is that (a) all "Counter" metrics are reset each metricsInterval (https://github.com/elastic/apm-agent-nodejs/blob/1ce50f32182707ca0992f6fdceea8c0471a044e3/lib/metrics/reporter.js#L50-L52) and (b) the stale checking in the PR above removes any counting metrics (of which "Counter" metrics are one) if their value is zero.
The workaround is to instead use apm._metrics.incrementCounter(name[, dimensions][, amount = 1]).
I think the solution used in #2163 should be revisited. Some ideas:
- Perhaps implicit metrics (i.e. the internal breakdown metrics added for multiple routes) could be separately considered for automatic removal if they look stale, but user-added custom metrics would not be automatically removed (or at least it would take a lot more to remove them).
- Perhaps automatic metric removal should only happen after a number of metricIntervals with no value updates and perhaps only if the total number of metrics is over some threshold. I.e. this automatic removal becomes a defence against cardinality issue only.
Thanks for writing this up.
Why do we reset counters on each reporting interval? A counter should only reset on instance reset or something, right?
Why do we reset counters on each reporting interval?
That predates me. I (ass)(u)(me) that is by design -- i.e. that the contract for metrics between APM agent and APM server is to reset counters on each reporting interval. It could also be a bug in the Node.js APM agent, but one that doesn't impact the limited use the Node.js APM agent has for counters (it uses them for breakdown metrics only, I think).
Fun to hear you've been looking at it. Would love to hear more!