Gauge re-registration warning can confuse users
Filing this issue after a conversation with @jonatan-ivanov in https://github.com/quarkusio/quarkus/issues/47504 and https://github.com/vert-x3/vertx-micrometer-metrics/issues/259
In Vert.x (and Quarkus), we use gauges to monitor things like pending messages in a message bus queue, active network connections, or active WebSockets. The need is similar to a counter's, except that the value can also decrease (internally, we use a LongAdder).
Because our users can enable/disable some tags in configuration, or customize meter ids with filters, we must be careful not to increment/decrement the wrong value.
For this reason we've built a small utility in Vert.x that allows a gauge to be registered as many times as needed while guaranteeing that we always get back the corresponding LongAdder.
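The idea can be sketched with plain JDK classes. This is not the actual Vert.x code (class and method names here are hypothetical); the real utility keys on Micrometer's Meter.Id and registers the adder with Gauge.builder, but the core mechanism is the same: every lookup with the same name/tags returns the same LongAdder.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical, JDK-only sketch of the Vert.x "LongGauges" idea: every call
// with the same name/tags key returns the SAME LongAdder, so increments and
// decrements always hit the right value no matter how many times the gauge
// is looked up. The real utility keys on a Micrometer Meter.Id and registers
// the adder with Gauge.builder(name, adder, LongAdder::doubleValue).
class LongGauges {

    private final Map<String, LongAdder> adders = new ConcurrentHashMap<>();

    // Returns the existing adder for this name/tags combination,
    // creating it atomically on first use.
    LongAdder getOrCreate(String name, String... tags) {
        String key = name + '|' + String.join(",", tags);
        return adders.computeIfAbsent(key, k -> new LongAdder());
    }
}
```

With this in place, instrumentation code can call `getOrCreate` freely at every increment/decrement site without worrying about which call performed the registration.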
I've put together a small reproducer that includes the utility in a project on GitHub: https://github.com/tsegismont/multiple-gauge-registration-warning
The problem is that, after https://github.com/micrometer-metrics/micrometer/pull/5617, a warning is logged when a gauge is re-registered, and it confuses some of our users (see the original reports in Vert.x and Quarkus).
I understand the rationale for this warning (as discussed in https://github.com/micrometer-metrics/micrometer/issues/5616) but I believe our use-case is valid. So, I'm starting this discussion in the hope of finding a solution.
For obvious reasons, I'd rather not tell users to ignore the warning or to raise the log level.
I can see two options:
- make it possible to mark a gauge as safe for multiple lookups in the builder, and make it possible to retrieve the monitored value
- introduce a new meter, a hybrid between gauge and counter, that allows incrementing and decrementing values.
What do you think about this? Do you have other ideas?
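To make the second option concrete, here is a JDK-only sketch of the surface such a hybrid meter could expose. This is NOT an existing Micrometer API; the name UpDownCounter and the whole shape are hypothetical, and a real meter would additionally need builder and registry integration.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical shape for a "hybrid" meter: counter-like, but allowed to
// decrease. Not a Micrometer API; a registry would report value() as a
// gauge-like measurement.
class UpDownCounter {

    private final LongAdder adder = new LongAdder();

    void increment()             { adder.increment(); }

    void increment(long amount)  { adder.add(amount); }

    void decrement()             { adder.decrement(); }

    // Current value, suitable as a gauge measurement.
    double value()               { return adder.doubleValue(); }
}
```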
I quickly looked into your reproducer and it seems you are tracking the number of active tasks (processed/pending messages). We have a component for this use-case: it's called LongTaskTimer (ActiveTaskTimer could have been a better name).
I modified your reproducer a little and used LongTaskTimer instead of Gauge: https://github.com/tsegismont/multiple-gauge-registration-warning/compare/master...jonatan-ivanov:micrometer-gh-6931-multiple-gauge-registration-warning:longtask-timer
I made the SPI context-aware but you can use something similar to LongGauges to get the LongTaskTimer.Sample instances. The output is a bit richer: in addition to the number of pending tasks, you also get durations:
```
messagePending(LONG_TASK_TIMER)[address='foo']; active_tasks=1.0, duration=0.003666584 seconds, max=0.003694084 seconds
```
Thanks @jonatan-ivanov for your prompt reply.
To be honest, I had not considered LongTaskTimer because of the name (and the implied timer nature). But it seems this could do the job for the pending/active use-cases, at the expense of some metrics SPI changes on our side.
Beyond the metrics SPI changes, I can see a couple of issues:
1/ users' dashboards may be broken because the metric names are going to change (the value measurement is renamed to active)
2/ we have a gauge that we create as a derivative of another gauge: the pool usage ratio (dividing the number of pooled resources in use by the max pool size)
3/ it wouldn't work when you need to control the increment amount, or when the value is never decremented
Yes, the name does not reflect all the use-cases but we cannot change it unfortunately.
1/ You are right, this is a breaking change for the dashboards/alerts/everything that depends on those time-series.
2-3/ Not sure I understand this, could you please show me an example?
I took another look and I think I was able to implement this using a MultiGauge without breaking the SPI or the output, could you please take a look?
https://github.com/tsegismont/multiple-gauge-registration-warning/compare/master...jonatan-ivanov:micrometer-gh-6931-multiple-gauge-registration-warning:multigauge
> Yes, the name does not reflect all the use-cases but we cannot change it unfortunately.
That's unfortunate but I understand: this happens more often than not when we developers name things 😉
> 1/ You are right, this is a breaking change for the dashboards/alerts/everything that depends on those time-series.
Thanks for confirming.
> 2-3/ Not sure I understand this, could you please show me an example?
Sure. We currently have a gauge that reports the number of used connections in the pool. When the maximum pool size is known, we also create a gauge that reports the pool usage ratio (computed as the number of used connections divided by the maximum pool size). If the gauge that reports the number of used connections is transformed into a LongTaskTimer, I'm not sure we could compute such a derivative (ratio).
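The dependency between the two gauges can be modeled with plain JDK types (identifiers here are illustrative, not Vert.x code): the ratio gauge's value function reads the same LongAdder that backs the in-use-connections gauge. With a LongTaskTimer there is no LongAdder to read, so this derivation would no longer be possible.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.DoubleSupplier;

// JDK-only model of the derived gauge described above (names illustrative).
// The usage-ratio gauge samples the same LongAdder that backs the
// "connections in use" gauge and divides by the max pool size.
class PoolMetrics {

    final LongAdder inUse = new LongAdder(); // backs the "connections in use" gauge
    final int maxPoolSize;

    PoolMetrics(int maxPoolSize) {
        this.maxPoolSize = maxPoolSize;
    }

    // Value function of the derived "pool usage ratio" gauge.
    DoubleSupplier usageRatio() {
        return () -> inUse.doubleValue() / maxPoolSize;
    }
}
```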
> I took another look and I think I was able to implement this using a MultiGauge without breaking the SPI or the output, could you please take a look? https://github.com/tsegismont/multiple-gauge-registration-warning/compare/master...jonatan-ivanov:micrometer-gh-6931-multiple-gauge-registration-warning:multigauge
It wouldn't work because Micrometer would publish one of the values stored, instead of aggregating the values.
In our small reproducer, if the address tag is ignored, we want the pending gauge to combine increments and decrements (backed by a single LongAdder). I've updated your branch to show the problem:
https://github.com/tsegismont/multiple-gauge-registration-warning/commit/63a7267c74b6617930efd5476aea57acc63de2db
I would expect the program to print `messagePending(GAUGE)[]; value=5.0 messages` at some point, but it doesn't.
Also, it still prints the warning, but I'm not sure why.
```
WARNING: This Gauge has been already registered (MeterId{name='messagePending', tags=[]}), the registration will be ignored. Note that subsequent logs will be logged at debug level
```
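The aggregation requirement can be made concrete with a small JDK-only model (class and method names are illustrative, not Vert.x code). Dropping the address tag collapses both registrations onto one key, so all updates must combine into a single backing value; a gauge stored per call site would instead publish only one of the stored values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// JDK-only model of why a single shared LongAdder matters when a tag is
// ignored: with the 'address' tag dropped, every address maps to the same
// key, so increments and decrements from all addresses combine.
class TagAwareGauges {

    private final Map<String, LongAdder> adders = new ConcurrentHashMap<>();
    private final boolean ignoreAddressTag;

    TagAwareGauges(boolean ignoreAddressTag) {
        this.ignoreAddressTag = ignoreAddressTag;
    }

    LongAdder pending(String address) {
        String key = ignoreAddressTag
                ? "messagePending"
                : "messagePending|address=" + address;
        return adders.computeIfAbsent(key, k -> new LongAdder());
    }
}
```

When the tag is ignored, updates for `foo` and `bar` land on the same adder; when it is kept, each address gets its own value, exactly as the tags demand.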
I slept on it and I also realized that the MultiGauge solution does not work. :(
I came up with another workaround using Meters, could you please take a look?
https://github.com/tsegismont/multiple-gauge-registration-warning/compare/master...jonatan-ivanov:micrometer-gh-6931-multiple-gauge-registration-warning:meter-based
We are going to look into how to properly solve this (most probably with a new API addition).
> I slept on it and I also realized that the MultiGauge solution does not work. :(
No worries
> I came up with another workaround using Meters, could you please take a look? https://github.com/tsegismont/multiple-gauge-registration-warning/compare/master...jonatan-ivanov:micrometer-gh-6931-multiple-gauge-registration-warning:meter-based
Thanks, I gave it a try and it looks promising. I believe it could cover all the cases discussed above; I'll check this and come back to you.
In general, would you recommend against a custom meter? Can you see any downsides?
As far as I understand, the regular Gauge provides the same iteration of Measurements, so can we expect the behavior to be the same? I'm thinking about features like percentiles.
> We are going to look into how to properly solve this (most probably with a new API addition).
OK, what do you have in mind?
Thanks for your support, it's much appreciated.
I tried the custom gauge solution in my fork of Vert.x Micrometer Metrics: https://github.com/tsegismont/vertx-micrometer-metrics/pull/3
On the upside, we can get rid of our internal LongGauges cache and the impact on the existing code is fairly limited.
On the downside, this is still a breaking change for users since the Prometheus registry creates metrics with different names. For example, vertx_http_client_active_connections_value instead of vertx_http_client_active_connections.
I can see where this happens in the code, but I don't know why there's a difference.
> In general, would you recommend against a custom meter? Can you see any downsides?
Usually, you should be able to solve your use-cases without using a custom meter, but it is there for unusual use-cases; there is nothing wrong with using it when needed.
> OK, what do you have in mind?
Multiple things, but we need to investigate them; we will let you know, hoping you can give us some feedback once we dig deeper.
> On the downside, this is still a breaking change for users since the Prometheus registry creates metrics with different names. For example, vertx_http_client_active_connections_value instead of vertx_http_client_active_connections. I can see where this happens in the code, but I don't know why there's a difference.
Oh, I completely missed this. Yes, we add the _value postfix for the Prometheus registry:
https://github.com/micrometer-metrics/micrometer/blob/fdc555324e8db34c4db398527f761ac72c758b50/implementations/micrometer-registry-prometheus/src/main/java/io/micrometer/prometheusmetrics/PrometheusMeterRegistry.java#L412-L416
Shortly: we need to do this to prevent conflicts between time-series in the Prometheus Client.
Not so shortly: we need to do this because of the Prometheus Client and to comply with the Prometheus/OpenMetrics specs. In Prometheus, time-series with the same name are part of the same metric family, and the metric family needs to have a single type. So if you have multiple measurements, one a counter and the other a gauge, they either need different names so they can have different types (counter and gauge), or they can share the same name but must share the same type (both the gauge and the counter need to be a gauge). If not, the Prometheus Client will not be happy and will throw an exception.
There is an extra twist: counters are suffixed with _total by the Prometheus Client, but the name of the family in the client is the name without the suffix. The _total suffix is only added when the time-series is exported; that's why the code above does not add a suffix to the counter, and that's why a counter named test_total conflicts with a gauge named test in the client, so the gauge needs a different suffix. Without this behavior of the client, we could have gotten away without the _value suffix, but since counters have an implicit suffix we needed to add an explicit one to everything else (that is not a counter).
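The naming rule described above can be illustrated with a toy model (this is NOT Prometheus client code; the class and the rule encoding are purely illustrative): a counter's family keeps the bare name and only gains _total at export time, so a gauge sharing that bare name would put two types into one family, which is why the gauge gets an explicit _value suffix.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not Prometheus client code) of the naming rule explained above.
// A family must have exactly one type; counters live under their bare name
// and only gain "_total" when exported, so gauges get "_value" to avoid
// landing in the same family as a counter with the same name.
class MetricFamilies {

    private final Map<String, String> typeByFamily = new HashMap<>();

    // Registers a metric and returns the exported time-series name.
    String register(String name, String type) {
        String family = type.equals("gauge") ? name + "_value" : name;
        String previous = typeByFamily.putIfAbsent(family, type);
        if (previous != null && !previous.equals(type)) {
            // Two types in one family: this is what the Prometheus Client rejects.
            throw new IllegalStateException("family " + family + " already has type " + previous);
        }
        return type.equals("counter") ? family + "_total" : family;
    }
}
```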