
Metrics are reported via kamon-prometheus even after they are removed

Open · andreas-schroeder opened this issue 6 years ago • 7 comments

We've observed that removing tag-refined metrics does not remove them from the accumulated snapshot exposed by the Prometheus reporter. The more dynamic the metric tags are, the more this eventually leads to very large Prometheus response bodies containing large numbers of "stale" metrics. Consequently, this also artificially inflates the number of metrics on the Prometheus server.

I have the impression that metric removal is not fully honored. Interestingly, #555 claims that the inverse should be ensured: the metric should be reported one last time before it is scraped, and only then removed.

The problem seems to be that PrometheusReporter uses a PeriodSnapshotAccumulator with a duration of five years in order to maintain counter increments and histogram counts. The accumulator contains no logic for reacting to a metric being removed; the only way metrics can be removed is by clearing all of them once the "tick" has passed (which in this case would take five years) and starting to build the next accumulated snapshot. Reducing the duration parameter would fix the issue, but it would also lose histogram and counter increments, namely those recorded between the last Prometheus scrape and the internal metric clearing.
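To make the mechanism concrete, here is a minimal sketch (illustrative names, not Kamon's actual code) of an accumulator behaving as described: incoming snapshots are merged into a cache keyed by metric identity, and entries are only ever dropped wholesale when the tick boundary is crossed.

```scala
import java.time.{Duration, Instant}

// Illustrative sketch, not Kamon's actual code: with a five-year period,
// the clearing branch effectively never runs, so entries for metrics that
// were removed at the source linger in the cache indefinitely.
class AccumulatorSketch(period: Duration) {
  private var cache    = Map.empty[String, Long] // metric name + tags -> value
  private var nextTick = nextAligned(Instant.now(), period)

  def add(snapshot: Map[String, Long], at: Instant): Map[String, Long] = {
    if (at.isAfter(nextTick)) { // the only path that drops entries
      cache = Map.empty
      nextTick = nextAligned(at, period)
    }
    snapshot.foreach { case (key, value) =>
      cache = cache.updated(key, cache.getOrElse(key, 0L) + value)
    }
    cache // removed metrics stay in here until the tick fires
  }

  private def nextAligned(from: Instant, d: Duration): Instant =
    Instant.ofEpochMilli((from.toEpochMilli / d.toMillis + 1) * d.toMillis)
}
```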

kamon-core version: 1.1.3
kamon-prometheus version: 1.0.0 (though the relevant PeriodSnapshotAccumulator comes from kamon-core)

andreas-schroeder avatar Jan 22 '19 12:01 andreas-schroeder

We use kamon-prometheus (v1.1.1) with Akka Cluster. Not removing metrics results in the same metric being exposed from more than one cluster node; this happens when an Akka sharding actor is moved between cluster nodes. I checked the code of the kamon-prometheus project: the PrometheusReporter class uses PeriodSnapshot.Accumulator (PeriodSnapshotAccumulator in version 1.1), whose add() method updates the metrics maps based on the passed period snapshot (periodSnapshot). The problem is that this method doesn't remove metrics from the cache when they are not present in the new period snapshot.

I added such a removal and tested it. It seems to work well.
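For illustration, the kind of eviction meant here might look like this (a simplified sketch over a plain map, not the actual patch):

```scala
// Simplified sketch of the proposed fix, not the actual patch: after merging
// the incoming period snapshot into the cache, retain only the series that
// still appear in it, so removed metrics stop being exposed.
def addAndEvict(cache: Map[String, Long], snapshot: Map[String, Long]): Map[String, Long] = {
  val merged = snapshot.foldLeft(cache) { case (acc, (key, value)) =>
    acc.updated(key, acc.getOrElse(key, 0L) + value)
  }
  merged.filter { case (key, _) => snapshot.contains(key) } // evict stale series
}
```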

@ivantopo What do you think about this approach? Is it a good idea to evict metrics from the cache when they are not present in the new period snapshot?

golojg avatar Oct 04 '19 08:10 golojg

kamon-bundle version: 2.0.3
kamon-prometheus version: 2.0.0

Because PrometheusReporter uses a PeriodSnapshot.Accumulator with a duration of 5 * 365 days, today our server does not expose any metrics at all.

This line of code clears all metrics: _nextTick, calculated by Clock.nextAlignedInstant, is 2024-12-18T00:00:00Z, so all metrics were regarded as stale entries. https://github.com/kamon-io/Kamon/blob/e093cabc6685ee62595c5d912f126b97590523a6/kamon-core/src/main/scala/kamon/metric/PeriodSnapshot.scala#L141


```scala
Clock.nextAlignedInstant(Instant.now, Duration.ofDays(365 * 5))
```

The code above returned different values yesterday (2019-12-19) and today (2019-12-20):

  • Yesterday: 2019-12-20T00:00:00Z
  • Today: 2024-12-18T00:00:00Z
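The jump is explained by epoch alignment: 2019-12-20T00:00:00Z lies exactly 18250 days = 10 × 1825 days after the Unix epoch, so it is itself a 5 × 365-day boundary. A minimal sketch of the rounding involved (assumption: this mirrors the behaviour of Kamon's Clock.nextAlignedInstant, rounding up to the next multiple of the period counted from the epoch):

```scala
import java.time.{Duration, Instant}

// Assumed behaviour of Clock.nextAlignedInstant: round `from` up to the
// next multiple of `period`, counting from the Unix epoch.
def nextAlignedInstant(from: Instant, period: Duration): Instant = {
  val periodMillis = period.toMillis
  Instant.ofEpochMilli((from.toEpochMilli / periodMillis + 1) * periodMillis)
}

val fiveYears = Duration.ofDays(365 * 5) // 1825 days

println(nextAlignedInstant(Instant.parse("2019-12-19T08:00:00Z"), fiveYears))
// 2019-12-20T00:00:00Z -- the boundary fell at that midnight, firing the tick
println(nextAlignedInstant(Instant.parse("2019-12-20T08:00:00Z"), fiveYears))
// 2024-12-18T00:00:00Z -- the next boundary, five years away
```

Once the tick fired at that midnight, the accumulator treated every entry as stale and cleared it, which is why no metrics were exposed afterwards.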

invzhi avatar Dec 20 '19 08:12 invzhi

We currently see the same behaviour: no metrics have been reported since midnight today.

msiuts avatar Dec 20 '19 10:12 msiuts

2.0.1 with a fix has been published to Bintray (still syncing to Maven Central): https://bintray.com/kamon-io/releases/kamon-prometheus/2.0.1

mladens avatar Dec 20 '19 16:12 mladens

We still encounter the issue on 2.0.1: the instrument is removed during snapshotting, but the Prometheus reporter still exposes it indefinitely.

If the instrument is a counter and is recreated, it restarts from 0, yet the Prometheus reporter aggregates the new counter's value with its previous state; the end result is equivalent to the counter never having been removed.
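A toy model of the merge makes the effect concrete (illustrative names, not Kamon's API):

```scala
// Counter deltas are added onto whatever the accumulator already holds
// for that series, so a remove-and-recreate cycle is invisible to it.
var accumulated = Map.empty[String, Long]

def add(snapshot: Map[String, Long]): Unit =
  snapshot.foreach { case (key, delta) =>
    accumulated = accumulated.updated(key, accumulated.getOrElse(key, 0L) + delta)
  }

add(Map("""requests{instance="a"}""" -> 5L)) // counter reaches 5, then is removed
add(Map("""requests{instance="a"}""" -> 3L)) // recreated counter restarts, reaches 3
println(accumulated("""requests{instance="a"}""")) // 8 -- as if never removed
```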

cosminci avatar Mar 24 '20 15:03 cosminci

I have the same issue that @golojg has. Using kamon 2.1.3 with prometheus and default settings.

We have a cluster of N nodes (N > 10), three of which run sharding.

At a certain point in time, we had about 350 entities across all regions, and kamon-prometheus was reporting this correctly (as you can see from the graph below, until ~14:00 on 08/14). However, when one of the nodes crashed (around 14:00) and our DowningProvider downed/removed it from the cluster, the metrics reported after the new node had joined were 350 + 129. The 129 comes from the node that went down (it's probably the last known value, I guess?) and seems to have been distributed among all shards in all regions. I then called the /cluster/shards/{name} endpoint on all the nodes with sharding and the sum is 350, which means that the number of active entities reported by Akka doesn't match the one reported by Prometheus.

[Screenshot 2020-08-18 at 10:10:07: graph of the reported number of active entities]

@ivantopo do you have any idea why this happens? I checked the code and I can't understand why it would. We have "rememberEntities=off" and "automatic-passivation=5 minutes", so this should not happen, even after a rebalance.

mbuccini avatar Aug 18 '20 08:08 mbuccini

Are there any updates on this issue? We are using Akka with kanela-agent 1.0.11, kamon-akka 2.2.0, and kamon-prometheus 2.2.0. In our setup we have an application with an active instance and an inactive instance, and we tag all metrics as such so we can differentiate. But if we dynamically change the active instance, the metrics with the old tags are still reported, leading to duplicated metrics. We call remove either on the metric (passing the tags) or on the instrument, and in both cases it returns true, yet the metric is still there in the endpoint. Are we doing something wrong? Is there another way of preventing an already existing instrument from being exposed in the prometheus endpoint?

I see that #574 claims to solve this issue, so I am not sure if I am doing something wrong. But at the same time it claims to solve #555, which to me seems paradoxical. I wonder if it makes sense to differentiate between "closing" an instrument (i.e. it will no longer be updated but still has to be reported at least once) and "removing" it (i.e. no need to report it any more); a hypothetical sketch of this distinction follows below. Another idea might be to define the expected behavior when removing a metric via configuration (allowing individual overrides for specific metrics) and use an enriched version of the MetricOverrideReporter.

I am thinking of proposing a PR to address this (still not clear how to implement it, I have some ideas), but I am probably missing some context, so any advice is more than welcome. Thank you
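One hypothetical shape the "closing" vs "removing" distinction could take (not existing Kamon code):

```scala
// Hypothetical instrument lifecycle, not existing Kamon code.
sealed trait InstrumentState
case object Active  extends InstrumentState // updated and reported as usual
case object Closed  extends InstrumentState // frozen, but reported at least once more
case object Removed extends InstrumentState // dropped from the next exposition

// A reporter could then decide per series whether to keep exposing it:
def shouldExpose(state: InstrumentState, reportedSinceClose: Boolean): Boolean =
  state match {
    case Active  => true
    case Closed  => !reportedSinceClose // #555's semantics: one last scrape, then drop
    case Removed => false               // plain removal: stop exposing immediately
  }
```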

sgabalda avatar Aug 16 '21 07:08 sgabalda