micrometer icon indicating copy to clipboard operation
micrometer copied to clipboard

Unregister meters whose weak reference has been broken

Open mprimi opened this issue 6 years ago • 22 comments

I'm like to report something that is slightly counter-intuitive. What i am suggesting is either a change of behavior or explicit documentation on the current one.

The behavior:

  • a meter is created with a weak reference object
  • After the referenced object is GC'd, the corresponding meter is retained forever
  • This meter is effectively useless, it won't ever publish a value again (because it's object is gone forever)

Keeping a useless object in memory seems not great. But there's actually an even more subtle consequence (which I find undesirable).

Here's an example use case where this behavior can create counter-intuitive behavior.

I'm publishing a metric tagged with a username to track per-user resource utilization. The number of active users is small, and I'd like to avoid publishing millions of zeros for all inactive users.

When a user is active, I register a gauge like so:

  final AtomicLong userMemoryAmount = registry.gauge(
    "user-memory", 
     Sets.newHashSet(Tag.of("user", user)), 
     new AtomicLong(0)
  );

This gauge has a weak reference to userMemoryAmount, which is great. I keep an external strong reference to userMemoryAmount and update it.

Later, when the user goes away, I will remove the strong reference to userMemoryAmount, and eventually this metric will stop publishing for this user, since userMemoryAmount is GC'd.

So far so good. This is the behavior I want. The measured object goes away, the metric (eventually) goes away. 👍

Here's where I see a problem.

The same user comes back 2 days later. I don't have an AtomicLong userMemoryAmount tracking that user, so i create a new one:

  final AtomicLong userMemoryAmount = registry.gauge(
    "user-memory", 
     Sets.newHashSet(Tag.of("user", user)), 
     new AtomicLong(0)
  );

Same code as above.

What i expect: a new gauge that references my newly created userMemoryAmount (the existing meter would also be acceptable, if its reference was still valid). What i get instead: the old gauge with a broken weak reference to the old userMemoryAmount. This is effectively garbage that will never collected.

--

Maybe this behavior is by design and working exactly as expected, but once again i would like to call out that, to my understanding:

  • The meters is kept forever, even if it is useless
  • The exsisting meter is getting in the way of creating a new meter with the same ID

The API and/or documentation could be a bit more explicit about this behavior.

mprimi avatar Jun 07 '19 21:06 mprimi

Thank you for the detail explanation of the use-case you have and the issue you are facing. As a (possibly long-term) workaround, have you tried using the remove method to remove the Gauge instead of dereferencing the gauge's value object? See #479 Separate from that we can consider your request also, but I just wanted to point that out for now.

shakuzen avatar Jun 14 '19 09:06 shakuzen

Hello @shakuzen I have tried using remove, however it doesn't seem to interact well with our backing implementation (Spectator) and the following happens:

  • Create gauge g for user u
  • Update the value read by g as long as u is active
  • When u goes away, do registry.remove(g)

This works as expected, I see (e.g., in Atlas), that g had the expected values and then disappears. 👍

Let's say the last value published for g was 10. And let's say later user u comes back sometimes later (minutes, hours).

  • I create gauge g' for user u (g and g' have the same meter id)
  • g' starts reading my value and publishing

Problem is: all values I see for u now are offset by 10! (10 the last value published by g).

From what I can tell, Micrometer removes g, but Spectator does not! It keeps g's last value around and adds it to g' before publishing to Atlas!

If the last value read by g' is now 5 for example, then u goes away and comes back, then all its value will be offset by 15. And so on.

mprimi avatar Jun 14 '19 18:06 mprimi

I just stumbled upon a similar issue (related to #2642). We have Caffeine Cache instances that are long-lived, but in some cases, the instances are re-created during the runtime of the application. At the moment, there's no place to hook in a remove() call that would clean up the reference either, because the class that creates the cache instance is not (at the moment) controlled by any DI container like Spring.

For cases where the application registers a gauge or counter, I guess it would be possible to always and proactively remove any offending remainings in the MeterRegistry before trying to register a gauge/counter, but for complex metrics like caches it would mean we need to duplicate all the implementation details about which counters/gauges it registers.

How would this even be solved? Would the Meter interface know about the fact that the implementations can have weakly referenced objects inside them and a method for checking if the object is dead or not?

slovdahl avatar Jun 22 '21 11:06 slovdahl

If someone can provide guidance on what the best way of solving this would be, I'd be willing to spend some time on getting this fixed.

slovdahl avatar Jul 06 '21 17:07 slovdahl

From what I can tell, Micrometer removes g, but Spectator does not!

Sorry for the incredibly late response. @mprimi that looks like a bug in our AtlasMeterRegistry (assuming that's what you're using) which should probably be removing from the underlying AtlasRegistry when removed from Micrometer's registry. Would you open a separate issue for that please?

shakuzen avatar Aug 25 '21 07:08 shakuzen

How would this even be solved? Would the Meter interface know about the fact that the implementations can have weakly referenced objects inside them and a method for checking if the object is dead or not?

Only Gauge and Function* type meters take a backing object that could be garbage collected. These types already detect when the backing object is no longer available. But meters don't have a reference to registries. If these types exposed API to express whether their backing object was still live, perhaps registries could check this when publishing.

shakuzen avatar Aug 25 '21 07:08 shakuzen

These types already detect when the backing object is no longer available.

Yep, that's true, but even if the backing object is garbage collected, the reference to the actual gauge/function is still kept by Micrometer, which means that a new gauge/function with the same name and set of tags can't be registered.

If these types exposed API to express whether their backing object was still live, perhaps registries could check this when publishing.

:+1:

Do you have any pointers where this "registries could check" part could/would be done?

slovdahl avatar Aug 25 '21 10:08 slovdahl

Do you have any pointers where this "registries could check" part could/would be done?

Other than in each registry's implementation of the publish method, not really. Figuring that out is the hard design work of this issue.

shakuzen avatar Aug 26 '21 06:08 shakuzen

Other than in each registry's implementation of the publish method, not really. Figuring that out is the hard design work of this issue.

Ok :+1:

A little bit related but still a side note. In the specific CacheMeterBinder case, it could also work to proactively remove any existing gauges/function counters before the new one is registered, wouldn't it? That is, first check if there's already a gauge with the same set of tags registered, delete it, and then after that add the new gauge.

slovdahl avatar Aug 30 '21 08:08 slovdahl

it could also work to proactively remove any existing gauges/function counters before the new one is registered, wouldn't it?

Maybe, but we don't do that anywhere currently. As I mentioned in https://github.com/micrometer-metrics/micrometer/issues/2719#issuecomment-906241352 that leads to another problem of things effectively resetting that shouldn't reset while an application is running. In practice, it might not be a problem, though. I suppose it doesn't matter for registries that publish step-normalized values, and for registries that report cumulative values, it is expected the metrics backend knows how to handle resets because it will happen when an application restarts.

shakuzen avatar Aug 30 '21 08:08 shakuzen

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

github-actions[bot] avatar Dec 30 '23 01:12 github-actions[bot]

The issue is still valid.

slovdahl avatar Jan 02 '24 06:01 slovdahl

This issue is also affecting us. At the moment I have put a facade in top of micrometer and there I am keeping my own ConcurrentHashMap<Meter.Id, WeakReference<ObservedType>> and the registration of Gauges passes through it and, if the WeakReference is empty, I manually deregister the metric via ID from Micrometer and then register the new one (this all happens in a single compute call).

This, however, is not optimal because I still have to pipe everything through my own facade on top of what is effectively also a facade.

I know the Gauge checks itself whether the observed object is still alive. When it sees it isn't, it's likely easy to safely deregister it from the meter registry. It might take some work to figure out due to all the locking going on, but it should be straightforward enough.

andrebrait avatar Sep 10 '25 15:09 andrebrait

I know the Gauge checks itself whether the observed object is still alive. When it sees it isn't, it's likely easy to safely deregister it from the meter registry. It might take some work to figure out due to all the locking going on, but it should be straightforward enough.

Might be in a use-case but something is not being available can give you just as much information as having a value. A metric not being present at all usually has a different meaning than a metric is being there but its value is NaN and others can depend on this behavior. Maybe that behavior also reflects reality better: something was there but now it isn't.

Not sure what your exact use-case is but maybe you can utilize MultiGauge that keeps references to your registered Gauges for you and also "auto-removes" them, see this example:

SimpleMeterRegistry registry = new SimpleMeterRegistry();
MultiGauge temperatures = MultiGauge.builder("temperatures").baseUnit("celsius").register(registry);

temperatures.register(
    Arrays.asList(
        Row.of(Tags.of("room", "kitchen"), 24),
        Row.of(Tags.of("room", "bedroom"), 22)
    ),
    true
);

System.out.println("---\n" + registry.getMetersAsString());
temperatures.register(Arrays.asList(Row.of(Tags.of("room", "kitchen"), 23)), true);
System.out.println("---\n" + registry.getMetersAsString());

The second register call updates the value of kitchen (24 -> 23) and removes the Gauge for bedroom since it is not provided anymore:

---
temperatures(GAUGE)[room='kitchen']; value=24.0 celsius
temperatures(GAUGE)[room='bedroom']; value=22.0 celsius
---
temperatures(GAUGE)[room='kitchen']; value=23.0 celsius

jonatan-ivanov avatar Sep 10 '25 22:09 jonatan-ivanov

In this case, the only information you could possibly get is that the object has been garbage collected.

Although arguably useful, I'd guess it makes more sense to just drop the metric. It already does that when it comes to output, IIRC. The only thing missing is allowing you to register the metric again with a new observed object.

When creating a Gauge purely with a function and that function starts returning null it does output as NaN; but when registering it as an object + a double function, the object being GC'd just causes the metric it to go away, IIRC.

Gonna confirm that with Prometheus (which supports NaNs).

andrebrait avatar Sep 11 '25 01:09 andrebrait

Exactly, you get the information that something was there previously and now it's gone (which might be a mistake). Maybe dropping the metric in this case makes sense to you but there are other users/use-cases where it makes sense to distinguish between something that has never been there or something that was there but now it is not. This is intentional in Micrometer.

When creating a Gauge purely with a function and that function starts returning null it does output as NaN; but when registering it as an object + a double function, the object being GC'd just causes the metric it to go away, IIRC.

I'm not sure what the context here is, Micrometer does not behave like this, and I thought that's what your issue is. You can try it out:

SimpleMeterRegistry registry = new SimpleMeterRegistry();
registry.gauge("test", 1.0, state -> state);
System.out.println(registry.getMetersAsString());
System.gc();
Thread.sleep(1_000);
System.out.println(registry.getMetersAsString());

the output should be:

test(GAUGE)[]; value=1.0
test(GAUGE)[]; value=NaN

Could you please tell us what your exact use-case is? It seems it is somewhat different than the original issue and maybe it can be solved by using a MultiGauge, see my previous comment above.

jonatan-ivanov avatar Sep 11 '25 21:09 jonatan-ivanov

@jonatan-ivanov I was about to update this issue with that. I tested it this afternoon and it works exactly as you stated. It does output NaN.

I guess it could be yet another configuration option, like the one that already exists for keeping a strong reference to the object? But instead it would be something to signal that the metric must be removed (instead of returning NaN) when the object becomes weak unreachable.

andrebrait avatar Sep 11 '25 21:09 andrebrait

It needs a bit of a design work since this might be more complicated than it seems but a config option can be a solution, another type of a meter can be too, or a special Gauge implementation.

What I would like to emphasize: in order to do anything, we will need your exact use-case. Could you please tell us what your exact use-case is? :)

jonatan-ivanov avatar Sep 11 '25 21:09 jonatan-ivanov

It needs a bit of a design work since this might be more complicated than it seems but a config option can be a solution, another type of a meter can be too, or a special Gauge implementation.

What I would like to emphasize: in order to do anything, we will need your exact use-case. Could you please tell us what your exact use-case is? :)

My use-case is perhaps a bit niche, I suppose, but here it goes:

Before stuff like Micrometer came along, someone at the company I work for had coded an "on-demand" wrapper around Prometheus simpleclient's Gauge. The wrapper takes a Supplier<Number> and, instead of registering a real Gauge right away, it hides that behind a custom Collector and returns a handle object for the user to keep and dispose of at will.

It also keeps a weak reference to that handle so it can dispose of the corresponding Gauge if that gets GC'd.

The custom Collector, when scraped, iterates over all its handles and checks whether they're alive. If so, it calls the function and adds then to the scraping results. Otherwise, it removes them from the list and moves on.

I'm now porting the entire thing over to Micrometer, because there was a lot of custom code built around it that does what Micrometer does (e.g. common labels, filters, etc.) and we already use a lot of instrumentation from Micrometer anyways (e.g. Kafka) so instead of maintaining multiple ways of doing metrics and worrying about consistency between them, I just chose to port it.

The port is completely done, except for some uncertainty regarding the behavior of Summaries/Histograms and time windows (we'll likely just validate that and account for the differences) and the fact the "on-demand" Gauge wrapper's behavior cannot be fully reproduced as the GC'd handles now result in NaN whereas before they're just be removed.

I have found my own workaround for that (e.g. Cleaner) but it's not ideal.

ps.: the codebase is humongous and the porting is mostly accomplished without any publicly visible changes and all differences between the two APIs is handled internally, so just changing how the on-demand Gauge works externally is a no go, at the moment.

andrebrait avatar Sep 12 '25 06:09 andrebrait

So since Prometheus Gauges are synchronous (you call set on them), a collector was created to make use them in an asynchronous way (gauge vale is pulled at scrape)? I guess since Micrometer Gauges are already async by nature you want all that custom code to work without most of the custom code.

What I would try to do instead as a first step is where you create and set the Gauge with the Prometheus client:

Gauge temperature = Gauge.builder()
    .name("temperature_celsius")
    .help("current temperature")
    .labelNames("location")
    .unit(Unit.CELSIUS)
    .register();
temperature.labelValues("Berlin").set(22.3);

You can do that in Micrometer too:

MultiGauge temperatures = MultiGauge.builder("temperature")
    .description("current temperature")
    .baseUnit("celsius")
    .register(meterRegistry);

temperatures.register(Collections.singletonList(MultiGauge.Row.of(Tags.of("location", "Berlin"), 22.3)), true);

This will not take care abut removal by default but since you already have the list of gauges/values, and if you remove that from your internal list and call MultiGauge register again without those elements, the gauge will be automatically removed for you by the MultiGauge, see: https://github.com/micrometer-metrics/micrometer/issues/1452#issuecomment-3276762440

jonatan-ivanov avatar Sep 12 '25 19:09 jonatan-ivanov

So since Prometheus Gauges are synchronous (you call set on them), a collector was created to make use them in an asynchronous way (gauge vale is pulled at scrape)? I guess since Micrometer Gauges are already async by nature you want all that custom code to work without most of the custom code.

What I would try to do instead as a first step is where you create and set the Gauge with the Prometheus client:

Gauge temperature = Gauge.builder() .name("temperature_celsius") .help("current temperature") .labelNames("location") .unit(Unit.CELSIUS) .register(); temperature.labelValues("Berlin").set(22.3); You can do that in Micrometer too:

MultiGauge temperatures = MultiGauge.builder("temperature") .description("current temperature") .baseUnit("celsius") .register(meterRegistry);

temperatures.register(Collections.singletonList(MultiGauge.Row.of(Tags.of("location", "Berlin"), 22.3)), true);

That's precisely what I did and yes, I got rid of most of the custom code that way.

However, since the code expects me to return a Handle the user can interact with (e.g. close it to remove the Gauge) but it also expects the Gauge to be removed if the Handle is GC'd, the removal on the reference being missing part was somewhat important.

Not so much because the user might re-add the same Gauge at a later time (I already have to keep track of the handles separately due to other reasons, so knowing whether the previous handle was GC'd isn't a problem for my specific case), but because the current behavior will remove the metric from the registry upon the Handle being GC'd.

It's unlikely the NaN's will cause actual issues, but since I'm aiming the port at being as safe as possible (that is, no behaviors are expected to change), having that divergence and possibly affecting alerts and whatever math people are using on Alertmanager and whatnot is somewhat of a cause for concern for me.

But either way, I thought removing the gauge due to expiration would be likely very easy to implement as some setting you can pass on the Builder, much like opting-in for strong references is, and it would probably handle most cases in this GH Issue.

For the time being, I added a Cleaner to remove the corresponding Gauge if a Handle is GC'd, but that's not ideal either (extra threads, another change in resource usage, etc.)

Would you (or some other contributor) be willing to review it if I were to code a PoC + test suite of such implementation?

andrebrait avatar Sep 12 '25 20:09 andrebrait

I thought removing the gauge due to expiration would be likely very easy to implement

Since meters don't have references to the registry, I doubt this would be very easy. You might also want to subscribe to this: https://github.com/micrometer-metrics/micrometer/issues/1114

Since we have limited resources and the use-case is narrow, I would keep this issue open to see if there will be a growing user-interest and use-cases before accepting a PR.

jonatan-ivanov avatar Sep 24 '25 23:09 jonatan-ivanov