controller-runtime icon indicating copy to clipboard operation
controller-runtime copied to clipboard

Remove metrics global registry

Open lilic opened this issue 7 years ago • 13 comments

I would like to suggest to remove the global registry as currently the controller-runtime global registry does not capture metrics registered onto prometheus.DefaultRegisterer. As a step forward and in line with the prosed KEP, the controller-runtime should be able to have the registry injected. (for a start with prometheus.DefaultRegisterer, with the long term goal of removing the global registry in the first place, I am open for suggestions on that.)

As per the proposal https://github.com/kubernetes/community/pull/2909:

Kubernetes also makes extensive use of a global metrics registry to register metrics to be exposed. Aside from general shortcomings of global variables, Kubernetes is seeing actual effects of this, causing a number of components to use sync.Once or other mechanisms to ensure to not panic, when registering metrics. Instead a metrics registry should be passed to each component in order to explicitly register metrics instead of through init methods or other global, non-obvious executions.

Lastly, this may be my own opinion, but introducing globals as a library hides its dependencies, therefore I think having the registry be injected makes the API more clear.

lilic avatar Nov 13 '18 11:11 lilic

Disclaimer: the below represents my current thoughts, but I'm generally open to being convinced otherwise :-)

The original goal was to make registering metrics "easy", while also not necessarily polluting the default Prometheus registry. IMO, this is a pattern that the default Prometheus docs get right, especially for smaller projects -- since the metrics registry is in a global, there's very little code or cognitive overhead in registering a new metric -- you just declare it at the top of your file, and start using it. Fundamentally, most programmers are lazy, and so the easier we can make it to add metrics, the more likely people are to do it. I don't want to give people excuses to put off writing metrics until later :-). Overhead leads to all sorts of excuses -- "oh, we can't put metrics here without refactoring to plumb through the registry", "oh we can't put metrics here because this is a free-standing function and not a method", etc. You might still get metrics, but you'll miss out on some that people didn't think were worth the effort.

With that motivation in mind, I think our current setup is a decent compromise -- it doesn't use the default registry -- it's actually separate, so you can choose to use it or not use it. However, you still get a similar feel: instead of just importing prometheus you have to import prometheus and metrics, but you get a similar feel -- declaring metrics.Registry.MustRegister at the top of your file, and then using the metrics as normal.

DirectXMan12 avatar Nov 13 '18 21:11 DirectXMan12

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Apr 26 '19 23:04 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar May 27 '19 00:05 fejta-bot

/remove-lifecycle rotten

Personally, still think we should move away from it :)

lilic avatar May 27 '19 11:05 lilic

yeah, I think this still bears further discussion. Likely what'll happen is that we'll eagerly go the direction that k/k goes -- if it looks like k/k is switching to opencensus/opentelemetry, we'll head in that direction eagerly (I expect we'll need to wait for opentelemetry's reference Go client in september or so, unless the opencensus --> opentelemetry switch is easy). At any rate, I expect when we switch we'll want to go in the direction of something like opentelemetry to make plumbing thing through easier.

DirectXMan12 avatar Jun 04 '19 00:06 DirectXMan12

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Sep 02 '19 01:09 fejta-bot

/lifecycle frozen

still want to do this

DirectXMan12 avatar Sep 03 '19 18:09 DirectXMan12

Might be related / superseded by #305

/priority backlog

vincepri avatar Feb 20 '20 18:02 vincepri

A backwards-compatible approach to capture metrics from both controller-runtimes registry and prometheus default registry would be to combine them via the prometheus Gatherers

alvaroaleman avatar Feb 20 '20 18:02 alvaroaleman

It seems Kubernetes is removing global metrics, if I'm reading the KEP correctly https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20191028-metrics-stability-to-beta.md - first goal:

No Kubernetes binaries register metrics to prometheus registries directly.

As a user of any library that has global state, I can say that I always wish it didn't have it.

While having global "things" makes it "easy" to start using them (e.g. just import and register a metric!), medium and long term this approach from my personal experience doesn't actually make things simple, quite the opposite. Global loggers, registries, http muxes, default clients, etc all make libraries harder to compose, understand and test. It's very unexpected that things start happening when you just import some package to use a constant, function or type from it. There are a few examples of this being an issue in kubernetes alone - forked glog to remove global flag registration, now removing global metrics registrie. Also related - https://github.com/json-iterator/go/issues/265 - global state makes things fragile (found while hacking on Kubernetes).

Just my $0.02, sorry for the rant :)

ash2k avatar Jul 24 '20 06:07 ash2k

This ask is important.

In some cases it's required for us to reload the controllers. Due to an added limitation on the controller name (https://github.com/kubernetes-sigs/controller-runtime/commit/2b941650bce159006c88bd3ca0d132c7bc40e947), restarting the controller using the same name may fail. This forced us to use random suffix for naming. That in return creates a memory leak such that each restart of a controller starts collecting metrics under a new name, which keeps accumulating.

tzvatot avatar Feb 18 '25 13:02 tzvatot

I would recommend skipping the name validation (https://github.com/kubernetes-sigs/controller-runtime/pull/2902#discussion_r1837540560) instead of using a random suffix for controller names which has a bunch of downsides for logs and metrics

sbueringer avatar Feb 19 '25 07:02 sbueringer

@alvaroaleman should we mention the skip option in the name validation error? I'm now a bit concerned that folks use random controller names if they are not aware of the skip option

sbueringer avatar Feb 19 '25 07:02 sbueringer