rukpak icon indicating copy to clipboard operation
rukpak copied to clipboard

Introduce framework to allow emitting promethues metrics from the plain provisioner

Open anik120 opened this issue 3 years ago • 7 comments

anik120 avatar Apr 26 '22 19:04 anik120

Controller-runtime already has a metrics server built in, so I'd push back a little on the need for an entire framework (since I think we already have most of one).

Also, I'm going to make yet another plug for kubebuilder/operator-sdk, which has a bunch of niceties around metrics (e.g. scaffolding that includes kube-rbac-proxy to protect the metrics endpoint and a ServiceMonitor object that can be used to automatically configure a prometheus-operator controlled prometheus instance to scrape our metrics)

joelanford avatar Apr 26 '22 20:04 joelanford

@joelanford if (read when) we want these metrics to be accessible outside the cluster via a Service, so that we can collect metrics from all the clusters running Bundles, I was under the impression that it's actually a security risk to expose the default port that the controller uses (which is where the controller-runtime metrics are published i.e /metrics). Which, I'm guessing, is probably why kubebuilder/sdk projects include kube-rbac-proxy.

Also, I'm going to make yet another plug for kubebuilder/operator-sdk

I feel like I'm missing some context here, but why wasn't any of kubebuilder/operator-sdk used again? Use cases like "automatically configure a prometheus-operator controlled prometheus instance to scrape our metrics" is going to come up when we go live in prod, and if we were to learn from OLM, we kept manually trying to keep up with these things, and therefore kept missing edge cases/introducing bugs etc.

Metrics is just one component, I'm guessing there will most definitely be other components that'll bring up the same discussion.

cc: @timflannagan @tylerslaton @exdx, in case you guys have the context I'm missing.

anik120 avatar Apr 28 '22 00:04 anik120

From the Kubebuilder metrics guide:

By default, controller-runtime builds a global prometheus registry and publishes a collection of performance metrics for each controller. These metrics are protected by kube-rbac-proxy by default if using kubebuilder. Kubebuilder v2.2.0+ scaffold a clusterrole which can be found at config/rbac/auth_proxy_client_clusterrole.yaml.

From the kube-rbac-proxy project:

Motivation

I developed this proxy in order to be able to protect Prometheus metrics endpoints. In a scenario, where an attacker might obtain full control over a Pod, that attacker would have the ability to discover a lot of information about the workload as well as the current load of the respective workload. This information could originate for example from the node-exporter and kube-state-metrics. Both of those metric sources can commonly be found in Prometheus monitoring stacks on Kubernetes.

This project was created to specifically solve the above problem, however, I felt there is a larger need for such a proxy in general.

anik120 avatar Apr 29 '22 14:04 anik120

Brief note from the upstream OLM working group: let's work towards a model where we invest in metrics sooner-rather-than-later. It sounded like we agreed that c-r metrics should be sufficient in the short term, assuming we back those metrics behind some RBAC/authz/etc. (e.g. kube-rbac-proxy). We can invest further in determine which metrics are needed a couple of metrics down the line once we iron out the API surface for the core rukpak APIs, and the behavior and functionality present in the plain provisioner implementation.

timflannagan avatar May 05 '22 19:05 timflannagan

This issue has become stale because it has been open 60 days with no activity. The maintainers of this repo will remove this label during issue triage. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.

github-actions[bot] avatar Sep 07 '22 14:09 github-actions[bot]

Now that some of the more focused on features have gotten into the RukPak project we should consider coming back to this. @anik120 was kind enough to open this issue as well as create a PR for it which is linked above. To get this work over the finish line we should re-evaulate that closed PR (which was closed mainly for age), rebase it, and think of what is needed to get it merged.

Brief note from the upstream OLM working group: let's work towards a model where we invest in metrics sooner-rather-than-later.

This should help us to focus on this point moving forward with the Deppy project as well since we can leverage the work done here there.

tylerslaton avatar Oct 13 '22 19:10 tylerslaton

This issue has become stale because it has been open 60 days with no activity. The maintainers of this repo will remove this label during issue triage or it will be removed automatically after an update. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.

github-actions[bot] avatar Dec 13 '22 00:12 github-actions[bot]