Support the endpointslice role for Prometheus
What is missing?
The operator doesn't support the new endpointslice role available in Prometheus since v2.21.0 (https://github.com/prometheus/prometheus/pull/6838).
Why do we need it?
Endpoint slices are required to scale services with many pods.
Anything else we need to know?:
After discussing on Slack with @brancz, the recommended approach would be for the operator to seamlessly use endpoint slices in lieu of endpoints when it detects that the Kubernetes cluster supports them and the Prometheus version is >= 2.21.0.
Since k8s 1.19+ enables the EndpointSliceMirroring controller by default, which mirrors custom Endpoints to EndpointSlices, maybe it makes sense to investigate whether we can switch completely?
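For illustration, a rough sketch (assumed output, not the operator's exact rendering) of what the generated scrape configuration would look like with the new role; only the kubernetes_sd_configs section changes, the rest of the scrape config stays the same:

```yaml
scrape_configs:
- job_name: serviceMonitor/default/example-app/0   # hypothetical ServiceMonitor
  kubernetes_sd_configs:
  - role: endpointslice   # previously "role: endpoints"; requires Prometheus >= 2.21.0
    namespaces:
      names:
      - default
```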
Hi! It would be great to have that; it would ease the life of our platform provider team a lot. :+1:
Support of endpointSlice with FQDN may also help with scraping external targets and fix https://github.com/prometheus-operator/prometheus-operator/issues/3204
@efasel @AntonSmolkov would you like to work on this?
/assign
@mikechengwei any progress on this?
@mikechengwei any progress on this?
I will do it quickly.
I'm doing it now.
Reopening the issue since #4535 doesn't work as expected (see #4634 for details).
In retrospect, switching automatically to the endpointslice role (if supported by the Kubernetes API) might break existing users if they didn't grant get/list/watch permissions on the endpointslices resources to the Prometheus service account in advance. And in a model where service monitors and Prometheus objects aren't managed by the same person, the permissions may even be granted on a case-by-case basis (per namespace). This is the situation in OpenShift for instance, where each component is responsible for setting up the service monitor + RBAC (see for example the etcd operator with its service monitor, role and role binding).
We've discussed it during the last contributors office hours with @ArthurSens and @fpetkovski and we evaluated 3 options (not exclusive):
- We add a field to the Prometheus CRD to define the default Kubernetes SD role (spec.serviceDiscoveryRole for example, defaulting to endpoints).
- We add a field to the ServiceMonitor CRD to let users control which role should be used on a per service monitor basis.
- The operator checks the permissions of the Prometheus SA using LocalSubjectAccessReview and selects the appropriate role based on the response.
Option 1 is the most straightforward solution: since it's opt-in, it shouldn't disrupt existing deployments. In case the user has full control over the monitoring objects and RBAC permissions, they can easily switch to endpointslices, and we can update our default deployment to enable the endpointslice role by default. Option 2 offers more control but it adds complexity to the operator code base and it is harder to understand from a user standpoint. Option 3 doesn't require user intervention but it makes the behavior less obvious to the user, the operator needs additional permissions, and it may be prone to errors (in case the assigned roles drift).
If we go with option 1, it would be up to the person managing the Prometheus resource to decide when they are ready to switch from one role to another. For self-service models (like OpenShift), it means that each component has to update their RBAC before the switch can happen, but that seems acceptable.
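To make option 1 concrete, here is a minimal sketch of what the opt-in could look like on the Prometheus resource, using the field name floated above (the name, placement and allowed values are all still up for discussion, this is not an existing API field):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  # Proposed field from option 1 (hypothetical):
  # "endpoints" would remain the default, "endpointslice" opts the instance in.
  serviceDiscoveryRole: endpointslice
  serviceMonitorSelector: {}
```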
We encountered another issue with the current implementation, that is for the automatic Kubelet Endpoint generation that the operator is responsible for.
TL;DR Kubelet Endpoint generation must also be updated to create and manage EndpointSlices for the Kubelets
We forked the operator to manually enable the EndpointSlice support (encountered the same issue seen in #4634), and added the necessary RBAC rules to allow working with EndpointSlices.
When EndpointSlice support is enabled, the Prometheus Operator properly creates the configuration for Prometheus to use role: endpointslice; however, the Prometheus Operator does not change how it creates/manages the Kubelet Endpoints.
See this code: https://github.com/prometheus-operator/prometheus-operator/blob/13953bcd2969973633d42cdffb8dd069df40e190/pkg/prometheus/operator.go#L786-L792
Problem: Even though EndpointSlice support can be enabled, the Kubelet Endpoints will not be created as EndpointSlices.
Instead, the EndpointSlice gets created by EndpointSlice mirroring, which works fine in cases with <1000 Nodes. However, once the Node count exceeds 1000, the mirrored EndpointSlice object is truncated to 1000 addresses.
The Endpoints object as created by the Prometheus Operator:
$ kubectl get endpoint -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator,app.kubernetes.io/name=kubelet
NAME ENDPOINTS AGE
k8s-prom-op-prometheus-kubelet 10.51.76.112:10250,10.51.76.18:10250,10.51.76.19:10250 + 4587 more... 369d
The EndpointSlice as created by EndpointSlice mirroring:
$ kubectl get endpointslice -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator,app.kubernetes.io/name=kubelet
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
k8s-prom-op-prometheus-kubelet-99zcf IPv4 10255,10250,4194 10.51.72.60,10.51.74.219,10.51.77.57 + 997 more... 369d
The expected EndpointSlices should look something like this (manually crafted):
$ kubectl get endpointslice -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator,app.kubernetes.io/name=kubelet
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
k8s-prom-op-prometheus-kubelet-2n8xp IPv4 10255,10250,4194 10.55.90.46,10.55.90.33,10.55.90.125 + 97 more... 121d
k8s-prom-op-prometheus-kubelet-4gw9z IPv4 10255,10250,4194 10.55.88.241,10.55.90.13,10.55.91.117 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-5dbl4 IPv4 10255,10250,4194 10.55.90.66,10.55.89.46,10.51.79.28 + 97 more... 119d
k8s-prom-op-prometheus-kubelet-8xbvn IPv4 10255,10250,4194 10.55.90.32,10.51.79.246,10.55.91.95 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-d9tbh IPv4 10255,10250,4194 10.55.88.205,10.55.89.157,10.55.91.20 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-ndsq2 IPv4 10255,10250,4194 10.51.79.39,10.55.90.11,10.55.90.207 + 97 more... 82d
k8s-prom-op-prometheus-kubelet-pw267 IPv4 10255,10250,4194 10.55.90.228,10.55.88.75,10.55.89.75 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-tfprz IPv4 10255,10250,4194 10.55.90.49,10.55.89.173,10.51.72.235 + 97 more... 120d
k8s-prom-op-prometheus-kubelet-v6cvh IPv4 10255,10250,4194 10.55.88.53,10.55.89.223,10.55.90.67 + 95 more... 172d
k8s-prom-op-prometheus-kubelet-x468h IPv4 10255,10250,4194 10.55.88.252,10.55.89.47,10.55.88.92 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-xpfhk IPv4 10255,10250,4194 10.55.91.189,10.51.79.198,10.55.90.236 + 1 more... 82d
k8s-prom-op-prometheus-kubelet-zslxs IPv4 10255,10250,4194 10.55.88.10,10.55.89.90,10.51.78.108 + 97 more... 122d
Thanks
Worth pointing out that eventually many things that use ServiceMonitor can and should switch to using PodMonitor:
https://github.com/prometheus-community/helm-charts/issues/2074
In the meantime, I think Prometheus Operator might want to add the endpointslice.kubernetes.io/skip-mirror: true label to the Endpoints resource that it manages today. That is, it is pointless to let the EndpointSliceMirroring controller create a single EndpointSlice that is truncated to 1000 nodes, as @eripa points out.
Prometheus Operator should also be intelligent enough to use EndpointSlices if they exist, but fall back to Endpoints for the cases where skip-mirror is set.
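For reference, a sketch of what that suggestion would mean for the kubelet Endpoints object the operator manages (only the relevant metadata is shown; the label itself is the standard well-known Kubernetes label):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: k8s-prom-op-prometheus-kubelet
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: kubelet
    # Tells the EndpointSliceMirroring controller to skip this object, so no
    # truncated 1000-address mirror EndpointSlice gets created.
    endpointslice.kubernetes.io/skip-mirror: "true"
```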
We forked the operator to manually enable the EndpointSlice support (encountered the same issue seen in https://github.com/prometheus-operator/prometheus-operator/issues/4634), and added the necessary RBAC rules to allow working with EndpointSlices.
@eripa do you want to contribute your patch here? We'd be more than happy to support endpointslices out of the box.
All we did was fix the bug with what's already in prometheus-operator, similar to what's happening here:
https://github.com/prometheus-operator/prometheus-operator/pull/4635/files
discovery.k8s.io needs to define a version, so just adding discovery.k8s.io/v1 makes prometheus-operator detect endpointslice support and convert service discovery from endpoints to endpointslices in the Prometheus config.
Works great but two new issues arise:
First, prometheus-operator creates/manages the kubelet endpoints only as Endpoints... which will be truncated at 1000 when mirrored to EndpointSlices. So if the operator runs in one mode or the other, CreateOrUpdateEndpoints would also need to follow along (Endpoints vs EndpointSlices) here: https://github.com/prometheus-operator/prometheus-operator/blob/9c0db5656f04e005de6a0413fd8eb8f11ec99757/pkg/k8sutil/k8sutil.go#L140-L158
Second, something along these lines would need to be added to the Prometheus service account:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <cr-name>
rules:
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
```
So in summary: if you just add the RBAC for Prometheus and fix discovery.k8s.io/v1 in prometheus-operator, then everything will work until you hit clusters with over 1000 kubelets. After that you'll start losing monitoring of kubelets unless CreateOrUpdateEndpoints is changed to no longer be hard-coded to Endpoints.
Easy enough to change/fix, but it was our impression that prometheus-operator had bigger plans. https://github.com/prometheus-operator/prometheus-operator/pull/4635#issuecomment-1066506084
TL;DR Kubelet Endpoint generation must also be updated to create and manage EndpointSlices for the Kubelets
@simonpasquier we've also encountered a similar issue during our K8s clusters upgrade to v1.22. As we have a few clusters comprising more than 1000 nodes, switching prometheus-operator to endpointslice doesn't work for the kubelet target because the endpointslice mirroring controller truncates addresses beyond 1000 nodes.
We'd be happy to create a patch upstream to fix the discovery of kubelet addresses by publishing EndpointSlices via the operator alongside the existing kubelet Endpoints resource. Let us know if it is not picked up by anyone yet. Thanks.
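To illustrate what we have in mind, a hand-written sketch of the kind of EndpointSlice the operator could publish for the kubelets (the shape follows the discovery.k8s.io/v1 API; the names, ports and addresses here are examples only, and the operator would create one slice per chunk of node addresses rather than one giant object):

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: k8s-prom-op-prometheus-kubelet-0   # hypothetical per-chunk name
  namespace: kube-system
  labels:
    kubernetes.io/service-name: k8s-prom-op-prometheus-kubelet
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: kubelet
addressType: IPv4
ports:
- name: https-metrics
  port: 10250
  protocol: TCP
endpoints:
- addresses:
  - 10.51.76.112
  nodeName: node-a   # example node
- addresses:
  - 10.51.76.18
  nodeName: node-b   # example node
```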
If you just add the rbac for prometheus and fix discovery.k8s.io/v1 in prometheus-operator ...
How do you implement 'fix discovery.k8s.io/v1 in prometheus-operator'?
I still don't understand how to fix the problem through these two steps.
Thanks.
Let us know if it is not picked up by anyone yet.
@iyashu feel free to open a PR. I don't think that anybody's working on it right now.
@simonpasquier @ArthurSens I would like to work on this. As I stated in my GSoC proposal, EndpointSlice may be essential if we want to support ServiceMonitor in DaemonSet mode. Of course, with a new deployment mode it would be simpler: we would just need to tell users that we need EndpointSlice permissions. But I want to see how EndpointSlice works in the current modes first, to see if there are any problems with it.
So in summary I'm seeing we have two things to do here:
- Allow users to specify whether they want to use Endpoints or EndpointSlice role
- Manage Kubelet endpoints properly according to the role (Endpoints/EndpointSlice). I.e., EndpointSlice should work with Kubelet targets even when there are more than 1000 nodes.
Regarding the first task, as Simon pointed out here we have three options:
- We add a field to the Prometheus CRD to define the default Kubernetes SD role (spec.serviceDiscoveryRole for example, defaulting to endpoints).
- We add a field to the ServiceMonitor CRD to let users control which role should be used on a per service monitor basis.
- The operator checks the permissions of the Prometheus SA using LocalSubjectAccessReview and selects the appropriate role based on the response.
Do you think we can use both option 1 as the "parent" role and option 2 as the "child" role? Would it create too much complexity and additional problems?
Thanks for chiming in @haanhvu!
Thinking about it again, I'm not sure about option 3 anymore:
- it would require the prometheus-operator SA to be granted the permission to create SubjectAccessReview, which is a high bar requirement IMHO.
- it can get complicated as the operator would need to check each pod/service monitor multiplied by the number of namespaces.
I would start with a field in the Prometheus/PrometheusAgent CRD to choose which of the endpoints or endpointslice roles should be the default choice. We can add the same option to the pod/service monitor CRDs to enable a smooth migration: this way, you could flip the endpointslice toggle at the Prometheus level and continue using endpoints for some monitor resources (the pod/service monitor field would take precedence over the Prometheus field).
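A sketch of how that migration path could look, with purely hypothetical field names (nothing here is final API): the Prometheus object opts the instance into the endpointslice role, while an individual ServiceMonitor can stay on endpoints until its RBAC is ready.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  serviceDiscoveryRole: endpointslice   # hypothetical instance-wide default
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-app
spec:
  serviceDiscoveryRole: endpoints        # hypothetical per-monitor override
  selector:
    matchLabels:
      app: legacy-app
  endpoints:
  - port: metrics
```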
Regarding the kubelet controller, the control plane mirrors Endpoints to EndpointSlices without problem as long as there are fewer than 1,000 nodes, but above this limit the mirroring "fails" (not all addresses are mirrored). The solution would be that the kubelet controller (from prometheus-operator) maintains both the endpoints and endpointslices objects. The functionality should be opt-in for now (new CLI argument for the operator) because the current situation works fine for most users and managing the endpointslices adds complexity (and potential bugs) to the operator.
@simonpasquier Thanks for the clarification. Those were also the ideas I had in my comment. I'll send PRs for each part shortly.
The solution would be that the kubelet controller (from prometheus-operator) maintains both the endpoints and endpointslices objects. The functionality should be opt-in for now (new CLI argument for the operator) because the current situation works fine for most users and managing the endpointslices adds complexity (and potential bugs) to the operator.
Agreed, changing the endpoints/endpointSlices update/create at the kubelet controller from being hardcoded to "endpoints" to being based on the opt-in field (Endpoints/EndpointSlice) seems like a straightforward change. This way the mirroring cap at 1000 endpoints can be avoided.
@simonpasquier @haanhvu
Has this PR been abandoned? https://github.com/prometheus-operator/prometheus-operator/pull/6518
If so, I will branch off of this and submit a new PR which addresses the concerns raised here: https://github.com/prometheus-operator/prometheus-operator/pull/6518/files#r1572267208
@thaniri It's not abandoned but I'm having some longer and more urgent tasks to do, so I may not come back to that PR soon. If you could help moving it forward I'd really appreciate^^
Agreed, changing the endpoints/endpointSlices update/create at the kubelet controller from being hardcoded to "endpoints" to being based on the opt-in field (Endpoints/EndpointSlice) seems like a straightforward change
I wouldn't call it straightforward tbh :)
https://github.com/prometheus-operator/prometheus-operator/pull/6663
Sorry for the delay. Here is a pull request which implements a command-line flag for the operator that controls whether to generate SD configs with the endpointslice or the endpoints role.
https://github.com/prometheus-operator/prometheus-operator/pull/6672
I have closed PR https://github.com/prometheus-operator/prometheus-operator/pull/6663 and implemented the recommended solution by plumbing the configuration down through the CRDs.
The support for the endpointslice role in ServiceMonitor and PodMonitor is now merged. The remaining bit is to support endpointslice for the kubelet controller.