Support the endpointslice role for Prometheus
What is missing?
The operator doesn't support the new endpointslice role available in Prometheus since v2.21.0 (https://github.com/prometheus/prometheus/pull/6838).
Why do we need it?
Endpoint slices are required to scale services with many pods.
Anything else we need to know?:
After discussing on Slack with @brancz, the recommended approach would be for the operator to seamlessly use endpoint slices in lieu of endpoints when it detects that the Kubernetes cluster supports them and the Prometheus version is >= 2.21.0.
Since k8s 1.19+ enables the EndpointSliceMirroring controller by default, which mirrors custom Endpoints to EndpointSlices, maybe it makes sense to investigate whether we can switch completely?
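For illustration, a rough sketch (assumed output, not the operator's exact rendering) of what the generated scrape configuration would look like with the new role; only the kubernetes_sd_configs section changes, the rest of the scrape config stays the same:

```yaml
scrape_configs:
- job_name: serviceMonitor/default/example-app/0   # hypothetical ServiceMonitor
  kubernetes_sd_configs:
  - role: endpointslice   # previously "role: endpoints"; requires Prometheus >= 2.21.0
    namespaces:
      names:
      - default
```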
Hi! It would be great to have that; it would ease the life of our platform provider team a lot. :+1:
Support of endpointSlice with FQDN may also help with scraping external targets and fix https://github.com/prometheus-operator/prometheus-operator/issues/3204
@efasel @AntonSmolkov would you like to work on this?
/assign
@mikechengwei any progress on this?
@mikechengwei any progress on this?
I will do it quickly.
I'm doing it now.
Reopening the issue since #4535 doesn't work as expected (see #4634 for details).
In retrospect, switching automatically to the endpointslice role (if supported by the Kubernetes API) might break existing users if they didn't grant get/list/watch permissions on the endpointslices resources to the Prometheus service account in advance. And in a model where service monitors and Prometheus objects aren't managed by the same person, the permissions may even be granted on a case-by-case basis (per namespace). This is the situation in OpenShift for instance, where each component is responsible for setting up the service monitor + RBAC (see for example the etcd operator with its service monitor, role and role binding).
We've discussed it during the last contributors office hours with @ArthurSens and @fpetkovski and we evaluated 3 options (not exclusive):
- We add a field to the Prometheus CRD to define the default Kubernetes SD role (spec.serviceDiscoveryRole for example, defaulting to endpoints).
- We add a field to the ServiceMonitor CRD to let users control which role should be used on a per service monitor basis.
- The operator checks the permissions of the Prometheus SA using LocalSubjectAccessReview and selects the appropriate role based on the response.
Option 1 is the most straightforward solution: since it's opt-in, it shouldn't disrupt existing deployments. In case the user has full control over the monitoring objects and RBAC permissions, they can easily switch to endpointslices, and we can update our default deployment to enable the endpointslice role by default. Option 2 offers more control but it adds complexity to the operator code base and it is harder to understand from a user standpoint. Option 3 doesn't require user intervention but it makes the behavior less obvious to the user, the operator needs additional permissions, and it may be prone to errors (in case the assigned roles drift).
If we go with option 1, it would be up to the person managing the Prometheus resource to decide when they are ready to switch from one role to another. For self-service models (like OpenShift), it means that each component has to update their RBAC before the switch can happen, but that seems acceptable.
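To make option 1 concrete, here is a minimal sketch of what the opt-in could look like on the Prometheus resource, using the field name floated above (the name, placement and allowed values are all still up for discussion, this is not an existing API field):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  # Proposed field from option 1 (hypothetical):
  # "endpoints" would remain the default, "endpointslice" opts the instance in.
  serviceDiscoveryRole: endpointslice
  serviceMonitorSelector: {}
```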
We encountered another issue with the current implementation, that is for the automatic Kubelet Endpoint generation that the operator is responsible for.
TL;DR Kubelet Endpoint generation must also be updated to create and manage EndpointSlices for the Kubelets
We forked the operator to manually enable the EndpointSlice support (encountered the same issue seen in #4634), and added the necessary RBAC rules to allow working with EndpointSlices.
When EndpointSlice support is enabled, the Prometheus Operator properly creates the configuration for Prometheus to use role: endpointslice; however, the Prometheus Operator does not change how it creates/manages the Kubelet Endpoints.
See this code: https://github.com/prometheus-operator/prometheus-operator/blob/13953bcd2969973633d42cdffb8dd069df40e190/pkg/prometheus/operator.go#L786-L792
Problem: Even though EndpointSlice support can be enabled, the Kubelet Endpoints will not be created as EndpointSlices.
Instead, the EndpointSlice gets created by EndpointSlice mirroring, which works fine in cases with <1000 Nodes. However, once the Node count exceeds 1000, the mirrored EndpointSlice object is truncated to 1000 addresses.
The Endpoints object as created by the Prometheus Operator:
$ kubectl get endpoint -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator,app.kubernetes.io/name=kubelet
NAME ENDPOINTS AGE
k8s-prom-op-prometheus-kubelet 10.51.76.112:10250,10.51.76.18:10250,10.51.76.19:10250 + 4587 more... 369d
The EndpointSlice as created by EndpointSlice mirroring:
$ kubectl get endpointslice -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator,app.kubernetes.io/name=kubelet
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
k8s-prom-op-prometheus-kubelet-99zcf IPv4 10255,10250,4194 10.51.72.60,10.51.74.219,10.51.77.57 + 997 more... 369d
The expected EndpointSlices should look something like this (manually crafted):
$ kubectl get endpointslice -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator,app.kubernetes.io/name=kubelet
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
k8s-prom-op-prometheus-kubelet-2n8xp IPv4 10255,10250,4194 10.55.90.46,10.55.90.33,10.55.90.125 + 97 more... 121d
k8s-prom-op-prometheus-kubelet-4gw9z IPv4 10255,10250,4194 10.55.88.241,10.55.90.13,10.55.91.117 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-5dbl4 IPv4 10255,10250,4194 10.55.90.66,10.55.89.46,10.51.79.28 + 97 more... 119d
k8s-prom-op-prometheus-kubelet-8xbvn IPv4 10255,10250,4194 10.55.90.32,10.51.79.246,10.55.91.95 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-d9tbh IPv4 10255,10250,4194 10.55.88.205,10.55.89.157,10.55.91.20 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-ndsq2 IPv4 10255,10250,4194 10.51.79.39,10.55.90.11,10.55.90.207 + 97 more... 82d
k8s-prom-op-prometheus-kubelet-pw267 IPv4 10255,10250,4194 10.55.90.228,10.55.88.75,10.55.89.75 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-tfprz IPv4 10255,10250,4194 10.55.90.49,10.55.89.173,10.51.72.235 + 97 more... 120d
k8s-prom-op-prometheus-kubelet-v6cvh IPv4 10255,10250,4194 10.55.88.53,10.55.89.223,10.55.90.67 + 95 more... 172d
k8s-prom-op-prometheus-kubelet-x468h IPv4 10255,10250,4194 10.55.88.252,10.55.89.47,10.55.88.92 + 97 more... 145d
k8s-prom-op-prometheus-kubelet-xpfhk IPv4 10255,10250,4194 10.55.91.189,10.51.79.198,10.55.90.236 + 1 more... 82d
k8s-prom-op-prometheus-kubelet-zslxs IPv4 10255,10250,4194 10.55.88.10,10.55.89.90,10.51.78.108 + 97 more... 122d
Thanks
Worth pointing out that eventually many things that use ServiceMonitor can and should switch to using PodMonitor:
https://github.com/prometheus-community/helm-charts/issues/2074
In the meantime, I think Prometheus Operator might want to add the endpointslice.kubernetes.io/skip-mirror: true label to the Endpoints resource that it manages today. That is, it is pointless to let the EndpointSliceMirroring controller create a single EndpointSlice that is truncated to 1000 nodes, as @eripa points out.
Prometheus Operator should also be intelligent enough to use EndpointSlices if they exist, but fall back to Endpoints for the cases where skip-mirror is set.
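For reference, a sketch of what that suggestion would mean for the kubelet Endpoints object the operator manages (only the relevant metadata is shown; the label itself is the standard well-known Kubernetes label):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: k8s-prom-op-prometheus-kubelet
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: kubelet
    # Tells the EndpointSliceMirroring controller to skip this object, so no
    # truncated 1000-address mirror EndpointSlice gets created.
    endpointslice.kubernetes.io/skip-mirror: "true"
```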
We forked the operator to manually enable the EndpointSlice support (encountered the same issue seen in https://github.com/prometheus-operator/prometheus-operator/issues/4634), and added the necessary RBAC rules to allow working with EndpointSlices.
@eripa do you want to contribute your patch here? We'd be more than happy to support endpointslices out of the box.
All we did was fix the bug with what's already in prometheus-operator, similar to what's happening here:
https://github.com/prometheus-operator/prometheus-operator/pull/4635/files
discovery.k8s.io needs to define a version, so just adding discovery.k8s.io/v1 makes prometheus-operator detect endpointslice support and convert service discovery from endpoints to endpointslices in the Prometheus config.
Works great but two new issues arise:
First, prometheus-operator creates/manages the kubelet endpoints only as Endpoints... which will be truncated at 1000 when mirrored to EndpointSlices. So if the operator runs in one mode or the other, CreateOrUpdateEndpoints would also need to follow along (Endpoints vs EndpointSlices) here: https://github.com/prometheus-operator/prometheus-operator/blob/9c0db5656f04e005de6a0413fd8eb8f11ec99757/pkg/k8sutil/k8sutil.go#L140-L158
Second, something along these lines would need to be added to the Prometheus service account:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <cr-name>
rules:
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
```
So in summary: if you just add the RBAC for Prometheus and fix discovery.k8s.io/v1 in prometheus-operator, then everything will work until you hit clusters with over 1000 kubelets. After that you'll start losing monitoring of kubelets unless CreateOrUpdateEndpoints is changed to no longer be hard-coded to Endpoints.
Easy enough to change/fix, but it was our impression that prometheus-operator had bigger plans. https://github.com/prometheus-operator/prometheus-operator/pull/4635#issuecomment-1066506084
TL;DR Kubelet Endpoint generation must also be updated to create and manage EndpointSlices for the Kubelets
@simonpasquier we've also encountered a similar issue during our K8s clusters upgrade to v1.22. As we have a few clusters comprising more than 1000 nodes, switching prometheus-operator to endpointslice doesn't work for the kubelet target because the endpointslice mirroring controller truncates addresses beyond 1000 nodes.
We'd be happy to create a patch upstream to fix the discovery of kubelet addresses by publishing EndpointSlices via the operator alongside the existing kubelet Endpoints resource. Let us know if it is not picked up by anyone yet. Thanks.
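To illustrate what we have in mind, a hand-written sketch of the kind of EndpointSlice the operator could publish for the kubelets (the shape follows the discovery.k8s.io/v1 API; the names, ports and addresses here are examples only, and the operator would create one slice per chunk of node addresses rather than one giant object):

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: k8s-prom-op-prometheus-kubelet-0   # hypothetical per-chunk name
  namespace: kube-system
  labels:
    kubernetes.io/service-name: k8s-prom-op-prometheus-kubelet
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: kubelet
addressType: IPv4
ports:
- name: https-metrics
  port: 10250
  protocol: TCP
endpoints:
- addresses:
  - 10.51.76.112
  nodeName: node-a   # example node
- addresses:
  - 10.51.76.18
  nodeName: node-b   # example node
```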
If you just add the rbac for prometheus and fix discovery.k8s.io/v1 in prometheus-operator ...
How do you implement 'fix discovery.k8s.io/v1 in prometheus-operator'?
I still don't understand how to fix the problem through these two steps.
Thanks.
Let us know if it is not picked up by anyone yet.
@iyashu feel free to open a PR. I don't think that anybody's working on it right now.
@simonpasquier @ArthurSens I would like to work on this. As I stated in my GSoC proposal, EndpointSlice may be essential if we want to support ServiceMonitor in DaemonSet mode. Of course, with a new deployment mode it would be simpler: we would just need to tell users that we need EndpointSlice permissions. But I want to see how EndpointSlice works in the current modes first, to see if there are any problems with it.
So in summary I'm seeing we have two things to do here:
- Allow users to specify whether they want to use Endpoints or EndpointSlice role
- Manage Kubelet endpoints properly according to the role (Endpoints/EndpointSlice). I.e., EndpointSlice should work with Kubelet targets even when there are more than 1000 nodes.
Regarding the first task, as Simon pointed out here we have three options:
- We add a field to the Prometheus CRD to define the default Kubernetes SD role (spec.serviceDiscoveryRole for example, defaulting to endpoints).
- We add a field to the ServiceMonitor CRD to let users control which role should be used on a per service monitor basis.
- The operator checks the permissions of the Prometheus SA using LocalSubjectAccessReview and selects the appropriate role based on the response.
Do you think we can use both option 1 as the "parent" role and option 2 as the "child" role? Would it create too much complexity and additional problems?
Thanks for chiming in @haanhvu!
Thinking about it again, I'm not sure about option 3 anymore:
- it would require the prometheus-operator SA to be granted the permission to create SubjectAccessReview, which is a high bar requirement IMHO.
- it can get complicated as the operator would need to check each pod/service monitor multiplied by the number of namespaces.
I would start with a field in the Prometheus/PrometheusAgent CRD to choose which of the endpoints or endpointslice roles should be the default choice. We can add the same option to the pod/service monitor CRDs to enable a smooth migration: this way, you could flip the endpointslice toggle at the Prometheus level and continue using endpoints for some monitor resources (the pod/service monitor field would take precedence over the Prometheus field).
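A sketch of how that migration path could look, with purely hypothetical field names (nothing here is final API): the Prometheus object opts the instance into the endpointslice role, while an individual ServiceMonitor can stay on endpoints until its RBAC is ready.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  serviceDiscoveryRole: endpointslice   # hypothetical instance-wide default
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-app
spec:
  serviceDiscoveryRole: endpoints        # hypothetical per-monitor override
  selector:
    matchLabels:
      app: legacy-app
  endpoints:
  - port: metrics
```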
Regarding the kubelet controller, the control plane mirrors Endpoints to EndpointSlices without problem as long as there are fewer than 1,000 nodes, but above this limit the mirroring "fails" (not all addresses are mirrored). The solution would be that the kubelet controller (from prometheus-operator) maintains both the endpoints and endpointslices objects. The functionality should be opt-in for now (new CLI argument for the operator) because the current situation works fine for most users and managing the endpointslices adds complexity (and potential bugs) to the operator.
@simonpasquier Thanks for the clarification. Those were also the ideas I had in my comment. I'll send PRs for each part shortly.
The solution would be that the kubelet controller (from prometheus-operator) maintains both the endpoints and endpointslices objects. The functionality should be opt-in for now (new CLI argument for the operator) because the current situation works fine for most users and managing the endpointslices adds complexity (and potential bugs) to the operator.
Agreed, changing the endpoints/endpointSlices update/create at the kubelet controller from being hardcoded to "endpoints" to being based on the opt-in field (Endpoints/EndpointSlice) seems like a straightforward change. This way the mirroring cap at 1000 endpoints can be avoided.
@simonpasquier @haanhvu
Has this PR been abandoned? https://github.com/prometheus-operator/prometheus-operator/pull/6518
If so, I will branch off of this and submit a new PR which addresses the concerns raised here: https://github.com/prometheus-operator/prometheus-operator/pull/6518/files#r1572267208
@thaniri It's not abandoned but I'm having some longer and more urgent tasks to do, so I may not come back to that PR soon. If you could help moving it forward I'd really appreciate^^
Agreed, changing the endpoints/endpointSlices update/create at the kubelet controller from being hardcoded to "endpoints" to being based on the opt-in field (Endpoints/EndpointSlice) seems like a straightforward change
I wouldn't call it straightforward tbh :)
https://github.com/prometheus-operator/prometheus-operator/pull/6663
Sorry for the delay. Here is a pull request which implements a command-line flag for the operator that controls whether to generate SD configs with the endpointslice or the endpoints role.
https://github.com/prometheus-operator/prometheus-operator/pull/6672
I have closed PR https://github.com/prometheus-operator/prometheus-operator/pull/6663 and implemented the recommended solution by plumbing the configuration down through the CRDs.
The support for the endpointslice role in ServiceMonitor and PodMonitor is now merged. The remaining bit is to support endpointslice for the kubelet controller.