kube-prometheus icon indicating copy to clipboard operation
kube-prometheus copied to clipboard

fix: allow opting-into upstream probes

Open rexagod opened this issue 1 year ago • 11 comments

Description

Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request. If it fixes a bug or resolves a feature request, be sure to link to that issue.

Allow users to opt-into upstream probe definitions.

Type of change

What type of changes does your code introduce to the kube-prometheus? Put an x in the box that apply.

  • [ ] CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • [ ] FEATURE (non-breaking change which adds functionality)
  • [x] BUGFIX (non-breaking change which fixes an issue)
  • [ ] ENHANCEMENT (non-breaking change which improves existing functionality)
  • [ ] NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. Later this will be copied to the changelog file.

Users can opt-into the upstreamProbesOptIn (boolean, false by default) for KSM to sync with upstream's health probe definitions.

rexagod avatar Sep 08 '24 21:09 rexagod

Tested downstream (CMO), on enabling upstreamProbesOptIn.

diff --git a/assets/kube-state-metrics/deployment.yaml b/assets/kube-state-metrics/deployment.yaml
index ec651432e..e73091269 100644
--- a/assets/kube-state-metrics/deployment.yaml
+++ b/assets/kube-state-metrics/deployment.yaml
@@ -57,7 +57,19 @@ spec:
           ^kube_pod_completion_time$,
           ^kube_pod_status_scheduled$
         image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
+        livenessProbe:
+          httpGet:
+            path: /livez
+            port: 8080
+          initialDelaySeconds: 5
+          timeoutSeconds: 5
         name: kube-state-metrics
+        readinessProbe:
+          httpGet:
+            path: /livez
+            port: 8080
+          initialDelaySeconds: 5
+          timeoutSeconds: 5
         resources:
           requests:
             cpu: 2m
diff --git a/jsonnet/main.jsonnet b/jsonnet/main.jsonnet
index 7539fb456..ffabfb4f0 100644
--- a/jsonnet/main.jsonnet
+++ b/jsonnet/main.jsonnet
@@ -216,6 +216,7 @@ local inCluster =
         kubeRbacProxyImage: $.values.common.images.kubeRbacProxy,
         commonLabels+: $.values.common.commonLabels,
         mixin+: { ruleLabels: $.values.common.ruleLabels },
+        upstreamProbesOptIn: true,
       },
       nodeExporter: {
         namespace: $.values.common.namespace,
@@ -439,7 +440,7 @@ local inCluster =
     openshiftStateMetrics: openshiftStateMetrics($.values.openshiftStateMetrics),
   } +
   (import './utils/anti-affinity.libsonnet') +
-  (import 'github.com/prometheus-operator/kube-prometheus/jsonnet/kube-prometheus/addons/ksm-lite.libsonnet') +
+  (import 'github.com/rexagod/kube-prometheus/jsonnet/kube-prometheus/addons/ksm-lite.libsonnet') +
   (import './utils/ibm-cloud-managed-profile.libsonnet') +
   (import './components/metrics-server-audit.libsonnet') +
   {};  // Including empty object to simplify adding and removing imports during development

rexagod avatar Sep 08 '24 23:09 rexagod

  • [x] Add a comment describing the way probes are defined and expected to work.

rexagod avatar Sep 11 '24 07:09 rexagod

@simonpasquier Pushed all the suggested changes but s/defaults.ignorePaths/[] and s/super.ignorePaths/[] since I believe these will always be [], in addition to the current implementation allowing for a single source of truth for the default value?

rexagod avatar Sep 11 '24 10:09 rexagod

I generated the manifest with enableProbes: true but Kubernetes isn't happy with it:

$ kc events
LAST SEEN                TYPE      REASON      OBJECT                                    MESSAGE
6m20s (x2355 over 21h)   Warning   Unhealthy   Pod/kube-state-metrics-68fccdc5f7-9wzdn   Liveness probe errored: strconv.Atoi: parsing "http-metrics": invalid syntax
80s (x2703 over 21h)     Warning   Unhealthy   Pod/kube-state-metrics-68fccdc5f7-9wzdn   Readiness probe errored: strconv.Atoi: parsing "telemetry": invalid syntax

simonpasquier avatar Sep 12 '24 13:09 simonpasquier

Tested on a cluster, seems to work now. I've made KRP's ports the single source of truth for KSM's probes and ports for convenience. PTAL below for downstream diff.

diff --git a/assets/kube-state-metrics/deployment.yaml b/assets/kube-state-metrics/deployment.yaml
index ec651432e..3b444f081 100644
--- a/assets/kube-state-metrics/deployment.yaml
+++ b/assets/kube-state-metrics/deployment.yaml
@@ -57,7 +57,22 @@ spec:
           ^kube_pod_completion_time$,
           ^kube_pod_status_scheduled$
         image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
+        livenessProbe:
+          httpGet:
+            path: /livez
+            port: http-metrics
+            scheme: HTTPS
         name: kube-state-metrics
+        ports:
+        - containerPort: 8443
+          name: http-metrics
+        - containerPort: 9443
+          name: telemetry
+        readinessProbe:
+          httpGet:
+            path: /readyz
+            port: telemetry
+            scheme: HTTPS
         resources:
           requests:
             cpu: 2m
@@ -75,6 +90,7 @@ spec:
         - --secure-listen-address=:8443
         - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
         - --upstream=http://127.0.0.1:8081/
+        - --ignore-paths=/livez
         - --tls-cert-file=/etc/tls/private/tls.crt
         - --tls-private-key-file=/etc/tls/private/tls.key
         - --client-ca-file=/etc/tls/client/client-ca.crt
@@ -83,7 +99,7 @@ spec:
         name: kube-rbac-proxy-main
         ports:
         - containerPort: 8443
-          name: https-main
+          name: http-metrics
         resources:
           requests:
             cpu: 1m
@@ -104,6 +120,7 @@ spec:
         - --secure-listen-address=:9443
         - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
         - --upstream=http://127.0.0.1:8082/
+        - --ignore-paths=/readyz
         - --tls-cert-file=/etc/tls/private/tls.crt
         - --tls-private-key-file=/etc/tls/private/tls.key
         - --client-ca-file=/etc/tls/client/client-ca.crt
@@ -112,7 +129,7 @@ spec:
         name: kube-rbac-proxy-self
         ports:
         - containerPort: 9443
-          name: https-self
+          name: telemetry
         resources:
           requests:
             cpu: 1m

rexagod avatar Sep 16 '24 00:09 rexagod

Needs an approval (for workflows).

rexagod avatar Oct 16 '24 17:10 rexagod

https://github.com/prometheus-operator/kube-prometheus/actions/runs/11370566998/job/31653537858?pr=2505 fails on kube-state-metrics check for metrics.

simonpasquier avatar Oct 17 '24 06:10 simonpasquier

Fixed, needs workflow approval.

rexagod avatar Oct 21 '24 14:10 rexagod

Re-ping, @simonpasquier for another look here. 🙂

rexagod avatar Nov 05 '24 08:11 rexagod

Since port-definitions use the port numbers as keys, I believe there's no workaround to define the same ones on KSM, without server-side apply warnings. I'll open a ticket/PR upstream to see if we cannot address this there.

rexagod avatar Dec 11 '24 06:12 rexagod

Actually, since Kubelet looks at KSM's ports and pings them on the proxy (which then goes through successfully if the port definitions are duplicated) may suggest that we'd need to bring this up to the upstream k/k community to address this particular use-case.

rexagod avatar Dec 11 '24 06:12 rexagod