
TLS Error configuring a PodMonitoring resource in GKE Autopilot cluster

Open JishnuM opened this issue 1 year ago • 27 comments

Hi,

I am attempting to follow the steps here to configure managed collection on GKE Autopilot.

When attempting to apply any PodMonitoring resource, I get the following error:

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
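
For reference, the mismatch is visible in the webhook configuration itself. A quick way to inspect it (the configuration name below is taken from the operator logs further down and may differ on your cluster):

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep gmp
kubectl get mutatingwebhookconfigurations gmp-operator.gmp-system.monitoring.googleapis.com \
  -o jsonpath='{range .webhooks[*]}{.name}{" -> "}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\n"}{end}'

This shows which namespace/service each webhook targets, which can then be compared against the names in the certificate error above.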

In the logs for the gmp-operator in the gke-gmp-system namespace I see the following errors:

  • "validatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
    • "Setting CA bundle for ValidatingWebhookConfiguration failed"
  • "mutatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
    • "Setting CA bundle for MutatingWebhookConfiguration failed"

This seems in some ways similar to the following issues:

  • https://github.com/GoogleCloudPlatform/prometheus-engine/issues/151
  • https://github.com/GoogleCloudPlatform/prometheus-engine/issues/178
  • https://github.com/GoogleCloudPlatform/prometheus-engine/issues/186

but it is notably different since it is a certificate error and not a timeout.

JishnuM avatar Aug 08 '22 13:08 JishnuM

Hi Jishnu,

GMP Autopilot is rolling out as we speak to clusters >= 1.23. Once that's done, it'll be enabled by default, and you should be able to just start deploying PodMonitoring CRs. You caught this at a strange time mid-rollout where weirdness like this can happen.

We may have to actually disable that checkbox to prevent this error from happening, so thanks for the tip.

lyanco avatar Aug 08 '22 15:08 lyanco

I'm just trying to use the example app after enabling GMP on a GKE cluster, and it's failing:

% k apply -f pod-monitoring.yaml 
Error from server (InternalError): error when creating "pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
% kubectl -n gke-gmp-system port-forward svc/gmp-operator 8443 
error: Pod 'gmp-operator-647b69cbdd-vdkt5' does not have a named port 'webhook'
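
Listing the service's ports first shows what is actually exposed (output will vary by cluster version):

% kubectl -n gke-gmp-system get svc gmp-operator -o jsonpath='{.spec.ports}'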

rojomisin avatar Aug 09 '22 06:08 rojomisin

The checkbox to enable GMP in Autopilot isn't functional. When the rollout is completed, GMP will be enabled by default.

lyanco avatar Aug 09 '22 13:08 lyanco

Surely it does something; I've been toggling it back and forth a few times, with 30-minute "grey busy notification" cluster messages in between. 🤔 Any ETA on rollout completion?

rojomisin avatar Aug 09 '22 16:08 rojomisin

It is a bug - that checkbox doesn't do anything on AP clusters besides put you in a broken state. We're going to disable it.

All AP clusters >=1.23 should have GMP on by default by end of this week.

lyanco avatar Aug 09 '22 17:08 lyanco

Another victim of this footgun checking in.

akettmann-e24 avatar Aug 09 '22 22:08 akettmann-e24

> It is a bug - that checkbox doesn't do anything on AP clusters besides put you in a broken state. We're going to disable it.
>
> All AP clusters >=1.23 should have GMP on by default by end of this week.

To confirm, what is the status for AP clusters on 1.22? Is the checkbox still non-functional? Is there a way to get Managed Prometheus?

JishnuM avatar Aug 10 '22 00:08 JishnuM

@lyanco

+1 for this bug. So for now we should roll back this checkbox change to get out of the broken state, and it'll work in a future release, right?

Btw, my current project is on v1.21, and we cannot update because of a deprecated API being in use - afaict it must be one of Autopilot's own components that uses it. So an Autopilot component uses a deprecated API, and the cluster cannot be updated. We don't have any paid Google support; can you tell us where we should report a bug for this?

iamolegga avatar Aug 10 '22 11:08 iamolegga

GMP will only be on for clusters 1.23+, but the AP regular channel is updating to 1.23 at the end of this month... so the fact that it isn't in 1.22 will be super transient.

@iamolegga yes, I would uncheck it for now. We'll leave the gcloud... --disable-managed-prometheus option in, which does the same thing, so people will be able to get themselves out of this bug.

If you email me the details of your logs, I can try to forward them to the right team... email is my GitHub username at the domain of the company I work for.

lyanco avatar Aug 10 '22 14:08 lyanco

Closing - the pencil (the edit control for this setting in the console) has been disabled.

If you enabled this previously and need to disable it to get you out of the broken state, you can run gcloud beta container clusters update CLUSTER_NAME --disable-managed-prometheus
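
You can verify the resulting state afterwards (field path per the GKE API; add --region or --zone as needed):

gcloud container clusters describe CLUSTER_NAME \
  --format='value(monitoringConfig.managedPrometheusConfig.enabled)'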

lyanco avatar Aug 10 '22 17:08 lyanco

@lyanco just tried with the cluster's version 1.23.8-gke.1900 and got the same error

iamolegga avatar Aug 12 '22 14:08 iamolegga

It's not fully rolled out yet.

lyanco avatar Aug 12 '22 14:08 lyanco

so this is for Rapid channel only?

rojomisin avatar Aug 12 '22 17:08 rojomisin

@rojomisin yes, but still not available 😅 checked just now

iamolegga avatar Aug 12 '22 17:08 iamolegga

Regular should be upgraded to 1.23 by the end of August, I believe... so that should cover pretty much all AP clusters soon.

lyanco avatar Aug 12 '22 18:08 lyanco

Can you help me understand how GKE Autopilot STABLE channel clusters can now use Managed Prometheus?

I've gone through the steps, but I continue to get the error about the TLS cert not matching:

Error from server (InternalError): error when creating "ingress-nginx/metrics.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc

Is this guide on setting up managed collection applicable to stable GKE Autopilot? https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed

I don't know if we can wait a few weeks, but the instructions seem to be in disarray. Please point me to a working setup guide for Stable if you can.

rojomisin avatar Aug 12 '22 21:08 rojomisin

Hey all - we had an issue with the rollout and GMP is still not usable on AP clusters. Will update you here when I have an estimated fix date. It's a top priority.

Re: the Stable channel, I believe it will update to 1.23.6 by the end of this month, so it should be a transitory issue.

lyanco avatar Aug 15 '22 15:08 lyanco

Thanks for the update; we look forward to being able to use GMP in AP soon. However, we do not want to use the Rapid or Regular channels in production.

If Stable does update to 1.23.6 by, let's say, 9/1, then why is a new AP Stable cluster deploying version 1.21? In other words, how can we configure AP Stable clusters now so that they get the 1.23 stable version, rather than 1.21, when it's released? 🤔

Additionally, does that mean that for a Stable AP production cluster we should use unmanaged collection until the AP GMP rollout is resolved and readily available in the Stable channel?

rojomisin avatar Aug 15 '22 18:08 rojomisin

AP clusters are statically versioned; all stable clusters will be moved to 1.23 when they upgrade the channel. You can't choose versions within the channel.

We're working on fixing AP as soon as we can; I'll let you know when I have a target date for the stable channel upgrading.

lyanco avatar Aug 15 '22 19:08 lyanco

Keeping an eye on this thread to use GMP on a regular channel private AP cluster

mimizone avatar Aug 22 '22 18:08 mimizone

As a workaround - because Stable won't be ready for a few months, I'm guessing - we modified the forked Prometheus manifest, and it works well.

You just have to set up Workload Identity for the metrics namespace: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
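
Roughly, the binding we set up looks like this (a sketch; PROJECT_ID is a placeholder and gmp-sa is just the service account name we chose):

# Create a Google service account and let it write metrics.
gcloud iam service-accounts create gmp-sa --project=PROJECT_ID
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:gmp-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/monitoring.metricWriter"

# Allow the default Kubernetes service account in the metrics namespace to impersonate it.
gcloud iam service-accounts add-iam-policy-binding \
  gmp-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[metrics/default]"

# Annotate the Kubernetes service account with the Google service account.
kubectl annotate serviceaccount default --namespace metrics \
  iam.gke.io/gcp-service-account=gmp-sa@PROJECT_ID.iam.gserviceaccount.com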

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics:prometheus
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics:prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics:prometheus
subjects:
- kind: ServiceAccount
  namespace: metrics
  name: default
---
apiVersion: v1
kind: Service
metadata:
  namespace: metrics
  name: prometheus
  labels:
    prometheus: gmp
spec:
  type: ClusterIP
  selector:
    app: prometheus
    prometheus: gmp
  ports:
  - name: web
    port: 9090
    targetPort: web
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: metrics
  name: prometheus
  labels:
    prometheus: gmp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      prometheus: gmp
  serviceName: prometheus
  template:
    metadata:
      labels:
        app: prometheus
        prometheus: gmp
    spec:
      automountServiceAccountToken: true
      containers:
      - name: prometheus
        image: gke.gcr.io/prometheus-engine/prometheus:v2.35.0-gmp.2-gke.0
        args:
        - --config.file=/prometheus/config_out/config.yaml
        - --storage.tsdb.path=/prometheus/data
        - --storage.tsdb.retention.time=24h
        - --web.enable-lifecycle
        - --storage.tsdb.no-lockfile
        - --web.route-prefix=/
        ports:
        - name: web
          containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        volumeMounts:
        - name: config-out
          mountPath: /prometheus/config_out
          readOnly: true
        - name: prometheus-db
          mountPath: /prometheus/data
      - name: config-reloader
        image: gke.gcr.io/prometheus-engine/config-reloader:v0.4.3-gke.0
        args:
        - --config-file=/prometheus/config/config.yaml
        - --config-file-output=/prometheus/config_out/config.yaml
        - --reload-url=http://localhost:9090/-/reload
        - --listen-address=:19091
        ports:
        - name: reloader-web
          containerPort: 19091  # must match the --listen-address flag above
        resources:
          requests:
            cpu: 250m
            memory: 500Mi
        volumeMounts:
        - name: config
          mountPath: /prometheus/config
        - name: config-out
          mountPath: /prometheus/config_out
      terminationGracePeriodSeconds: 600
      volumes:
      - name: prometheus-db
        emptyDir: {}
      - name: config
        configMap:
          name: prometheus
          defaultMode: 420
      - name: config-out
        emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metrics
  name: prometheus
  labels:
    prometheus: gmp
data:
  config.yaml: |
    global:
      scrape_interval: 60s

    scrape_configs:
    # Let Prometheus scrape itself.
    - job_name: prometheus
      static_configs:
      - targets: ['localhost:9090']
    # Scrape pods with label app=web across all namespaces, on the container port named "web".
    - job_name: web
      metrics_path: /metrics
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: web
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_container_port_name]
        regex: (.+);(.+)
        target_label: instance
        replacement: $1:$2
        action: replace
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: web
        action: keep
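
To deploy it (assuming the manifest above is saved as prometheus.yaml and the namespace doesn't already exist):

kubectl create namespace metrics
kubectl apply -f prometheus.yaml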

rojomisin avatar Aug 26 '22 18:08 rojomisin

This is still broken in GKE Autopilot 1.24. Not sure why this was closed?

philip-harvey avatar Sep 20 '22 14:09 philip-harvey

I'll reopen to track the overall AP initiative. Still working on it.

lyanco avatar Sep 20 '22 14:09 lyanco

It is especially nefarious in GKE Autopilot 1.24 because workload metrics have been deprecated.

brokenjacobs avatar Sep 20 '22 14:09 brokenjacobs

I understand - fixing this on AP is our top priority.

lyanco avatar Sep 20 '22 14:09 lyanco

Is this still being worked on? I have the same error.

I'm running v1.23.12-gke.100 on my cluster, with a few workloads that follow this template:

---
apiVersion: v1
kind: Service
metadata:
  name: accounts-api
  namespace: app
  labels:
    component: accounts-api
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    component: accounts-api
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: accounts-api
  namespace: app
  annotations:
    kubernetes.io/ingress.class: "gce-internal"
spec:
  rules:
    - host: aaaa-accounts-api.clg.nos.internal
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: accounts-api  # matches the Service defined above
                port:
                  number: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: accounts-api
  namespace: app
  labels:
    app: accounts-api
spec:
  selector:
    matchLabels:
      component: accounts-api
  template:
    metadata:
      labels:
        component: accounts-api
        istio-injection: enabled
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "80"
        prometheus.io/scrape: "true"
        prometheus.io/alarmgroup: "users"
    spec:
      volumes:
        - name: accounts-configmap
          configMap:
            name: accounts-configmap
      containers:
        - name: accounts-api
          image: my-image
          ports:
            - name: tcp80
              containerPort: 80
              protocol: TCP
          volumeMounts:
            - name: accounts-configmap
              mountPath: /app/appsettings.json
              subPath: appsettings.json
              readOnly: true
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: accounts-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: accounts-api
  minReplicas: 1 
  maxReplicas: 10 
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: accounts-api
  namespace: app
spec:
  selector:
    matchLabels:
      component: accounts-api
  endpoints:
    - port: 80
      interval: 30s
      path: /metrics
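
(For reference, the endpoint can also target the container port by name rather than number - handy when several containers expose port 80. A variant of the block above:)

  endpoints:
    - port: tcp80   # named container port from the Deployment above
      interval: 30s
      path: /metrics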

Is there anything I can do to work around this and get Prometheus to read the metrics from /metrics on port 80?

What I already tried:

  • Recreating the cluster
  • Moving the services to the gke-gmp-system namespace
  • Adding the component label to the deployments
  • Disabling managed Prometheus and running it as a manual service

phenriques740 avatar Oct 11 '22 14:10 phenriques740

Yes, still working on this. AP is tricky, as you've encountered. We're working on enabling this by default so all this struggle goes away.

lyanco avatar Oct 11 '22 15:10 lyanco

I am guessing that this is already a known issue when running the tutorial on Autopilot?

kubectl -n example apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.5.0/examples/pod-monitoring.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.5.0/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate signed by unknown authority
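
The CA bundle the API server uses to verify the webhook's serving certificate can be dumped for inspection (a rough check; the configuration name here is guessed from the webhook name in the error and may differ):

kubectl get mutatingwebhookconfigurations gmp-operator.gke-gmp-system.monitoring.googleapis.com \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d \
  | openssl x509 -noout -subject -issuer -dates

An empty caBundle would line up with the "Setting CA bundle ... failed" operator log messages earlier in this thread.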

mihai1voicescu avatar Nov 02 '22 14:11 mihai1voicescu

AP is still not supported - the latest news is that we are almost done with 1.25 support, and will then make it work on 1.24. Stay tuned.

lyanco avatar Nov 03 '22 15:11 lyanco