prometheus-engine
TLS Error configuring a PodMonitoring resource in GKE Autopilot cluster
Hi,
I am attempting to follow the steps here to configure managed collection on GKE Autopilot.
When attempting to apply any PodMonitoring resource, I get the following error:
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
In the logs for the gmp-operator pod in the gke-gmp-system namespace I see the following errors:
- "validatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
- "Setting CA bundle for ValidatingWebhookConfiguration failed"
- "mutatingwebhookconfigurations.admissionregistration.k8s.io "gmp-operator.gmp-system.monitoring.googleapis.com" is forbidden: User "system:serviceaccount:gke-gmp-system:operator" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope"
- "Setting CA bundle for MutatingWebhookConfiguration failed"
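For reference, the missing permission can be reproduced from outside the operator with an impersonated access check (plain kubectl; this assumes your own account is allowed to impersonate service accounts):

kubectl auth can-i get validatingwebhookconfigurations \
  --as=system:serviceaccount:gke-gmp-system:operator
kubectl auth can-i get mutatingwebhookconfigurations \
  --as=system:serviceaccount:gke-gmp-system:operator

Both should print "no" if the operator's RBAC is missing the admissionregistration.k8s.io permissions named in the errors above.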
This seems in some ways similar to the following issues:
- https://github.com/GoogleCloudPlatform/prometheus-engine/issues/151
- https://github.com/GoogleCloudPlatform/prometheus-engine/issues/178
- https://github.com/GoogleCloudPlatform/prometheus-engine/issues/186
but it is notably different, since it is a certificate error rather than a timeout.
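If it helps triage: the mismatch can be confirmed by listing which service each webhook configuration actually calls (plain kubectl over standard admissionregistration.k8s.io/v1 fields; nothing GMP-specific assumed):

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.webhooks[0].clientConfig.service.namespace}{"/"}{.webhooks[0].clientConfig.service.name}{"\n"}{end}'

If the GMP entries point at gke-gmp-system/gmp-operator while the certificate (per the error) only covers gmp-operator.gmp-system.svc, that would explain the failure.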
Hi Jishnu,
GMP Autopilot is rolling out as we speak to clusters >= 1.23. Once that's done, it'll be enabled by default, and you should be able to just start deploying PodMonitoring CRs. You caught this at a strange time mid-rollout where weirdness like this can happen.
We may have to actually disable that checkbox to prevent this error from happening, so thanks for the tip.
I'm just trying to use the example app after enabling GMP on a GKE cluster, and it's failing:
% k apply -f pod-monitoring.yaml
Error from server (InternalError): error when creating "pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
% kubectl -n gke-gmp-system port-forward svc/gmp-operator 8443
error: Pod 'gmp-operator-647b69cbdd-vdkt5' does not have a named port 'webhook'
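(For anyone else debugging this: one way to read the SANs the webhook certificate actually serves, without relying on the named port, is a throwaway openssl pod. This is just an illustrative sketch; the alpine/openssl image is an arbitrary choice, and it assumes the service is reachable in-cluster on 443 and that your local machine has openssl for the final parsing step.)

kubectl run cert-check --rm -i --restart=Never --image=alpine/openssl -- \
  s_client -connect gmp-operator.gke-gmp-system.svc:443 -showcerts </dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'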
The checkbox to enable GMP in Autopilot isn't functional. When the rollout is completed, GMP will be enabled by default.
Surely it does something; I've been toggling it back and forth a few times, with 30-minute "grey busy notification" cluster messages in between. 🤔 Rollout completion ETA?
It is a bug - that checkbox doesn't do anything on AP clusters besides put you in a broken state. We're going to disable it.
All AP clusters >=1.23 should have GMP on by default by end of this week.
Another victim of this footgun checking in.
To confirm, what is the status for AP clusters on 1.22? Is the checkbox still non-functional? Is there a way to get Managed Prometheus?
@lyanco
+1 for this bug. So for now we should roll back this checkbox (uncheck it) to get out of the broken state until a future release makes it work, right?
Btw, my current project is on v1.21, and we cannot upgrade due to usage of a deprecated API, but afaict it must be one of Autopilot's own components that uses it. So an Autopilot component uses a deprecated API, and that blocks the upgrade. We do not have any paid Google support; can you tell us where we should report a bug for this?
GMP will only be on for clusters 1.23+, but the AP regular channel is updating to 1.23 at the end of this month... so the fact that it isn't in 1.22 will be super transient.
@iamolegga yes, I would uncheck it for now. We'll leave the gcloud ... --disable-managed-prometheus option in, which will do the same thing, so people will be able to get themselves out of this bug.
If you email me details of your logs I can try to forward them to the right team... email is my GitHub username at the domain of the company I work for.
Closing - the pencil has been disabled.
If you enabled this previously and need to disable it to get you out of the broken state, you can run gcloud beta container clusters update CLUSTER_NAME --disable-managed-prometheus
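Spelled out fully, with CLUSTER_NAME and REGION as placeholders (zonal clusters take --zone instead; the describe call is only there to confirm the change took effect):

gcloud beta container clusters update CLUSTER_NAME \
  --region=REGION \
  --disable-managed-prometheus
# Empty or False output here means managed collection is off:
gcloud container clusters describe CLUSTER_NAME --region=REGION \
  --format='value(monitoringConfig.managedPrometheusConfig.enabled)'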
@lyanco just tried with cluster version 1.23.8-gke.1900 and got the same error.
It's not fully rolled out yet.
so this is for Rapid channel only?
@rojomisin yes, but still not available 😅 checked just now
Regular should be upgraded to 1.23 by end of August, I believe... so should be pretty much all AP clusters soon.
Can you help me understand how GKE Autopilot Stable-channel clusters can now use managed Prometheus?
I've gone through the steps, but I continue to get the error relating to the TLS cert not matching
Error from server (InternalError): error when creating "ingress-nginx/metrics.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
Is this guide on setting up managed collection applicable to Stable-channel GKE Autopilot? https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed
I don't know if we can wait a few weeks, but the instructions seem to be in disarray; please point me to a working setup guide for Stable if you can.
Hey all - we had an issue with the rollout and GMP is still not usable on AP clusters. Will update you here when I have an estimated fix date. It's a top priority.
Re: the Stable channel, I believe it will update to 1.23.6 by the end of this month, so it should be a transitory issue.
Thanks for the update; we look forward to being able to use GMP on AP soon. However, we do not want to use the Rapid or Regular channels in production.
If Stable does update to 1.23.6 by, let's say, 9/1, then why is a new AP Stable cluster deploying version 1.21? In other words, how can we configure AP Stable clusters now so they pick up the 1.23 Stable version, rather than 1.21, when it's released? 🤔
Additionally, does that mean that for a Stable AP production cluster we would want to use unmanaged collection until the AP GMP rollout is resolved and readily available in the Stable channel?
AP clusters are statically versioned; all stable clusters will be moved to 1.23 when they upgrade the channel. You can't choose versions within the channel.
We're working on fixing AP as soon as we can; I'll let you know when I have a target date for the stable channel upgrading.
Keeping an eye on this thread to use GMP on a regular channel private AP cluster
As a workaround, because Stable won't be ready for a few months I'm guessing, we modified the forked Prometheus manifest and it works well. You just have to set up Workload Identity for the metrics namespace:
https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
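In case it saves someone a lookup, the Workload Identity wiring for that namespace boils down to something like the following; gmp-test-sa and PROJECT_ID are placeholders, and the linked doc remains the authoritative reference:

# Let the Google service account write metrics:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role=roles/monitoring.metricWriter
# Allow the Kubernetes SA metrics/default to act as that Google SA:
gcloud iam service-accounts add-iam-policy-binding \
  gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[metrics/default]"
# Annotate the Kubernetes SA so pods in the metrics namespace use it:
kubectl annotate serviceaccount default --namespace=metrics \
  iam.gke.io/gcp-service-account=gmp-test-sa@PROJECT_ID.iam.gserviceaccount.com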
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics:prometheus
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics:prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics:prometheus
subjects:
- kind: ServiceAccount
  namespace: metrics
  name: default
---
apiVersion: v1
kind: Service
metadata:
  namespace: metrics
  name: prometheus
  labels:
    prometheus: gmp
spec:
  type: ClusterIP
  selector:
    app: prometheus
    prometheus: gmp
  ports:
  - name: web
    port: 9090
    targetPort: web
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: metrics
  name: prometheus
  labels:
    prometheus: gmp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      prometheus: gmp
  serviceName: prometheus
  template:
    metadata:
      labels:
        app: prometheus
        prometheus: gmp
    spec:
      automountServiceAccountToken: true
      containers:
      - name: prometheus
        image: gke.gcr.io/prometheus-engine/prometheus:v2.35.0-gmp.2-gke.0
        args:
        - --config.file=/prometheus/config_out/config.yaml
        - --storage.tsdb.path=/prometheus/data
        - --storage.tsdb.retention.time=24h
        - --web.enable-lifecycle
        - --storage.tsdb.no-lockfile
        - --web.route-prefix=/
        ports:
        - name: web
          containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
        volumeMounts:
        - name: config-out
          mountPath: /prometheus/config_out
          readOnly: true
        - name: prometheus-db
          mountPath: /prometheus/data
      - name: config-reloader
        image: gke.gcr.io/prometheus-engine/config-reloader:v0.4.3-gke.0
        args:
        - --config-file=/prometheus/config/config.yaml
        - --config-file-output=/prometheus/config_out/config.yaml
        - --reload-url=http://localhost:9090/-/reload
        - --listen-address=:19091
        ports:
        - name: reloader-web
          containerPort: 19091  # matches --listen-address above
        resources:
          requests:
            cpu: 250m
            memory: 500Mi
        volumeMounts:
        - name: config
          mountPath: /prometheus/config
        - name: config-out
          mountPath: /prometheus/config_out
      terminationGracePeriodSeconds: 600
      volumes:
      - name: prometheus-db
        emptyDir: {}
      - name: config
        configMap:
          name: prometheus
          defaultMode: 420
      - name: config-out
        emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metrics
  name: prometheus
  labels:
    prometheus: gmp
data:
  config.yaml: |
    global:
      scrape_interval: 60s
    scrape_configs:
    # Let Prometheus scrape itself.
    - job_name: prometheus
      static_configs:
      - targets: ['localhost:9090']
    # Scrape pods with label app=web across all namespaces, on the container port named "web".
    - job_name: web
      metrics_path: /metrics
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: web
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_container_port_name]
        regex: (.+);(.+)
        target_label: instance
        replacement: $1:$2
        action: replace
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: web
        action: keep
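If you want to try the same workaround, the rough order is (the file name is whatever you saved the manifest above as; the namespace has to exist before the apply):

kubectl create namespace metrics
kubectl apply -f prometheus-gmp.yaml
# Sanity-check the scrape targets locally, then open http://localhost:9090/targets:
kubectl -n metrics port-forward svc/prometheus 9090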
This is still broken in GKE Autopilot 1.24. Not sure why this was closed?
I'll reopen to track the overall AP initiative. Still working on it.
It is especially nefarious in GKE Autopilot 1.24 because workload metrics have been deprecated.
I understand - fixing this on AP is our top priority.
Is this still being worked on? I have the same error.
I'm running v1.23.12-gke.100 on my cluster, with a few workloads that follow this template:
---
apiVersion: v1
kind: Service
metadata:
  name: accounts-api
  namespace: app
  labels:
    component: accounts-api
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    component: accounts-api
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: accounts-api
  namespace: app
  annotations:
    kubernetes.io/ingress.class: "gce-internal"
spec:
  rules:
  - host: aaaa-accounts-api.clg.nos.internal
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: aaaa-accounts-api
            port:
              number: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: accounts-api
  namespace: app
  labels:
    app: accounts-api
spec:
  selector:
    matchLabels:
      component: accounts-api
  template:
    metadata:
      labels:
        component: accounts-api
        istio-injection: enabled
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "80"
        prometheus.io/scrape: "true"
        prometheus.io/alarmgroup: "users"
    spec:
      volumes:
      - name: accounts-configmap
        configMap:
          name: accounts-configmap
      containers:
      - name: accounts-api
        image: my-image
        ports:
        - name: tcp80
          containerPort: 80
          protocol: TCP
        volumeMounts:
        - name: accounts-configmap
          mountPath: /app/appsettings.json
          subPath: appsettings.json
          readOnly: true
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: accounts-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: accounts-api
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: accounts-api
  namespace: app
spec:
  selector:
    matchLabels:
      component: accounts-api
  endpoints:
  - port: 80
    interval: 30s
    path: /metrics
Is there anything I can do to work around this and get Prometheus to read the metrics from /metrics on port 80?
What I already tried:
- Recreating the cluster
- Moving the services to the gke-gmp-system namespace
- Adding a component label to the deployments
- Disabling managed Prometheus and running a manual service instead
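(If the apply ever goes through for you, a sanity check worth running, using only standard kubectl plus the names from the manifest above, and assuming the operator runs as the gmp-operator Deployment seen earlier in this thread:)

kubectl -n app get podmonitoring accounts-api -o yaml
# The operator's own logs often show admission or scrape errors:
kubectl -n gke-gmp-system logs deploy/gmp-operator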
Yes, still working on this. AP is tricky, as you've encountered. We're working on making this on by default so all this struggle goes away.
I am guessing that this is already a known issue when running the tutorial on Autopilot?
kubectl -n example apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.5.0/examples/pod-monitoring.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.5.0/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate signed by unknown authority
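For what it's worth, "certificate signed by unknown authority" (rather than the earlier name mismatch) would be consistent with the webhook's caBundle never being injected, which matches the "Setting CA bundle ... failed" operator logs quoted at the top of this thread. A rough check; the configuration name here is inferred from the error string, so adjust it if yours differs:

kubectl get validatingwebhookconfiguration \
  gmp-operator.gke-gmp-system.monitoring.googleapis.com \
  -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | wc -c
# 0 means no CA bundle was ever set on the webhook.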
AP still not supported - the latest news is we are almost done with 1.25 support, and then will make it work on 1.24. Stay tuned.
Update here: we have 1.25+ support on Autopilot clusters.
Will keep this issue open as we work on 1.24...