prometheus-engine
[BUG] apply PodMonitoring | timeout=10s": context deadline exceeded - GKE autopilot
kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gmp-system.svc:8443/default/monitoring.googleapis.com/v1alpha1/podmonitorings?timeout=10s": context deadline exceeded
Hello,
All PodMonitoring CRs are checked with a validating webhook that is hosted in the operator pod. A few questions:
- Is the operator pod running without errors?
- Do you have a firewall rule preventing the control plane from talking to the operator pod?
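Both questions can be checked quickly with something like the following (a diagnostic sketch; the gmp-system namespace comes from the error above, while the label selector is an assumption and may differ on your cluster):

```shell
# Check that the operator pod is running and look at recent logs for errors.
kubectl -n gmp-system get pods
kubectl -n gmp-system logs deploy/gmp-operator --tail=50

# Inspect the webhook configuration to see which service and port the
# control plane is expected to reach when validating PodMonitoring CRs.
kubectl get validatingwebhookconfigurations -o wide
```

If the control plane cannot reach the service/port shown in the webhook configuration, a firewall rule is the usual culprit on private clusters.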
Hello, yes, the operator pod is running without errors, and I do not have any firewall rules. My cluster is GKE Autopilot with nothing custom installed inside, and I just followed these instructions:
https://github.com/GoogleCloudPlatform/prometheus-engine/issues/148#issuecomment-1055599211
https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed
There was a similar issue at GoogleCloudPlatform/prometheus-engine#178.
Could you try the instructions from the last comment to see if that helps?
I'm seeing something similar when trying to update the PodMonitoring resources in our GKE cluster. The result of this command:
kubectl get deploy gmp-operator -ngmp-system -ojsonpath="{.spec.template.metadata.annotations['components.gke.io/component-version']}"
is 0.2.9
When trying to run:
kubectl apply -f pod-monitoring.yaml
I get the following error message:
Error from server (InternalError): error when creating "pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": no service port 443 found for service "gmp-operator"
Note the port in the error message: it's looking for the gmp-operator service on port 443, which is really strange, because the gmp-operator service exposes port 8443.
If I manually change the gmp-operator service to listen on port 443, then the above command to create the PodMonitoring resource works.
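The manual fix described here could be applied with a patch along these lines (a sketch, not the official remediation; it assumes the webhook port is the first entry in .spec.ports and that the operator container listens on 8443):

```shell
# Hypothetical patch: expose the webhook Service on 443 while still
# targeting the operator's 8443 container port.
kubectl -n gmp-system patch service gmp-operator --type=json -p='[
  {"op": "replace", "path": "/spec/ports/0/port", "value": 443},
  {"op": "replace", "path": "/spec/ports/0/targetPort", "value": 8443}
]'
```

Since gmp-operator is a GKE-managed component, the control plane may reconcile this change away, which is exactly what the next comment asks about.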
We also haven't changed anything in our network configuration, and we already had the 8443 firewall rule added to it. We had previously successfully created PodMonitoring objects in our cluster around the 10th of May.
Hi @bogdan-dumitrescu - thanks for reporting. We recently changed the gmp-operator's port to 443 to avoid requiring people to create firewall rules. In your case, it seems that a partial update of the component happened, where the webhooks were updated but the service and deployment were not.
Let me dig into this. I'm curious, does the manual fix get overwritten by the GKE control plane shortly after? Or does your manual fix persist?
@pintohutch Thanks for looking into this! As far as I can see, the manual fix does not get overwritten.
We ran into a similar problem
kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.1/examples/pod-monitoring.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gmp-system.svc:8443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": context deadline exceeded
We are running a private GKE cluster. How do we enable this for private clusters?
Hi @naveensrinivasan - for private clusters you need to open a firewall as documented here: https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#control-plane-firewall
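The rule from that page boils down to allowing the control plane to reach the operator on TCP 8443. A sketch of the gcloud command (the rule name, network name, master CIDR, and node tag are placeholders you must substitute with your cluster's values):

```shell
# Hypothetical firewall rule: let the private cluster's control plane
# reach the gmp-operator webhook on the nodes.
gcloud compute firewall-rules create allow-gmp-operator-webhook \
  --network=NETWORK_NAME \
  --direction=INGRESS \
  --source-ranges=MASTER_IPV4_CIDR \
  --target-tags=NODE_TAG \
  --allow=tcp:8443
```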
With our next release, this shouldn't be a requirement btw.
@bogdan-dumitrescu - what happens if you run
kubectl get customresourcedefinition podmonitorings.monitoring.googleapis.com -ojsonpath="{.metadata.annotations['components\.gke\.io/component-version']}"
Would you be able to provide your cluster version? And is this a private cluster? I'm curious if there's any firewall rules blocking those ports.
@pintohutch I get 0.2.10 if I run the above command. We're running a private cluster, version 1.23.6-gke.1700. Other than allowing the 8443 port, we did not configure any custom firewall rules.
Gotcha. Thanks for calling this to our attention @bogdan-dumitrescu! There's a minor bug with our upgrade procedure that upgrades everything except for the operator deployment in some cases, which affected you.
I'm glad that you found a remediation to unblock you. We should have a fix for this rolled out soon. Stay tuned.
https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#control-plane-firewall
That worked. Thanks @pintohutch!
Running into forbidden errors for Prometheus UI and Grafana in Private GKE Clusters
I have the firewall opened.
Is this set to your actual project ID? https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/examples/frontend.yaml#L38
It should be replaced with your actual project ID.
That was the issue! Thanks!
Same error here. I see that the TLS cert does not match the hostname: gmp-operator.gke-gmp-system.svc vs gmp-operator.gmp-system.svc. There's an extra gke- prefix in the namespace of the operator deployed using gcloud beta container clusters update $CLUSTER_NAME --enable-managed-prometheus --zone $ZONE.
I suspect the certificate signing request needs updating.
from https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gmp-pod-monitoring
kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/pod-monitoring.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc
Namespaces:
kubectl get namespaces
NAME STATUS AGE
default Active 19d
gke-gmp-system Active 16h
gmp-public Active 16h
gmp-test Active 7m31s
kube-node-lease Active 19d
kube-public Active 19d
kube-system Active 19d
load-test Active 19d
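One way to confirm the SAN mismatch from the x509 error above (a diagnostic sketch; it port-forwards the webhook Service locally and dumps the certificate's subjectAltName, assuming OpenSSL 1.1.1+ for the -ext flag):

```shell
kubectl -n gke-gmp-system port-forward svc/gmp-operator 8443:443 &
sleep 2
# Dump the SANs on the serving cert; per the error, they cover
# gmp-operator.gmp-system.svc but not gmp-operator.gke-gmp-system.svc.
openssl s_client -connect localhost:8443 </dev/null 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
kill %1
```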
Any workaround?
@avbenavides Not yet. Subscribe to comments in #300.
Closing this issue as it's become a conflation of several issues - the last of which is captured in #300.