
[BUG] apply PodMonitoring | timeout=10s": context deadline exceeded - GKE autopilot

Open raphaelauv opened this issue 2 years ago • 16 comments

kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.3.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gmp-system.svc:8443/default/monitoring.googleapis.com/v1alpha1/podmonitorings?timeout=10s": context deadline exceeded

raphaelauv avatar Apr 07 '22 14:04 raphaelauv

Hello,

All PodMonitoring CRs are checked with a validating webhook that is hosted in the operator pod. A few questions:

  • Is the operator pod running without errors?
  • Do you have a firewall rule preventing the control plane from talking to the operator pod?
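If it helps, a few commands that usually answer those questions (a sketch; gmp-system is the default namespace on GKE):

kubectl -n gmp-system get pods                  # is the operator pod Running and Ready?
kubectl -n gmp-system logs deploy/gmp-operator  # any webhook or TLS errors in its logs?
kubectl -n gmp-system get svc gmp-operator      # which port does the webhook service expose?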

pintohutch avatar Apr 12 '22 16:04 pintohutch

Hello, yes, the operator pod is running without errors.

I do not have any firewall rules.

My cluster is GKE Autopilot and there is nothing custom installed in it.

I just followed these instructions -> https://github.com/GoogleCloudPlatform/prometheus-engine/issues/148#issuecomment-1055599211

https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed

raphaelauv avatar Apr 12 '22 16:04 raphaelauv

There was a similar issue at GoogleCloudPlatform/prometheus-engine#178.

Could you try the instructions from the last comment to see if that helps?

pintohutch avatar Apr 12 '22 17:04 pintohutch

I'm seeing something similar when trying to update the PodMonitoring resources in our GKE cluster. The result of this command:

kubectl get deploy gmp-operator -ngmp-system -ojsonpath="{.spec.template.metadata.annotations['components\.gke\.io/component-version']}"

is 0.2.9

When trying to run:

kubectl apply -f pod-monitoring.yaml

I get the following error message:

Error from server (InternalError): error when creating "pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": no service port 443 found for service "gmp-operator"

Note the port in the error message - it's looking for the gmp-operator service on port 443, which is strange, because the gmp-operator service exposes port 8443.

If I manually change the gmp-operator service to listen to port 443, then the above command to create the PodMonitoring resource works.
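For anyone hitting the same thing, the manual change was roughly the following (a sketch; the 8443 targetPort is what the operator pod exposed on our version, and the port name is just illustrative):

kubectl -n gmp-system patch service gmp-operator --type=json \
  -p='[{"op":"add","path":"/spec/ports/-","value":{"name":"webhook-443","port":443,"targetPort":8443,"protocol":"TCP"}}]'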

We also haven't changed anything in our network configuration, and we already had the 8443 firewall rule in place. We had previously created PodMonitoring objects successfully in this cluster around the 10th of May.

bogdan-dumitrescu avatar May 31 '22 14:05 bogdan-dumitrescu

Hi @bogdan-dumitrescu - thanks for reporting. We recently changed the gmp-operator's port to 443 to avoid requiring people to create firewall rules. In your case, it seems a partial update of the component happened, where the webhooks were updated but the service and deployment were not.

Let me dig into this. I'm curious, does the manual fix get overwritten by the GKE control plane shortly after? Or does your manual fix persist?

pintohutch avatar May 31 '22 14:05 pintohutch

@pintohutch Thanks for looking into this! As far as I can see, the manual fix does not get overwritten.

bogdan-dumitrescu avatar May 31 '22 14:05 bogdan-dumitrescu

We ran into a similar problem

kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.1/examples/pod-monitoring.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.1/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gmp-system.svc:8443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": context deadline exceeded

We are running a private GKE cluster.

How do we enable this for private clusters?

naveensrinivasan avatar May 31 '22 17:05 naveensrinivasan

Hi @naveensrinivasan - for private clusters you need to open a firewall as documented here: https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#control-plane-firewall
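For reference, the rule looks roughly like this (a sketch; the rule name, network, node tag, and control-plane CIDR are placeholders - the linked doc is authoritative):

gcloud compute firewall-rules create allow-gmp-operator-webhook \
  --network NETWORK_NAME \
  --direction INGRESS \
  --action ALLOW \
  --rules tcp:8443 \
  --source-ranges CONTROL_PLANE_CIDR \
  --target-tags NODE_TAG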

With our next release, this shouldn't be a requirement btw.

@bogdan-dumitrescu - what happens if you run

kubectl get customresourcedefinition podmonitorings.monitoring.googleapis.com -ojsonpath="{.metadata.annotations['components\.gke\.io/component-version']}"

Would you be able to provide your cluster version? And is this a private cluster? I'm curious whether there are any firewall rules blocking those ports.
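Something like this shows both (a sketch; CLUSTER_NAME and ZONE are placeholders):

gcloud container clusters describe CLUSTER_NAME --zone ZONE \
  --format="value(currentMasterVersion,privateClusterConfig.enablePrivateNodes)"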

pintohutch avatar May 31 '22 20:05 pintohutch

@pintohutch I get 0.2.10 if I run the above command. We're running a private cluster, version 1.23.6-gke.1700. Other than allowing the 8443 port, we did not configure any custom firewall rules.

bogdan-dumitrescu avatar Jun 01 '22 07:06 bogdan-dumitrescu

Gotcha. Thanks for calling this to our attention @bogdan-dumitrescu! There's a minor bug in our upgrade procedure that, in some cases, upgrades everything except the operator deployment, and that's what affected you.

I'm glad that you found a remediation to unblock you. We should have a fix for this rolled out soon. Stay tuned.

pintohutch avatar Jun 01 '22 14:06 pintohutch

https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#control-plane-firewall

That worked. Thanks

naveensrinivasan avatar Jun 01 '22 21:06 naveensrinivasan

@pintohutch

Running into forbidden errors for the Prometheus UI and Grafana in private GKE clusters.

[screenshots of the forbidden errors from the Prometheus UI and Grafana]

I have the firewall opened.

naveensrinivasan avatar Jun 08 '22 14:06 naveensrinivasan

Is this set to your actual project id? https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/examples/frontend.yaml#L38

It should be overwritten with your actual project id
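A quick way to double-check what the deployed frontend actually got (a sketch; this assumes the Deployment is named frontend and lives in gmp-test as in the setup docs):

kubectl -n gmp-test get deploy frontend -o jsonpath='{.spec.template.spec.containers[0].args}'

If the output still shows the placeholder rather than your project id, re-apply the manifest with it substituted.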

pintohutch avatar Jun 08 '22 15:06 pintohutch

Is this set to your actual project id? https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/examples/frontend.yaml#L38

It should be overwritten with your actual project id

That was the issue! Thanks!

naveensrinivasan avatar Jun 08 '22 17:06 naveensrinivasan

Same error here. I see that the TLS cert does not match the hostname: gmp-operator.gke-gmp-system.svc vs gmp-operator.gmp-system.svc. There's an extra "gke" in the namespace used by the operator deployed via gcloud beta container clusters update $CLUSTER_NAME --enable-managed-prometheus --zone $ZONE. I suspect the certificate signing request needs updating.

from https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gmp-pod-monitoring

kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/pod-monitoring.yaml

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/pod-monitoring.yaml": Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": x509: certificate is valid for gmp-operator, gmp-operator.gmp-system, gmp-operator.gmp-system.svc, not gmp-operator.gke-gmp-system.svc

Namespaces:

kubectl get namespaces                                              

NAME              STATUS   AGE
default           Active   19d
gke-gmp-system    Active   16h
gmp-public        Active   16h
gmp-test          Active   7m31s
kube-node-lease   Active   19d
kube-public       Active   19d
kube-system       Active   19d
load-test         Active   19d
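The webhook configurations do point at the new gke-gmp-system namespace, which can be confirmed with something like this (a sketch; the exact configuration names vary between versions):

kubectl get validatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.webhooks[0].clientConfig.service.namespace}{"/"}{.webhooks[0].clientConfig.service.name}{"\n"}{end}'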

Any workaround?

avbenavides avatar Aug 24 '22 07:08 avbenavides

@avbenavides not yet. Subscribe to comments in #300.

iamolegga avatar Aug 24 '22 07:08 iamolegga

Closing this issue as it's become a conflation of several issues - the last of which is captured in #300.

pintohutch avatar Sep 12 '22 01:09 pintohutch