
[kube-prometheus-stack] Helm-upgrade fails due to admission-webhooks being unable to reach service

Open Jaskaranbir opened this issue 3 years ago • 37 comments

Describe the bug: It seems like the admission webhooks aren't patched to ignore failures during Helm upgrades. This causes the admission webhooks to fail while the pod is being patched in single-replica deployments.

Version of Helm and Kubernetes: Helm Version:

$ helm version
version.BuildInfo{Version:"v3.5.4", GitCommit:"1b5edb69df3d3a08df77c9902dc17af864ff05d1", GitTreeState:"clean", GoVersion:"go1.15.11"}

Kubernetes Version:

$ kubectl version
v1.19.9-gke.1900

Which chart: kube-prometheus-stack

Which version of the chart: 16.1.2

What happened:

Error: UPGRADE FAILED: cannot patch "kube-prometheus-kube-prome-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-k8s.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-apiserver-availability.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-apiserver-slos" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-apiserver.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-prometheus-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-prometheus-node-recording.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-scheduler.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kube-state-metrics" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch 
"kube-prometheus-kube-prome-kubelet.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-apps" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-resources" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-storage" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-system-apiserver" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-system-controller-manager" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-system-kubelet" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-system-scheduler" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-kubernetes-system" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-node-exporter.rules" with kind 
PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-node-exporter" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-node-network" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-node.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-prometheus-operator" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator" && cannot patch "kube-prometheus-kube-prome-prometheus" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://kube-prometheus-kube-prome-operator.test-monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": no service port 443 found for service "kube-prometheus-kube-prome-operator"

What you expected to happen:

The Helm upgrade to complete successfully.

How to reproduce it (as minimally and precisely as possible):

Upgrade the Helm chart; the failure seems to occur randomly. Sometimes the upgrade works.

Changed values of values.yaml (only put values which differ from the defaults):

values.yaml

defaultRules:
  rules:
    alertmanager: false
    etcd: false

alertmanager:
  enabled: false

coreDns:
  enabled: false

kubeDns:
  enabled: false

kubeEtcd:
  enabled: false

prometheusOperator:
  tls:
    enabled: false

kubeStateMetrics:
  enabled: false

nodeExporter:
  enabled: false

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false

The helm command that you execute and failing/misfunctioning:

For example:

helm upgrade --install my-release prometheus-community/kube-prometheus-stack --version 16.1.2 --values values.yaml

Helm values set after installation/upgrade:

alertmanager:
  enabled: false
coreDns:
  enabled: false
defaultRules:
  rules:
    alertmanager: false
    etcd: false
kubeDns:
  enabled: false
kubeEtcd:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
prometheusOperator:
  tls:
    enabled: false

Anything else we need to know:

Something to keep in mind is that the Prometheus cluster is single-replica, so upgrades are not highly available.

We currently use this patch job to set up the certs/failure policy post-install: https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/job-patch/job-patchWebhook.yaml
The same job could be used to set just the failure policy (to Ignore) pre-install.
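
A rough sketch of what a pre-install/pre-upgrade variant of that job could look like, mirroring the flags of the chart's existing patch job (the resource names, namespace, image tag, and RBAC wiring below are placeholders/assumptions, not values taken from the chart, so verify them against the actual templates):

apiVersion: batch/v1
kind: Job
metadata:
  # Hypothetical name; the chart would normally derive this from the release name
  name: kube-prometheus-admission-pre-patch
  annotations:
    # Run before install/upgrade so the webhooks tolerate the operator service being unavailable
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: OnFailure
      # Needs a service account bound to RBAC that allows patching webhook configurations
      serviceAccountName: kube-prometheus-admission
      containers:
        - name: patch
          # Same image family the chart's patch jobs use; the tag here is an assumption
          image: k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.0
          args:
            - patch
            - --webhook-name=kube-prometheus-kube-prome-admission   # hypothetical, must match the release's webhook configs
            - --namespace=test-monitoring
            - --secret-name=kube-prometheus-kube-prome-admission
            - --patch-failure-policy=Ignore

The chart's regular post-install/post-upgrade patch job would then restore the failure policy to Fail after the upgrade finishes.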

We are currently using a temporary workaround that disables admission-webhook failures before the install, which resolved the issue:

kubectl get ValidatingWebhookConfiguration -o yaml \
  | sed "s/failurePolicy: Fail/failurePolicy: Ignore/" \
  | kubectl apply -f -
kubectl get MutatingWebhookConfiguration -o yaml \
  | sed "s/failurePolicy: Fail/failurePolicy: Ignore/" \
  | kubectl apply -f -

But a more permanent solution within the chart would be great. I don't mind creating a PR for this if the community agrees it is a viable solution.
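
In the meantime, the bluntest values-level escape hatch seems to be disabling the admission webhooks entirely. A minimal sketch; whether a failurePolicy key is exposed at all depends on the chart version, so treat the commented line as an assumption to check against the chart's values.yaml:

prometheusOperator:
  admissionWebhooks:
    # Removes the validating/mutating webhooks entirely, at the cost of losing PrometheusRule validation
    enabled: false
    # Alternatively, if your chart version exposes it, keep the webhooks but relax the failure policy:
    # failurePolicy: Ignore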

Jaskaranbir avatar Jun 04 '21 14:06 Jaskaranbir

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Jul 06 '21 09:07 stale[bot]

We are facing the same problem. K8s v1.17 Chart: 16.10.0

tlperini avatar Jul 09 '21 13:07 tlperini

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Aug 08 '21 13:08 stale[bot]

having the same problem here.

k8s v.1.21.2 Chart: 18.0.0

rafaribe avatar Aug 20 '21 11:08 rafaribe

same problem k8s v1.22.1 Chart: 18.0.3 - seems like this started sometime in v17

ruckc avatar Sep 03 '21 11:09 ruckc

Same problem here.

k8s v1.21.4 Chart: 18.0.3

timbrd avatar Sep 05 '21 05:09 timbrd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Oct 05 '21 15:10 stale[bot]

+1

AndrewSav avatar Oct 18 '21 00:10 AndrewSav

I also have this problem

k8s: v1.18.8 helm: v3.6.3

TheGthr avatar Oct 26 '21 13:10 TheGthr

+1

helm: 3.7.1 k8s server: 1.20.7 kubectl: 1.22.3

ghost avatar Nov 17 '21 20:11 ghost

I regularly see this as well

Chart/app: kube-prometheus-stack-18.1.0/0.50.0 Helm: v3.5.4 K8S: 1.18 kubectl: v1.20.2

linxcat avatar Nov 29 '21 17:11 linxcat

+1

helm: v3.5.4 k8s: v1.21.2 kubectl: v1.21.0

yum-dev avatar Dec 01 '21 15:12 yum-dev

Any update here? I am also getting the same error.

infa-mlagad avatar Dec 09 '21 07:12 infa-mlagad

Please check if this helps: https://github.com/prometheus-community/helm-charts/issues/108

infa-mlagad avatar Dec 09 '21 10:12 infa-mlagad

same issue here 👍

K8s version:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

SelimBF avatar Dec 10 '21 20:12 SelimBF

Same on our side.

eksctl: v1.22.4
k8s: v1.21.2
helm: v3.7.2"

vadlungu avatar Dec 16 '21 15:12 vadlungu

@vadlungu if you are trying to install Prometheus, visit my GitHub repo; you will find the solution there.

SelimBF avatar Dec 16 '21 20:12 SelimBF

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Jan 16 '22 00:01 stale[bot]

/remove stale

AndrewSav avatar Jan 16 '22 04:01 AndrewSav

same issue k8s: 1.23 helm: 3.8.0

yozhsh avatar Jan 28 '22 14:01 yozhsh

Hi, thanks @Jaskaranbir for the provided workaround 👍.

For those who want to execute the associated commands, please be aware that they act on all validatingwebhookconfiguration and mutatingwebhookconfiguration resources, which is probably not what you want in most cases (especially if you have a huge Kubernetes cluster).

Pick only the resources associated with the kube-prometheus-stack instead, as explained in comment https://github.com/prometheus-community/helm-charts/issues/108#issuecomment-825689328 (a label-selector variant is also sketched after the commands below).

In my case I executed the following commands.

# Check existing objects
kubectl get validatingwebhookconfiguration -A
kubectl get mutatingwebhookconfiguration -A

# Ignore failures
kubectl get validatingwebhookconfiguration NAME_RETURNED_IN_PREVIOUS_COMMAND -o yaml | sed "s/failurePolicy: Fail/failurePolicy: Ignore/" | kubectl apply -f -
kubectl get mutatingwebhookconfiguration NAME_RETURNED_IN_PREVIOUS_COMMAND -o yaml | sed "s/failurePolicy: Fail/failurePolicy: Ignore/" | kubectl apply -f -

# Perform Helm changes ...

# Revert policy changes
kubectl get validatingwebhookconfiguration NAME_RETURNED_IN_PREVIOUS_COMMAND -o yaml | sed "s/failurePolicy: Ignore/failurePolicy: Fail/" | kubectl apply -f -
kubectl get mutatingwebhookconfiguration NAME_RETURNED_IN_PREVIOUS_COMMAND -o yaml | sed "s/failurePolicy: Ignore/failurePolicy: Fail/" | kubectl apply -f -
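
If you prefer not to copy resource names around, the chart's webhook configurations are normally labelled, so a label selector can do the same in one go. A sketch, assuming the label is app=kube-prometheus-stack-admission (verify the actual label on your cluster first):

# Relax the failure policy for the chart-managed webhook configurations only
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration \
  -l app=kube-prometheus-stack-admission -o yaml \
  | sed "s/failurePolicy: Fail/failurePolicy: Ignore/" \
  | kubectl apply -f -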

bgaillard avatar Feb 07 '22 18:02 bgaillard

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Mar 19 '22 22:03 stale[bot]

/remove stale

AndrewSav avatar Mar 20 '22 00:03 AndrewSav

We are facing the same issue, frequently having to ignore the webhooks to proceed with small upgrades

  • kube-prometheus-stack 34.5.1
  • helm 3.8.0
  • kubernetes v1.22.5

MisterTimn avatar Mar 30 '22 10:03 MisterTimn

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Apr 29 '22 22:04 stale[bot]

/remove stale

AndrewSav avatar Apr 30 '22 12:04 AndrewSav

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar May 31 '22 02:05 stale[bot]

/remove stale

AndrewSav avatar May 31 '22 04:05 AndrewSav

I have faced the same problem after deleting all the Grafana manifests and redeploying them again using:

helm install grafana grafana/grafana --set persistence.enabled=true

SelimBF avatar May 31 '22 09:05 SelimBF

You have to consider that "helm upgrade --install ..." doesn't work in my case, and the pod crashed, by the way.

SelimBF avatar May 31 '22 09:05 SelimBF