KServe and cert-manager webhooks are failing

Open biswajit-9776 opened this issue 11 months ago • 26 comments

While installing Kubeflow using the command:

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Some webhooks could not be reached:

Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.218.186:443: connect: connection refused
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "clusterservingruntime.kserve-webhook-server.validator": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block

[biswa@fedora manifests]$ sudo kubectl get endpoints -n cert-manager cert-manager-webhook
NAME                   ENDPOINTS          AGE
cert-manager-webhook   10.244.0.8:10250   108m

The KServe webhook issue was previously encountered in #2553. Should the changes made in #2627 have prevented this error from recurring? As for the cert-manager webhook, #2585 reported "no route to host", whereas mine fails with "connection refused". It could be a Kubernetes root-level issue or a deeper networking-stack issue, as described in https://cert-manager.io/docs/troubleshooting/webhook/#cause-2-eks-on-a-custom-cni
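For anyone triaging the two errors above, here is a minimal sketch of a check. It rests on an assumption not confirmed in this thread: that "unable to parse bytes as PEM block" commonly means the webhook configuration's caBundle has not been populated yet. The helper name `is_pem_bundle` and the `<name>` placeholder are illustrative only.

```shell
# Succeeds if the base64-encoded argument decodes to data whose first line
# is a PEM certificate header; fails for empty or non-PEM input.
is_pem_bundle() {
  [ -n "$1" ] || return 1
  printf '%s' "$1" | base64 -d 2>/dev/null | head -n 1 \
    | grep -q -- '-----BEGIN CERTIFICATE-----'
}

# Usage against a live cluster (not run here; <name> is a placeholder for
# one of the names listed by `kubectl get validatingwebhookconfigurations`):
#
#   kubectl wait --for=condition=Available deployment/cert-manager-webhook \
#     -n cert-manager --timeout=180s
#   bundle="$(kubectl get validatingwebhookconfiguration <name> \
#     -o jsonpath='{.webhooks[0].clientConfig.caBundle}')"
#   if is_pem_bundle "$bundle"; then
#     echo "caBundle looks populated"
#   else
#     echo "caBundle missing or not PEM"
#   fi
```

If the bundle is empty, waiting for the cert-manager pods to become ready and re-running the apply loop may be enough, but that is a guess rather than a confirmed fix.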

kustomize version:

v5.3.0

My kubectl pods are:

[biswa@fedora manifests]$ sudo kubectl get pods -A
NAMESPACE            NAME                                                              READY   STATUS              RESTARTS       AGE
auth                 dex-5d8fffb998-qq49q                                              1/1     Running             0              94m
cert-manager         cert-manager-5b8f9b9d96-l7vj7                                     1/1     Running             0              94m
cert-manager         cert-manager-cainjector-54f68bfb64-m6x5f                          1/1     Running             0              94m
cert-manager         cert-manager-webhook-f6c8487d6-9x6x4                              1/1     Running             0              94m
istio-system         cluster-local-gateway-7bd9cffcb5-thdkb                            1/1     Running             0              94m
istio-system         configure-kubernetes-oidc-issuer-jwks-in-requestauthenticasxnfl   0/1     Completed           0              94m
istio-system         istio-ingressgateway-666f789ccb-wcqdc                             1/1     Running             0              94m
istio-system         istiod-6cd8c6c59c-htqzn                                           1/1     Running             0              94m
knative-eventing     eventing-controller-688dc8df9f-9fxpp                              1/1     Running             0              94m
knative-eventing     eventing-webhook-8c6cc5bc7-789xh                                  1/1     Running             0              94m
knative-serving      activator-55cd894f6c-dr9q4                                        1/1     Running             8 (36m ago)    94m
knative-serving      autoscaler-76748895b9-shk8t                                       2/2     Running             0              56m
knative-serving      controller-76dcf67d5-7tb5w                                        2/2     Running             0              56m
knative-serving      domain-mapping-f5d4dbc56-pbz5q                                    2/2     Running             0              56m
knative-serving      domainmapping-webhook-6f67684cd8-nlnsf                            2/2     Running             0              55m
knative-serving      net-istio-controller-7bb6fb5f58-tklxs                             2/2     Running             0              55m
knative-serving      net-istio-webhook-7d8476f6-svcjf                                  2/2     Running             0              55m
knative-serving      webhook-d5cbdf855-bzmsx                                           2/2     Running             0              55m
kube-system          coredns-565d847f94-cd9dp                                          1/1     Running             0              96m
kube-system          coredns-565d847f94-lc62z                                          1/1     Running             0              96m
kube-system          etcd-kubeflow-control-plane                                       1/1     Running             0              96m
kube-system          kindnet-qzthr                                                     1/1     Running             0              96m
kube-system          kube-apiserver-kubeflow-control-plane                             1/1     Running             0              96m
kube-system          kube-controller-manager-kubeflow-control-plane                    1/1     Running             0              96m
kube-system          kube-proxy-9zct2                                                  1/1     Running             0              96m
kube-system          kube-scheduler-kubeflow-control-plane                             1/1     Running             0              96m
kubeflow             admission-webhook-deployment-6cf44ffbdb-5m86s                     0/1     ContainerCreating   0              55m
kubeflow             cache-server-7d94c87787-88m4h                                     0/2     Init:0/1            0              55m
kubeflow             centraldashboard-965564b75-6frpk                                  2/2     Running             0              55m
kubeflow             jupyter-web-app-deployment-757976b798-7ngdb                       0/2     Pending             0              55m
kubeflow             katib-controller-64bf8db8bd-nfn2k                                 0/1     ContainerCreating   0              55m
kubeflow             katib-db-manager-6d6885765-tqldd                                  1/1     Running             7 (40m ago)    55m
kubeflow             katib-mysql-db6dc68c-q7hbt                                        1/1     Running             0              55m
kubeflow             katib-ui-64b8f8d78c-vxttm                                         2/2     Running             0              55m
kubeflow             kserve-controller-manager-6df96f6d7c-wwxct                        0/2     ContainerCreating   0              55m
kubeflow             kserve-models-web-app-99849d9f7-rmfhk                             2/2     Running             0              55m
kubeflow             kubeflow-pipelines-profile-controller-59ccbd47b9-7875s            1/1     Running             0              55m
kubeflow             metacontroller-0                                                  1/1     Running             0              94m
kubeflow             metadata-envoy-deployment-5cbbb86fc9-pwpbw                        1/1     Running             0              55m
kubeflow             metadata-grpc-deployment-784b8b5fb4-rqw94                         1/2     CrashLoopBackOff    10 (49s ago)   55m
kubeflow             metadata-writer-844bd5d486-nm2j6                                  2/2     Running             4 (69s ago)    55m
kubeflow             minio-65dff76b66-brflp                                            0/2     Pending             0              55m
kubeflow             ml-pipeline-6c7c86f666-qbs65                                      0/2     PodInitializing     0              55m
kubeflow             ml-pipeline-persistenceagent-85c485f86f-j8qwx                     0/2     PodInitializing     0              55m
kubeflow             ml-pipeline-scheduledworkflow-6448c96f4f-98997                    0/2     PodInitializing     0              55m
kubeflow             ml-pipeline-ui-6db56c647b-b6ksz                                   0/2     Pending             0              55m
kubeflow             ml-pipeline-viewer-crd-5df88b6956-kpt68                           0/2     Pending             0              55m
kubeflow             ml-pipeline-visualizationserver-6d49897f85-p9msj                  0/2     Pending             0              55m
kubeflow             mysql-c999c6c8-phg5s                                              0/2     Pending             0              55m
kubeflow             notebook-controller-deployment-9ffdf65d7-bsn6b                    0/2     PodInitializing     0              55m
kubeflow             profiles-deployment-cbf679dbd-qwskr                               0/3     PodInitializing     0              55m
kubeflow             pvcviewer-controller-manager-d66667b49-mhn4n                      0/3     Pending             0              55m
kubeflow             tensorboard-controller-deployment-7444dc8fcd-gxvfr                0/3     Pending             0              55m
kubeflow             tensorboards-web-app-deployment-78f7c694bf-tp8z9                  0/2     Pending             0              55m
kubeflow             training-operator-69575765df-v9hl4                                1/1     Running             0              55m
kubeflow             volumes-web-app-deployment-6dfccd897d-xklf7                       0/2     Pending             0              55m
kubeflow             workflow-controller-f65c9d9b4-m4f9k                               0/2     PodInitializing     0              55m
local-path-storage   local-path-provisioner-684f458cdd-nvs75                           1/1     Running             0              96m
oauth2-proxy         oauth2-proxy-58d95869bf-5n6l5                                     1/1     Running             0              94m
oauth2-proxy         oauth2-proxy-58d95869bf-684pn                                     1/1     Running             0              94m

biswajit-9776 avatar Mar 20 '24 16:03 biswajit-9776

Can you try with the master branch as well? Please also check whether your install command is up to date with the master branch readme.md, and follow the installation instructions for Kind as closely as possible.

juliusvonkohout avatar Apr 03 '24 14:04 juliusvonkohout

I was able to resolve this by increasing the resources allocated to the machine. Was getting capped out by CPU, maybe you're facing similar?

dnapier avatar Apr 03 '24 20:04 dnapier

Can you try with the master branch as well? Please also check whether your install command is up to date with the master branch readme.md, and follow the installation instructions for Kind as closely as possible.

Hey @juliusvonkohout, yes my local machine's master branch is up to date.

biswajit-9776 avatar Apr 04 '24 14:04 biswajit-9776

@dnapier Hi, I tried to increase CPU resources in the --kubeconfig file but it says there is no resources field in v1alpha4.Node. Could you please tell me what you tried?

biswajit-9776 avatar Apr 04 '24 14:04 biswajit-9776

When I ran kubectl describe nodes, the cpu resources were maxed out. This was being done in a VM, so I simply added more cores to the machine. If you're doing the same and the core speeds are being limited by the host, you could raise the limit as well, but that was not the case for me.
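A small sketch of how the "maxed out" check could be scripted. It assumes the usual layout of the "Allocated resources" section in `kubectl describe nodes` output, where the `cpu` row shows requests followed by a percentage in parentheses. The function name `cpu_request_percent` is made up for illustration.

```shell
# Read `kubectl describe nodes` output on stdin and print, for each node,
# the percentage of allocatable CPU that is already requested.
cpu_request_percent() {
  awk '
    /^Allocated resources:/ { in_alloc = 1; next }
    in_alloc && $1 == "cpu" {
      gsub(/[()%]/, "", $3)   # "(99%)" becomes "99"
      print $3
      in_alloc = 0
    }
  '
}

# Usage against a live cluster (not run here):
#   kubectl describe nodes | cpu_request_percent
```

A value close to 100 would match the situation described above, where pods stay Pending because the scheduler has no CPU left to hand out.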

I encountered another issue following this which was the activator of knative-serving crashing, but I do not believe that is related to the error you're seeing here.

dnapier avatar Apr 04 '24 14:04 dnapier

@dnapier Hi, I tried to increase CPU resources in the --kubeconfig file but it says there is no resources field in v1alpha4.Node. Could you please tell me what you tried?

CC @diegolovison then

juliusvonkohout avatar Apr 08 '24 05:04 juliusvonkohout

Are you using kind with docker ?

diegolovison avatar Apr 08 '24 12:04 diegolovison

Hello guys, I'm facing the same issue. I have to deploy Kubeflow for an internship project, and I have the same problem with Kubeflow v1.8, kustomize v5.3.0, and cert-manager v0.12.1.

After running "while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done", I get this error:

[Screenshot 2024-04-09 151931]

My Kubernetes cluster is running with Tanzu.

ALPHA-1503 avatar Apr 09 '24 13:04 ALPHA-1503

Please just test with Kind as explained in the readme.md in the master branch, to make sure that it is not a Kubernetes issue of your own cluster.

juliusvonkohout avatar Apr 09 '24 14:04 juliusvonkohout

Are you using kind with docker ?

Sorry, I didn't catch that this was addressed to me. Yes in my case, I am using kind with docker. Debian 12 host.

dnapier avatar Apr 09 '24 16:04 dnapier

What is the amount of CPU and memory that you have available? Were you strictly following https://github.com/kubeflow/manifests/#installation

diegolovison avatar Apr 09 '24 16:04 diegolovison

12GB of memory on the system, 8 core processor (Intel(R) Xeon(R) E5-2620).

And yes I was strictly following the installation instructions.

dnapier avatar Apr 09 '24 22:04 dnapier

Please just test with Kind as explained in the readme.md in the master branch, to make sure that it is not a Kubernetes issue of your own cluster.

I already tested the v1.8 on minikube and I'm facing the same issue...

ALPHA-1503 avatar Apr 10 '24 09:04 ALPHA-1503

12GB of memory on the system, 8 core processor (Intel(R) Xeon(R) E5-2620).

I believe you will need to have more resources. I have 20 cores and 36GB of memory

minikube and I'm facing the same issue...

I wasn't able to make it work on Minikube. Only with kind

diegolovison avatar Apr 10 '24 13:04 diegolovison

I've just attempted to install it using a local kind cluster, but it didn't work. I'm encountering another issue... [screenshot: issue-kind-kf]

ALPHA-1503 avatar Apr 10 '24 13:04 ALPHA-1503

I've just attempted to install it using a local kind cluster, but it didn't work. I'm encountering another issue... [screenshot: issue-kind-kf]

That's the exact issue I'm facing, which @diegolovison suggests is caused by a lack of available resources. I'm working on doubling my memory to 24GB to test whether that resolves it. Will update ASAP.

dnapier avatar Apr 10 '24 14:04 dnapier

Interesting.... I managed to install v1.8 on Minikube just now. I'm curious why it's working now. My suspicion is that I might encounter issues installing it on my Tanzu Cluster, perhaps due to a cluster-related problem.

ALPHA-1503 avatar Apr 10 '24 14:04 ALPHA-1503

Interesting.... I managed to install v1.8 on Minikube just now. I'm curious why it's working now. My suspicion is that I might encounter issues installing it on my Tanzu Cluster, perhaps due to a cluster-related problem.

Do you mind sharing your cpu/memory for comparison?

dnapier avatar Apr 10 '24 14:04 dnapier

8 Cores/16G

ALPHA-1503 avatar Apr 10 '24 14:04 ALPHA-1503

minikube with podman worked for me with 16 GB if you strip the example distribution down a bit. Otherwise you might need 32 GB. @diegolovison, we should add the memory and core requirements at the top of the installation instructions for kind.

juliusvonkohout avatar Apr 15 '24 06:04 juliusvonkohout

Do you believe that 32 GB and 20 cores?

diegolovison avatar Apr 15 '24 11:04 diegolovison

Do you believe that 32 GB and 20 cores?

I do not understand your question.

juliusvonkohout avatar Apr 15 '24 13:04 juliusvonkohout

Should we document that 32 GB of RAM and 20 CPU cores are the minimum needed to install Kubeflow locally?

diegolovison avatar Apr 15 '24 13:04 diegolovison

Should we document that 32 GB of RAM and 20 CPU cores are the minimum needed to install Kubeflow locally?

Not that I have a say here, but I think that's a great idea.

dnapier avatar Apr 15 '24 13:04 dnapier

I would go with 16 cores and 32 GB of memory as the recommendation. Or are you sure that 16 cores are not enough? It is possible to get by with far less, but that is then left up to the end user.
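As a rough sketch, a pre-install check against that recommendation could look like this. The function names are hypothetical, the host helpers are Linux-only (`nproc` and `/proc/meminfo`), and memory is rounded up to the next whole GB because the kernel reports slightly less than the installed amount.

```shell
# Recommendation from this thread: 16 cores and 32 GB of memory.
RECOMMENDED_CORES=16
RECOMMENDED_MEM_GB=32

# Succeeds if the given core count ($1) and memory in GB ($2) meet it.
meets_recommendation() {
  [ "$1" -ge "$RECOMMENDED_CORES" ] && [ "$2" -ge "$RECOMMENDED_MEM_GB" ]
}

# Linux-only helpers that read the values for the current host.
host_cores() { nproc; }
host_mem_gb() {
  # MemTotal is in kB; convert to GB and round up to a whole number.
  awk '/^MemTotal:/ { gb = $2 / 1048576; whole = int(gb);
                      if (gb > whole) whole++; print whole }' /proc/meminfo
}

# Usage (not run here):
#   if meets_recommendation "$(host_cores)" "$(host_mem_gb)"; then
#     echo "Host meets the recommendation"
#   else
#     echo "Host is below the recommendation; consider trimming the example"
#   fi
```

A host below these numbers is not guaranteed to fail, since smaller setups were reported to work with a stripped-down example.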

juliusvonkohout avatar Apr 15 '24 13:04 juliusvonkohout

Ok. Sounds good

diegolovison avatar Apr 15 '24 13:04 diegolovison

@biswajit-9776 Please retry with the latest master branch and readme. If you still encounter problems, please open a new issue with our new template.

juliusvonkohout avatar May 16 '24 14:05 juliusvonkohout