Addons getting installed before CNI and nodes are set up
What happened?
Reported internally:
failed to reconcile Addon "default-storage-class": failed to deploy the addon manifests into the cluster: failed to execute '/usr/local/bin/kubectl-1.25 --kubeconfig /tmp/cluster-t8xfkqq88x-addon-default-storage-class-kubeconfig apply --prune --filename /tmp/cluster-t8xfkqq88x-default-storage-class.yaml --selector kubermatic-addon=default-storage-class' for addon default-storage-class of cluster t8xfkqq88x: exit status 1 storageclass.storage.k8s.io/cinder-csi unchanged Error from server (InternalError): error when creating "/tmp/cluster-t8xfkqq88x-default-storage-class.yaml": Internal error occurred: failed calling webhook "validation-webhook.snapshot.storage.k8s.io": failed to call webhook: Post "https://snapshot-validation-service.kube-system.svc:443/volumesnapshot?timeout=2s": dial tcp 10.240.31.108:443: connect: operation not permitted
There is a race condition between cluster setup, CNI application installation, and addon installation, resulting in a potentially high volume of reported but unnecessary errors that eventually go away.
This particular situation with connect: operation not permitted for the validating webhook happens frequently on OpenStack with the Cilium CNI. The root cause is that the kkp-addon-controller tries to apply the default VolumeSnapshotClass before the pods backing the validation-webhook.snapshot.storage.k8s.io webhook (the snapshot-validation-deployment) are available.
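A quick way to confirm the race from inside the user cluster while the errors are being reported (a sketch only; the output depends on timing, and the resource names are taken from the error and events above):
kubectl -n kube-system get deployment snapshot-validation-deployment   # not yet available at this point
kubectl -n kube-system get endpoints snapshot-validation-service       # no endpoints yet, so webhook calls cannot be delivered
kubectl get validatingwebhookconfigurations | grep -i snapshot         # the webhook configuration that rejects the apply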
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16m default-scheduler no nodes available to schedule pods
Warning FailedScheduling 16m default-scheduler no nodes available to schedule pods
Normal Scheduled 12m default-scheduler Successfully assigned kube-system/snapshot-validation-deployment-75c6757f97-6fbtd to v8hp9q68c2-worker-9mfglf-69f5f6dc54-5mk8p
Normal Pulling 12m kubelet Pulling image "harbor.kubermatic-admin.aec.arvato-cloud.de/kubermatic/sig-storage/snapshot-validation-webhook:v6.0.1"
Normal Pulled 12m kubelet Successfully pulled image "harbor.kubermatic-admin.aec.arvato-cloud.de/kubermatic/sig-storage/snapshot-validation-webhook:v6.0.1" in 1.411594135s (7.830546564s including waiting)
Normal Created 12m kubelet Created container snapshot-validation
Normal Started 12m kubelet Started container snapshot-validation
Dynamic admission control rejects the resource because no pods are available to validate it, resulting in 25 errors within 4 minutes.
A similar race condition occurs with connect: connection refused for the Canal CNI.
Another similar race condition produces No agent available when the MachineDeployment is not ready and there are no nodes to run the webhook pods on at all.
Expected behavior
Addons should be installed only after the MachineDeployment and CNI are ready, to reduce the high volume of false alerts.
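As an illustration only (not the controller's actual logic; timeouts are arbitrary), the readiness gate for this particular addon could be expressed with kubectl roughly as:
kubectl wait node --all --for=condition=Ready --timeout=15m
kubectl -n kube-system wait deployment/snapshot-validation-deployment --for=condition=Available --timeout=5m
# ...and only then run the addon's 'kubectl apply --prune --selector kubermatic-addon=...' command from the error above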
How to reproduce the issue?
Create a new OpenStack cluster and observe a high volume of errors about applying the default VolumeSnapshotClass from the default OpenStack addon:
https://github.com/kubermatic/kubermatic/blob/v2.22.3/addons/default-storage-class/snapshot-class.yaml#L18
Similar behaviour can be mimicked by scaling down the MachineDeployment or the webhook deployment:
$ k scale md --replicas 1 -nkube-system fvmzkb87fq-worker-rtc9f
machinedeployment.cluster.k8s.io/fvmzkb87fq-worker-rtc9fz scaled
$ k create -f volumesnapshotclasses.cinder-csi.yaml
Error from server (InternalError): error when creating "volumesnapshotclasses.cinder-csi.yaml": Internal error occurred: failed calling webhook "validation-webhook.snapshot.storage.k8s.io": failed to call webhook: Post "https://snapshot-validation-service.kube-system.svc:443/volumesnapshot?timeout=2s": No agent available
$ k scale -nkube-system deployment/snapshot-validation-deployment --replicas=0
deployment.apps/snapshot-validation-deployment scaled
$ k create -f volumesnapshotclasses.cinder-csi.yaml
Error from server (InternalError): error when creating "volumesnapshotclasses.cinder-csi.yaml": Internal error occurred: failed calling webhook "validation-webhook.snapshot.storage.k8s.io": failed to call webhook: Post "https://snapshot-validation-service.kube-system.svc:443/volumesnapshot?timeout=2s": dial tcp 10.240.21.134:443: connect: operation not permitted
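For completeness, reverting the scale-down makes the same create succeed, which confirms this is purely an ordering problem (assuming the deployment originally had one replica):
$ k scale -nkube-system deployment/snapshot-validation-deployment --replicas=1
$ k rollout status -nkube-system deployment/snapshot-validation-deployment
$ k create -f volumesnapshotclasses.cinder-csi.yaml   # succeeds once the webhook pod is Ready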
How is your environment configured?
- KKP version: 2.22.3
- Shared or separate master/seed clusters?: irrelevant
Provide your KKP manifest here (if applicable)
irrelevant
What cloud provider are you running on?
OpenStack but probably irrelevant
What operating system are you running in your user cluster?
irrelevant
Additional information
Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@kubermatic-bot: Closing this issue.
/reopen
@embik: Reopened this issue.
/remove-lifecycle rotten
Can we postpone addon installation? And if so, what's the event we'd be waiting for? Nodes to be ready? What if the cluster has no nodes? What if your KKP setup needs a custom addon to make things work?
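To make that question more concrete, these are the kinds of signals such a gate could wait on, expressed here as kubectl checks (illustrative only; the CNI DaemonSet name is an assumption and differs per CNI, and none of this is existing controller behaviour):
kubectl get nodes                                                        # at least one node registered and Ready
kubectl -n kube-system rollout status daemonset/cilium --timeout=10m    # CNI rolled out (name varies: cilium, canal, ...)
kubectl -n kube-system get endpoints snapshot-validation-service        # webhook service has ready backends
A cluster that intentionally has no nodes would never satisfy these checks, so a gate like this would probably need a timeout or a per-addon opt-out.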