Addons getting installed before CNI and nodes are set up
What happened?
Reported internally:
failed to reconcile Addon "default-storage-class": failed to deploy the addon manifests into the cluster: failed to execute '/usr/local/bin/kubectl-1.25 --kubeconfig /tmp/cluster-t8xfkqq88x-addon-default-storage-class-kubeconfig apply --prune --filename /tmp/cluster-t8xfkqq88x-default-storage-class.yaml --selector kubermatic-addon=default-storage-class' for addon default-storage-class of cluster t8xfkqq88x: exit status 1 storageclass.storage.k8s.io/cinder-csi unchanged Error from server (InternalError): error when creating "/tmp/cluster-t8xfkqq88x-default-storage-class.yaml": Internal error occurred: failed calling webhook "validation-webhook.snapshot.storage.k8s.io": failed to call webhook: Post "https://snapshot-validation-service.kube-system.svc:443/volumesnapshot?timeout=2s": dial tcp 10.240.31.108:443: connect: operation not permitted
There is a race condition between cluster setup, CNI application installation, and addon installation, resulting in a potentially high volume of reported but unnecessary errors that eventually go away.
This particular situation with connect: operation not permitted for the validating webhook happens frequently on OpenStack with the Cilium CNI. The root cause is that the kkp-addon-controller tries to apply the default VolumeSnapshotClass before the pods backing the validation-webhook.snapshot.storage.k8s.io webhook (the snapshot-validation-deployment) are available.
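A quick way to confirm the race from inside the user cluster while the errors are being reported (a sketch only; the output depends on timing, and the resource names are taken from the error and events above):
kubectl -n kube-system get deployment snapshot-validation-deployment   # not yet available at this point
kubectl -n kube-system get endpoints snapshot-validation-service       # no endpoints yet, so webhook calls cannot be delivered
kubectl get validatingwebhookconfigurations | grep -i snapshot         # the webhook configuration that rejects the apply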
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16m default-scheduler no nodes available to schedule pods
Warning FailedScheduling 16m default-scheduler no nodes available to schedule pods
Normal Scheduled 12m default-scheduler Successfully assigned kube-system/snapshot-validation-deployment-75c6757f97-6fbtd to v8hp9q68c2-worker-9mfglf-69f5f6dc54-5mk8p
Normal Pulling 12m kubelet Pulling image "harbor.kubermatic-admin.aec.arvato-cloud.de/kubermatic/sig-storage/snapshot-validation-webhook:v6.0.1"
Normal Pulled 12m kubelet Successfully pulled image "harbor.kubermatic-admin.aec.arvato-cloud.de/kubermatic/sig-storage/snapshot-validation-webhook:v6.0.1" in 1.411594135s (7.830546564s including waiting)
Normal Created 12m kubelet Created container snapshot-validation
Normal Started 12m kubelet Started container snapshot-validation
Dynamic admission control rejects the resource because no pods are available to validate it, resulting in 25 errors within 4 minutes.
A similar race condition occurs with connect: connection refused for the Canal CNI.
Another similar race condition produces No agent available when the MachineDeployment is not ready and there are no nodes to run the webhook pods on at all.
Expected behavior
Addons should be installed only after the MachineDeployment and CNI are ready, to reduce the high volume of false alerts.
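As an illustration only (not the controller's actual logic; timeouts are arbitrary), the readiness gate for this particular addon could be expressed with kubectl roughly as:
kubectl wait node --all --for=condition=Ready --timeout=15m
kubectl -n kube-system wait deployment/snapshot-validation-deployment --for=condition=Available --timeout=5m
# ...and only then run the addon's 'kubectl apply --prune --selector kubermatic-addon=...' command from the error above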
How to reproduce the issue?
Create a new OpenStack cluster and observe a high volume of errors about applying the default VolumeSnapshotClass from the default OpenStack addon:
https://github.com/kubermatic/kubermatic/blob/v2.22.3/addons/default-storage-class/snapshot-class.yaml#L18
Similar behaviour can be mimicked by scaling down the MachineDeployment or the webhook deployment:
$ k scale md --replicas 1 -nkube-system fvmzkb87fq-worker-rtc9f
machinedeployment.cluster.k8s.io/fvmzkb87fq-worker-rtc9fz scaled
$ k create -f volumesnapshotclasses.cinder-csi.yaml
Error from server (InternalError): error when creating "volumesnapshotclasses.cinder-csi.yaml": Internal error occurred: failed calling webhook "validation-webhook.snapshot.storage.k8s.io": failed to call webhook: Post "https://snapshot-validation-service.kube-system.svc:443/volumesnapshot?timeout=2s": No agent available
$ k scale -nkube-system deployment/snapshot-validation-deployment --replicas=0
deployment.apps/snapshot-validation-deployment scaled
$ k create -f volumesnapshotclasses.cinder-csi.yaml
Error from server (InternalError): error when creating "volumesnapshotclasses.cinder-csi.yaml": Internal error occurred: failed calling webhook "validation-webhook.snapshot.storage.k8s.io": failed to call webhook: Post "https://snapshot-validation-service.kube-system.svc:443/volumesnapshot?timeout=2s": dial tcp 10.240.21.134:443: connect: operation not permitted
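For completeness, reverting the scale-down makes the same create succeed, which confirms this is purely an ordering problem (assuming the deployment originally had one replica):
$ k scale -nkube-system deployment/snapshot-validation-deployment --replicas=1
$ k rollout status -nkube-system deployment/snapshot-validation-deployment
$ k create -f volumesnapshotclasses.cinder-csi.yaml   # succeeds once the webhook pod is Ready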
How is your environment configured?
- KKP version: 2.22.3
- Shared or separate master/seed clusters?: irrelevant
Provide your KKP manifest here (if applicable)
irrelevant
What cloud provider are you running on?
OpenStack but probably irrelevant
What operating system are you running in your user cluster?
irrelevant
Additional information
Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@kubermatic-bot: Closing this issue.
/reopen
@embik: Reopened this issue.
/remove-lifecycle rotten
Can we postpone addon installation? And if so, what's the event we'd be waiting for? Nodes to be ready? What if the cluster has no nodes? What if your KKP setup needs a custom addon to make things work?
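To make that question more concrete, these are the kinds of signals such a gate could wait on, expressed here as kubectl checks (illustrative only; the CNI DaemonSet name is an assumption and differs per CNI, and none of this is existing controller behaviour):
kubectl get nodes                                                        # at least one node registered and Ready
kubectl -n kube-system rollout status daemonset/cilium --timeout=10m    # CNI rolled out (name varies: cilium, canal, ...)
kubectl -n kube-system get endpoints snapshot-validation-service        # webhook service has ready backends
A cluster that intentionally has no nodes would never satisfy these checks, so a gate like this would probably need a timeout or a per-addon opt-out.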