
CNI migration and the current docs state


/kind feature

Already checked #1071 and a few other issues.

It would be great if kOps handled switching the CNI in a safe way. The docs are not quite clear about which options are safe and what kind of disruption to expect: https://github.com/kubernetes/kops/blob/master/docs/networking.md#switching-between-networking-providers. So either the docs need to be updated, or the feature really is missing and needs to be implemented. If kOps does not handle the migration automatically, it would at least be great to add a table of common migration paths with pre-migration, migration, and post-migration steps.

In the current case I want to migrate from a vanilla Calico installation:

  networking:
    calico: {}

to

  networking:
    cilium: {}

Of course, following the docs, I could deactivate the CNI managed by kOps altogether and install a vanilla Cilium myself by setting

  networking:
    cni: {}

but then I would have to manually ensure that the Cilium requirements are met by the kOps-created cluster.
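For reference, going down that road would mean supplying the equivalent configuration myself, e.g. via Helm values roughly like the following (just a sketch; ipam.mode, kubeProxyReplacement and the API server address are assumptions that depend on the cluster and the Cilium chart version, not anything kOps generates):

  # Hypothetical Helm values for a self-managed Cilium on a kOps cluster
  # running with "networking: { cni: {} }" -- adjust to the actual cluster.
  ipam:
    mode: kubernetes             # let Kubernetes assign per-node pod CIDRs
  kubeProxyReplacement: false    # keep kube-proxy unless it is disabled in kOps
  k8sServiceHost: api.internal.my-cluster.example.com   # placeholder API endpoint
  k8sServicePort: 443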

It would be so much better to keep CNI management in kOps, especially as Cilium is now the default CNI. I am also unclear about the meaning of the following in the docs: "It is also possible to switch between CNI providers, but this usually is a disruptive change. kOps will also not clean up any resources left behind by the previous CNI, including the CNI daemonset." Does disruption mean a short downtime, but the change is otherwise safe? Will kOps handle all needed config changes in an already existing cluster, like mounting the eBPF filesystem or populating etcd certificates when selecting a kv store? Or do I have to do those configurations myself, since this is only done when first creating a cluster?
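For context, the kOps cluster spec does expose some of these knobs on the Cilium provider directly, e.g. (field names as documented in the kOps Cilium docs; whether they get reconciled correctly on an existing cluster is exactly my question):

  networking:
    cilium:
      etcdManaged: true     # use a kOps-managed etcd cluster as Cilium's kv store
      enableNodePort: true  # eBPF NodePort / partial kube-proxy replacement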

For my current case: if the Calico-to-Cilium migration is not safe, but I still want to keep CNI management in kOps, would it also be possible to follow the Cilium migration guide (https://docs.cilium.io/en/latest/installation/k8s-install-migration.html) node by node (I have already tried this on a non-kOps cluster)? What would happen if I then set "networking: cilium: {}"? Would a second Cilium be deployed, or could I deploy my initial installation in a way that kOps understands and recognizes?
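For anyone following along, the node-by-node approach in that guide hinges on a per-node CiliumNodeConfig object roughly like the one below (adapted from the Cilium migration docs; the API group and field names may differ between Cilium versions):

  apiVersion: cilium.io/v2alpha1
  kind: CiliumNodeConfig
  metadata:
    namespace: kube-system
    name: cilium-default
  spec:
    nodeSelector:
      matchLabels:
        io.cilium.migration/cilium-default: "true"
    defaults:
      # only write the CNI config (and start handling pods) once Cilium is ready
      write-cni-conf-when-ready: /host/etc/cni/net.d/05-cilium.conflist
      custom-cni-conf: "false"
      cni-chaining-mode: "none"
      cni-exclusive: "true"

The guide then migrates nodes one at a time (cordon, drain, label, restart).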

If somebody can give me a hint on the current state I am also more than happy to do a PR for the docs after my migration :)

rstoermer avatar Aug 22 '24 05:08 rstoermer

Hi @rstoermer! CNI migration is difficult and would require complex orchestration to happen without downtime. At the moment there is no support for any kind of CNI migration in kOps. That being said, you can try to switch to "cni", delete the Calico components and enable Cilium, but you need to do a rolling update with --cloudonly to clean everything up. There will be downtime and maybe some unexpected surprises, so I suggest trying it first on some test cluster(s).
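Roughly, that sequence could look like this (only a sketch, not a supported migration path, so adapt it to your cluster):

  # Step 1: hand networking over to the bare "cni" provider, then apply with
  #         "kops update cluster --yes":
  networking:
    cni: {}
  # Step 2: manually delete the Calico components (daemonset, deployments, CRDs).
  # Step 3: switch the spec to "networking: { cilium: {} }" and apply again.
  # Step 4: "kops rolling-update cluster --cloudonly --yes" to replace all nodes
  #         and clean up leftover networking state.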

hakman avatar Aug 23 '24 14:08 hakman

Hi @hakman, alright, that's what I suspected. Thanks for replying so quickly! I will see what I learn and update the issue then.

rstoermer avatar Aug 28 '24 10:08 rstoermer

@rstoermer Did you find a way to do this migration? Did it involve downtime? I was trying to follow the official Cilium migration doc, but it requires a Cilium CRD, CiliumNodeConfig, to be present, which is not included with the version of Cilium that kOps uses (I am on kOps 1.26.6).

rojanDinc avatar Nov 14 '24 15:11 rojanDinc

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 12 '25 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 14 '25 16:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Apr 13 '25 17:04 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Apr 13 '25 17:04 k8s-ci-robot