
KNI [Kubernetes Networking Interface] Initial Draft KEP

Open MikeZappa87 opened this issue 5 months ago • 23 comments

  • One-line PR description: This is the first draft of the KNI KEP, user stories and additions to be discussed as a community
  • Issue link: https://github.com/kubernetes/enhancements/issues/4410
  • Other comments:

MikeZappa87 avatar Feb 02 '24 22:02 MikeZappa87

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot avatar Feb 02 '24 22:02 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeZappa87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Feb 02 '24 22:02 k8s-ci-robot

This is a mix of too-high-level and too-low-level.

We need:

Further explanation:

We already have a working pod networking architecture. Therefore, there is a strong bias toward not throwing that away and completely replacing it with something else. If you want to convince people to replace the current pod networking architecture, you need to explain why KNI is better, and in order to do that, you have to clearly articulate what problems you see with the current system, and how KNI addresses those problems (and then we can decide whether we agree that those are problems, and whether KNI is a good way to solve those problems).

It's not enough to show that KNI is a good idea. You need to show that throwing away our working code and replacing it with KNI is a better idea than not throwing away our working code is.

(This is basically the same problem that kpng had; they spent too much time writing new code and not enough time trying to explain why we would want to use that code.)

danwinship avatar Feb 05 '24 15:02 danwinship

This is a mix of too-high-level and too-low-level. We need:

Further explanation:

We already have a working pod networking architecture. Therefore, there is a strong bias toward not throwing that away and completely replacing it with something else. If you want to convince people to replace the current pod networking architecture, you need to explain why KNI is better, and in order to do that, you have to clearly articulate what problems you see with the current system, and how KNI addresses those problems (and then we can decide whether we agree that those are problems, and whether KNI is a good way to solve those problems).

It's not enough to show that KNI is a good idea. You need to show that throwing away our working code and replacing it with KNI is a better idea than not throwing away our working code is.

(This is basically the same problem that kpng had; they spent too much time writing new code and not enough time trying to explain why we would want to use that code.)

"not throwing that away and completely replacing it with something else" this is something I will push back against. What specifically are we throwing away? Are you referring to the CNI? The how we go about this is up for debate as well so we would need to wait for that to make a statement.

MikeZappa87 avatar Feb 05 '24 17:02 MikeZappa87

this is something I will push back against

You're missing the point.

Kubernetes pod networking already works. Why do we need KNI then? KNI will presumably be a non-zero amount of work to implement, and have a non-zero chance of introducing bugs. Why do we want to spend our time doing this work rather than working on something else? Why are we willing to risk introducing bugs when we could just do nothing and then not have new bugs?

Specifically. Not "but we need to evolve" or "we can make things better". Why do we want to evolve in this exact direction as opposed to some other direction? What exactly will we be making better?

Honestly, I literally still don't know what your goal with KNI is. I can see what you've done, but I don't know why you've done what you've done (as opposed to, say, CNI 2.0, or Multi-Network).

danwinship avatar Feb 05 '24 21:02 danwinship

Let me write down my thoughts,

Kubernetes has become a magnet for the industry and now everyone wants to run their workloads in Kubernetes. What is worse, these new environments also try to modify the core, without understanding the current model or project state, to absorb their problem domains. This causes panic among the small group of maintainers who know how disruptive this can be, and among people with similar past experiences who saw how these hypes come and go and leave their changes orphaned when they can no longer serve their purpose:

  • AI/ML has some crazy network devices that run terabits per second, but they are tied to the GPUs and use complex libraries to build network topologies that are not related to Kubernetes at all. Kubernetes should just be able to plug existing network devices into pods; this does not even require status. DRA tries to solve this AFAIK
  • Telcos want to create Pods modeled as routers or firewall appliances. Multus solves this problem, but has to rely on parsing JSON in annotations and on the previously mentioned apiserver-dependency complexity
  • Service meshes need to inject sidecars and do some pre and post operations around pod network setup to divert traffic
  • kubevirt-like projects want to build an IaaS over Kubernetes and want to create all the virtual network infrastructure, network tenancy, the Pod-as-a-VM thing ...

I personally feel very strongly that core Kubernetes networking will remain the same. It is a solid and well-known model that works very well; we have a very low number of bugs. It is complex, but not as complex as other projects; try to dive into the networking of an IaaS in a cloud provider or an OSS project, that is complex. However, Kubernetes was always pluggable and extensible, so if people want to do complex things with the network, we should think about how they can do it without breaking core networking. That is the line I think we should explore: can we create a better network interface so all these communities can build on top of Kubernetes without breaking the project?

In Kubernetes networking we don't have a good interface between the network and other components like device plugins, CSI drivers, ... The network is now behind the runtime, and we have been in the CNI 2.0 discussion forever; @squeed had presented some draft about enhancing CRI. New workloads need to do gymnastics with annotations or CRDs and create complex dependencies between components, using the control plane as a communication bus. That makes the network solutions hard to support and unreliable: CreatePod execs a CNI plugin that calls the apiserver and somehow gets rate-limited or hits wrong auth, and fails to create a pod in a crashloop ... Can we do better? Can we solve some of the existing problems?
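To make the shape of the problem concrete: a container runtime invokes a CNI plugin as a bare exec, with the request encoded in environment variables and the network config piped over stdin, per the CNI spec. The sketch below only builds such an invocation (the paths and IDs are hypothetical placeholders); anything the plugin needs beyond this, such as pod annotations, has to come from a separate apiserver roundtrip inside the plugin itself.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// buildCNIAdd sketches the CNI ADD exec protocol: the runtime passes the
// request via CNI_* environment variables and the network configuration
// on stdin. Note the env is only these variables; nothing pod-specific
// beyond the sandbox identity is available to the plugin.
func buildCNIAdd(plugin, containerID, netnsPath, ifname string, conf string) *exec.Cmd {
	cmd := exec.Command(plugin)
	cmd.Env = append(cmd.Env,
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID="+containerID,
		"CNI_NETNS="+netnsPath,
		"CNI_IFNAME="+ifname,
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = strings.NewReader(conf)
	return cmd
}

func main() {
	cmd := buildCNIAdd("/opt/cni/bin/bridge", "abc123",
		"/var/run/netns/pod1", "eth0",
		`{"cniVersion":"1.0.0","name":"podnet","type":"bridge"}`)
	fmt.Println(cmd.Env[0])
}
```

The command is constructed but not run here; a real runtime would execute it and parse the result JSON from the plugin's stdout.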

Any of this has to happen without a Kubernetes 2.0 mindset of "ok, let's throw everything away and start over", or "there are 5 competing standards, let's create a new one" ... We need to dig deeper into what are fundamental problems and what are nice-to-have things. For example:

Multi-network as multiple interfaces has a lot of problems solved with Multus; it's just that the implementation needs to work around our lack of interfaces, and that causes problems. Are those fundamental problems, or is that good enough?

Multi-network as Pods with multiple networks and multitenancy as in an IaaS is, to me, clearly not a problem for Kubernetes core to solve. Kubevirt has demonstrated that, and we already provide the extensions necessary to create that out of core.

AI/ML needs to plug in netdevices; the OCI spec only considers block devices (https://github.com/opencontainers/runtime-spec/issues/1239), which the Linux kernel treats differently.

Service meshes need to hook into CNI to divert traffic. Is that a problem of service meshes not implementing the CNI interface directly, or is it a layering problem in the network provisioning?

As I said, I have a lot of questions, and I can see a lot of requests and solutions that still don't answer my questions. I think we should first get a good understanding of the problems, categorize them, and then think about solutions for those problems.

aojea avatar Feb 07 '24 10:02 aojea

Let me write down my thoughts,

Kubernetes has become a magnet for the industry and now everyone wants to run their workloads in Kubernetes. What is worse, these new environments also try to modify the core, without understanding the current model or project state, to absorb their problem domains. This causes panic among the small group of maintainers who know how disruptive this can be, and among people with similar past experiences who saw how these hypes come and go and leave their changes orphaned when they can no longer serve their purpose:

  • AI/ML has some crazy network devices that run terabits per second, but they are tied to the GPUs and use complex libraries to build network topologies that are not related to Kubernetes at all. Kubernetes should just be able to plug existing network devices into pods; this does not even require status. DRA tries to solve this AFAIK
  • Telcos want to create Pods modeled as routers or firewall appliances. Multus solves this problem, but has to rely on parsing JSON in annotations and on the previously mentioned apiserver-dependency complexity
  • Service meshes need to inject sidecars and do some pre and post operations around pod network setup to divert traffic
  • kubevirt-like projects want to build an IaaS over Kubernetes and want to create all the virtual network infrastructure, network tenancy, the Pod-as-a-VM thing ...

I personally feel very strongly that core Kubernetes networking will remain the same. It is a solid and well-known model that works very well; we have a very low number of bugs. It is complex, but not as complex as other projects; try to dive into the networking of an IaaS in a cloud provider or an OSS project, that is complex. However, Kubernetes was always pluggable and extensible, so if people want to do complex things with the network, we should think about how they can do it without breaking core networking. That is the line I think we should explore: can we create a better network interface so all these communities can build on top of Kubernetes without breaking the project?

In Kubernetes networking we don't have a good interface between the network and other components like device plugins, CSI drivers, ... The network is now behind the runtime, and we have been in the CNI 2.0 discussion forever; @squeed had presented some draft about enhancing CRI. New workloads need to do gymnastics with annotations or CRDs and create complex dependencies between components, using the control plane as a communication bus. That makes the network solutions hard to support and unreliable: CreatePod execs a CNI plugin that calls the apiserver and somehow gets rate-limited or hits wrong auth, and fails to create a pod in a crashloop ... Can we do better? Can we solve some of the existing problems?

Any of this has to happen without a Kubernetes 2.0 mindset of "ok, let's throw everything away and start over", or "there are 5 competing standards, let's create a new one" ... We need to dig deeper into what are fundamental problems and what are nice-to-have things. For example:

Multi-network as multiple interfaces has a lot of problems solved with Multus; it's just that the implementation needs to work around our lack of interfaces, and that causes problems. Are those fundamental problems, or is that good enough?

Multi-network as Pods with multiple networks and multitenancy as in an IaaS is, to me, clearly not a problem for Kubernetes core to solve. Kubevirt has demonstrated that, and we already provide the extensions necessary to create that out of core.

AI/ML needs to plug in netdevices; the OCI spec only considers block devices (opencontainers/runtime-spec#1239), which the Linux kernel treats differently.

Service meshes need to hook into CNI to divert traffic. Is that a problem of service meshes not implementing the CNI interface directly, or is it a layering problem in the network provisioning?

As I said, I have a lot of questions, and I can see a lot of requests and solutions that still don't answer my questions. I think we should first get a good understanding of the problems, categorize them, and then think about solutions for those problems.

KNI is about getting these conversations moving. However, this should be an effort to act, not just talk. The networking conversations generally fall flat as people become burnt out and eventually just leave. Let's be a proposal of energy, not the opposite.

We agree something is wrong: people are hacking the CNI today, and networking is spread across all three layers (K8s, container runtime, and OCI runtimes). I do believe that we can begin to build a foundational network API that can help resolve some of the issues you mention by moving networking to a single location and providing an API for creating the network namespace, creating the interfaces, configuration, ... and the reverse to delete. We should also provide a hooking mechanism for people to extend; this would be for the service meshes that don't want to be the primary network plugin. I have asked why service meshes don't want to be a network plugin and have always gotten push back, so I don't want to use this proposal to say "hey service meshes, become a network plugin". I also don't want to be in the business of tying existing solutions together and having tentacles reaching into all the various components; that sounds large and impactful.

For KNI, we shouldn't have to toss everything and start from scratch. We should provide a backwards-compatible model where existing network plugins can run while they migrate over to KNI. In fact, the demos I have put together do that, and converting an existing network plugin was less than 30 minutes of work and a couple of lines of code. And benefits were seen quickly: pod setup for the network plugin daemon started almost 95% faster on average, as the CNI binaries could be packaged into the image.

Multi-network vs multi-interface are effectively the same when implemented; however, they have two different mindsets. "Solved with Multus" might be an opinion here, as you mention above the issues with this approach. The network API could potentially be the glue that irons these issues out.

You mention the issues with the current pod creation; what about when the IPAM has no more available IP addresses? This is a problem we should solve, and it's not just "set node to not ready and evict pods"; we need a better approach to say "no more".

You mentioned the OCI spec; this is between the high- and low-level runtimes and goes back to my original point that networking is spread across multiple layers of the stack. Let's consolidate this into one and ensure it works for virtualized and non-virtualized runtimes. The latest change to the POC was actually to support this in a uniform way.

MikeZappa87 avatar Feb 07 '24 15:02 MikeZappa87

folks, these are not user stories, these are developer stories to solve some user stories (that are not specified). I'd like to read about the problems we are solving for the users, how they are solved today, and the existing pain points ... then we evaluate what the best technical solution is. This has the risk of creating a hammer and then thinking everything is a nail

aojea avatar Feb 14 '24 11:02 aojea

folks, these are not user stories, these are developer stories to solve some user stories (that are not specified). I'd like to read about the problems we are solving for the users, how they are solved today, and the existing pain points ... then we evaluate what the best technical solution is. This has the risk of creating a hammer and then thinking everything is a nail

We can certainly update the user stories. However, I want to remain consistent here and have some questions.

Why do some KEPs need user stories and others don't? Most KEPs don't have a distinction between user and developer stories; I feel we should have both, thoughts? If I read back, a lot of "user stories" aren't even properly formatted.

Are you able to provide an example of an acceptable user story? I know you know enough about what KNI is trying to do and the current problems this ecosystem faces (aka don't provide the awk without the fix). Are you wanting us to state more of the obvious? Such as: as a user of a workload, I need the ability to access my pod, so that I can use my workload? Aka a story stating I need an IP address without saying IP address.

MikeZappa87 avatar Feb 14 '24 15:02 MikeZappa87

folks, these are not user stories, these are developer stories to solve some user stories (that are not specified). I'd like to read about the problems we are solving for the users, how they are solved today, and the existing pain points ... then we evaluate what the best technical solution is. This has the risk of creating a hammer and then thinking everything is a nail

We can certainly update the user stories. However, I want to remain consistent here and have some questions.

Why do some KEPs need user stories and others don't? Most KEPs don't have a distinction between user and developer stories. If I read back, a lot of "user stories" aren't even properly formatted.

Are you able to provide an example of an acceptable user story? I know you know enough about what KNI is trying to do and the current problems this ecosystem faces. Are you wanting us to state more of the obvious? Such as: as a user of a workload, I need the ability to access my pod, so that I can use my workload? Aka a story stating I need an IP address without saying IP address.

Valid questions. At least from my perspective, I raised similar concerns in the meeting.

Reading the KEP again now, I am not sure what we are trying to solve. I understand some of the bullets under the "User Stories" section, but what I do not understand is how a developer or a user would benefit from them. And that's the main reason why good user stories could help.

Taking story 2 as an example: "As a cluster operator, I need the ability to determine what networks are available on my node so that upstream components can ensure the pod is scheduled on the appropriate node." What's missing is what the cluster operator is doing now to achieve a similar result. Do they even need to ensure the pod is scheduled on the appropriate node (in the current design)? How would the developer/user experience improve if we implemented KNI? (time? complexity? maintenance? new capabilities that users want?)

I suggest focusing on 2-3 main stories (perhaps the main reasons that brought you to open this KEP) and not trying to tackle a wide range of possible user stories that users might not even need. I think this will shorten the time to initial consensus.

LiorLieberman avatar Feb 14 '24 15:02 LiorLieberman

Why do some KEPs need user stories and others don't? Most KEPs don't have a distinction between user and developer stories; I feel we should have both, thoughts?

There are KEPs that are straightforward and are mostly a description of the problem and the solution: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2595-expanded-dns-config . There are other KEPs that are controversial or can be disruptive, where there are a lot of questions, as we want to be completely sure what problem we are solving and what the consequences, tradeoffs, and alternatives are; see the number of comments on other KEPs.


If I read back, a lot of "user stories" aren't even properly formatted.

I don't stop saying it: "we need reviewers". People want to help and contribute, but I don't see anybody doing reviews. Review is free, and you learn a lot and start to have the context to know why some things are the way they are, because you were in the discussion when it was decided ... If you review a KEP and the user story is not clear, make it clear in your review before it merges ... If you feel that something merged and the assumption turned out to be wrong, then open an issue and fix it; we reverted ClusterCIDR https://github.com/kubernetes/kubernetes/pull/121229 and provided an alternative out of tree because we realized it was not the right thing for the project ...

Let's go to the spirit of the norm of the user story: we don't want to be pedantic as in perfect agile, we just want to know the problem we are trying to solve. Feel free to use the wording and the context you want; the important thing is to be clear about the problem we are solving for the end users, how all these changes are going to benefit Kubernetes users, and what things are going to be improved ... We went through this with the KPNG KEP too: https://github.com/kubernetes/enhancements/pull/3788#issuecomment-1410635947

Are you able to provide an example of an acceptable user story,

fair, let me put an example so we are on the same page. I also want to recommend @danwinship's KEP, which should be a reference for all of us: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3866-nftables-proxy

User story 1

"As a kubernetes user that need to deploy AI/ML or Telco workloads, I need to move inside my Pods some Network interfaces from the host so my applications can use them, an I'd like to have a better UX and reliability using netdevices as the existing one with other type of devices. The existing projects solving this problem, typically multus or some network plugins, have to depend on out of band communications channels like annotations or CRDs and, since the Pod creation is a local and imperative request from the Kubelet to the container runtime through the CRI API, when the runtimes makes the CNI ADD request, this needs one or more additional roundtrips to the apiserver that cause a local process to depend on the control plane, limiting the performance, scalability and reliability of the solution. and making it painful to troubleshoot"

Questions to this user story:

  • Only for existing netdevices on the host, or do we want creation of netdevices too?
  • Only physical, or physical and virtual netdevices?
  • Some of these netdevices require provisioning and configuration; is this part of the API too, or is the netdevice plugin able to do this without more data?
  • Is a netdevice a CNI thing or a container runtime thing? It cannot be the kubelet, because the container runtime creates the network namespace, or can it? Is this simpler or more complex? How do we prove it?

Alternative 1: Device plugin like

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

Problem: runtime spec does not have the concept of netdevice https://github.com/opencontainers/runtime-spec/issues/1239

  • Pros
  • Cons

Alternative 2: DRA

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation Is this good enough to solve all the problems?

  • Pros
  • Cons

Alternative 3: new API

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  netDevices:
  - name: eth0
    hostInterface: enps0
    type: physical
  • Pros
  • Cons

Who consumes the API, and how? Is it the CNI plugin? If not, are the runtimes going to

Alternative 4: NRI plugins

It seems to be implemented only in containerd and CRI-O; what about kata and others, do they need it?

  • Pros
  • Cons

Alternative 5: CNI chaining plugins

we still have the problem of passing the metadata at runtime

  • Pros
  • Cons ...

References:

  • https://github.com/kubernetes/kubernetes/issues/60748

aojea avatar Feb 14 '24 19:02 aojea

Why do some KEPs need user stories and others don't? Most KEPs don't have a distinction between user and developer stories; I feel we should have both, thoughts?

There are KEPs that are straightforward and are mostly a description of the problem and the solution: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2595-expanded-dns-config . There are other KEPs that are controversial or can be disruptive, where there are a lot of questions, as we want to be completely sure what problem we are solving and what the consequences, tradeoffs, and alternatives are; see the number of comments on other KEPs.


If I read back, a lot of "user stories" aren't even properly formatted.

I don't stop saying it: "we need reviewers". People want to help and contribute, but I don't see anybody doing reviews. Review is free, and you learn a lot and start to have the context to know why some things are the way they are, because you were in the discussion when it was decided ... If you review a KEP and the user story is not clear, make it clear in your review before it merges ... If you feel that something merged and the assumption turned out to be wrong, then open an issue and fix it; we reverted ClusterCIDR kubernetes/kubernetes#121229 and provided an alternative out of tree because we realized it was not the right thing for the project ...

Let's go to the spirit of the norm of the user story: we don't want to be pedantic as in perfect agile, we just want to know the problem we are trying to solve. Feel free to use the wording and the context you want; the important thing is to be clear about the problem we are solving for the end users, how all these changes are going to benefit Kubernetes users, and what things are going to be improved ... We went through this with the KPNG KEP too: #3788 (comment)

Are you able to provide an example of an acceptable user story,

fair, let me put an example so we are on the same page. I also want to recommend @danwinship's KEP, which should be a reference for all of us: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/3866-nftables-proxy

User story 1

"As a kubernetes user that need to deploy AI/ML or Telco workloads, I need to move inside my Pods some Network interfaces from the host so my applications can use them, an I'd like to have a better UX and reliability using netdevices as the existing one with other type of devices. The existing projects solving this problem, typically multus or some network plugins, have to depend on out of band communications channels like annotations or CRDs and, since the Pod creation is a local and imperative request from the Kubelet to the container runtime through the CRI API, when the runtimes makes the CNI ADD request, this needs one or more additional roundtrips to the apiserver that cause a local process to depend on the control plane, limiting the performance, scalability and reliability of the solution. and making it painful to troubleshoot"

Questions to this user story:

  • Only for existing netdevices on the host, or do we want creation of netdevices too?
  • Only physical, or physical and virtual netdevices?
  • Some of these netdevices require provisioning and configuration; is this part of the API too, or is the netdevice plugin able to do this without more data?
  • Is a netdevice a CNI thing or a container runtime thing? It cannot be the kubelet, because the container runtime creates the network namespace, or can it? Is this simpler or more complex? How do we prove it?

Alternative 1: Device plugin like

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

Problem: runtime spec does not have the concept of netdevice opencontainers/runtime-spec#1239

  • Pros
  • Cons

Alternative 2: DRA

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation Is this good enough to solve all the problems?

  • Pros
  • Cons

Alternative 3: new API

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-block-volume
spec:
  netDevices:
  - name: eth0
    hostInterface: enps0
    type: physical
  • Pros
  • Cons

Who consumes the API, and how? Is it the CNI plugin? If not, are the runtimes going to

Alternative 4: NRI plugins

It seems to be implemented only in containerd and CRI-O; what about kata and others, do they need it?

  • Pros
  • Cons

...

References:

This is excellent! This is what I have been wanting :-)

MikeZappa87 avatar Feb 14 '24 20:02 MikeZappa87

Can you squash the commits? I find it hard to comment on lines in the KEP, because I don't know which commit to use.

So I include quotes instead below.

the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods.

Can you give examples? I am pretty well aware of how things work (or so I thought), and my interpretation is:

  • "container runtime" examples crio, containerd
  • CNI-plugin, examples Cilium, Calico
  • OCI runtime, examples runc, crun

Is that correct?

And what is "kernel isolated pods"? That I don't know. Is it kata containers for instance (VMs)?

uablrek avatar Feb 20 '24 08:02 uablrek

however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

I guess "the pod" refers to a CNI-plugin agent POD, e.g. calico-node-6x5z8? To me the phrasing intends to imply that general POD creation suffers from this, and would be improved by KNI. That is not true, and IMO you should scratch the

~These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.~

And, the files are not "usually downloaded", they are copied from the init-container to the host fs.

However, I agree that files in the host filesystem should be avoided.

uablrek avatar Feb 20 '24 08:02 uablrek

however the current approach leaves files in the host filesystem such as CNI binaries and CNI configuration. These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.

I guess "the pod" refers to a CNI-plugin agent POD, e.g. calico-node-6x5z8? To me the phrasing intends to imply that general POD creation suffers from this, and would be improved by KNI. That is not true, and IMO you should scratch the

~These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.~

And, the files are not "usually downloaded", they are copied from the init-container to the host fs.

However, I agree that files in the host filesystem should be avoided.

For most cases, the init container downloads the primary plugin, aka flannel or cilium; however, when they leverage the community plugins, those are downloaded. This usually happens at a different time though, usually at container runtime install; the containerd install script does this.

I want to avoid this and have all dependencies inside of the container image and no longer needing to be in the host file system

MikeZappa87 avatar Feb 20 '24 20:02 MikeZappa87

For most cases, the init container downloads the primary plugin, aka flannel or cilium; however, when they leverage the community plugins, those are downloaded. This usually happens at a different time though, usually at container runtime install; the containerd install script does this.

No, they copy their primary plugin and the community plugins they need from their init-container (which usually uses the same image as the "main" one, but with different start-up). The only community plugin that is required on start is "loopback".

Examples of what CNI plugins install besides their primary plugins:

  • Cilium: doesn't install, or need, any community plugins.
  • Kindnet: bridge, host-local, ptp, portmap
  • Calico: bandwidth, host-local, portmap, tuning and (to my surprise) flannel
  • Antrea: bandwidth, portmap
  • Flannel: doesn't install any community plugins, but hangs forever if "portmap" isn't there (a bug)

But none of them downloads anything on installation. I set `ip ro replace default dev lo` to make sure.

But in any case, this is misleading:

~These files are usually downloaded via the init container of the pod after binding, which increases the time for the pod to get to a running state.~

It should be something like:

It would save a second or so on node reboot if cni-plugin init-containers didn't have to copy plugins to the host fs.

uablrek avatar Feb 21 '24 07:02 uablrek

I discovered that you can add sequence diagrams in github with mermaid, and I really like sequence diagrams :smile:

I have created some really simplified ones that describe network setup with and without KNI.

And, I know that KNI can do more than call a CNI-plugin, so please don't start a discussion on that. It's just an example and it must be there for backward compatibility.

I have not worked with a CRI-plugin, so I may have got the interaction with the OCI runtime (runc, crun) all wrong.

Current network setup

```mermaid
sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri->>cni: ADD cmd (exec with files)
activate cni
cni-->>cri: response
deactivate cni
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status
```

Network setup with KNI

```mermaid
sequenceDiagram
participant api as API server
participant kubelet as Kubelet
participant cri as CRI plugin
participant oci as OCI runtime
participant kni as KNI agent
participant cni as CNI plugin
api->>kubelet: Notify POD update
kubelet->>cri: Create POD
activate cri
cri->>oci: Create sandbox
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>kni: Setup network (gRPC)
activate kni
kni-->>api: read objects
kni->>cni: ADD cmd (exec w stdin)
activate cni
cni-->>kni: response
deactivate cni
kni-->>kubelet: response
deactivate kni
kubelet->>cri: Create containers
activate cri
cri->>oci: Create containers
activate oci
oci-->>cri: response
deactivate oci
cri-->>kubelet: response
deactivate cri
kubelet->>api: Update status
```

uablrek avatar Feb 22 '24 10:02 uablrek

For most cases, the init container downloads the primary plugin aka flannel, cilium […]

No, they copy their primary plugin and the community plugins they need from their init-container […] It would save a second or so on node reboot if cni-plugin init-containers wouldn't have to copy plugins to the host fs.

I believe flannel requires the bridge plugin; I just installed it to make sure I wasn't crazy. From the performance testing I have done, KNI is faster, taking a consistent 1 second vs 9-23 seconds for the network pod setup. The pod network setup is faster as well.

MikeZappa87 avatar Feb 22 '24 16:02 MikeZappa87

I discovered that you can add sequence diagrams in github with mermaid, and I really like sequence diagrams 😄 […]

We actually have a backwards-compatible model, with a roadmap to deploy side by side and then with libkni. However, your diagram is pretty much spot on. If you want, I can set up some time with you to go over the migration stories with demos.

MikeZappa87 avatar Feb 22 '24 16:02 MikeZappa87

Can you squash the commits? I find it hard to comment lines in the KEP, because I don't know which commit to use.

So I include quotes instead below.

the container runtime with network namespace creation and CNI plugins and the OCI runtime which does additional network setup for kernel isolated pods.

Can you give examples? I am pretty well aware of how things work (or so I thought), and my interpretation is:

  • "container runtime" examples crio, containerd
  • CNI-plugin, examples Cilium, Calico
  • OCI runtime, examples runc, crun

Is that correct?

And what is "kernel isolated pods"? That I don't know. Is it kata containers for instance (VMs)?

By kernel-isolated pods I was referring to pods that leverage Kata or KubeVirt. In both of those use cases, additional network setup happens after the CNI ADD. This is done in a couple of ways: for Kata, the setup happens through execution of the kata-runtime via containerd/CRI-O. In containerd it happens via StartTask, so it is not obvious that additional networking is taking place.

I can squash the commits. It might also make sense to move the contents to a Google doc?

MikeZappa87 avatar Feb 22 '24 18:02 MikeZappa87

If use cases and design are still being collected, a working group may be more appropriate than a KEP.

BenTheElder avatar Feb 22 '24 20:02 BenTheElder

If use cases and design are still being collected, a working group may be more appropriate than a KEP.

Let's sync up on slack and schedule some time to discuss.

MikeZappa87 avatar Feb 22 '24 20:02 MikeZappa87

Are there any serious KNI use-cases that don't include multi-networking? KNI doesn't enable multiple interfaces, but I think that something like KNI is a prerequisite for K8s multi-net. But to independently motivate both KNI and K8s-multi-net with multi-networking use-cases is very confusing. I hope Antonio's workshop at KubeCon will sort this out (great initiative! But I can't attend myself unfortunately).

But the comment referred to above must be considered if multiple interfaces are handled via KNI: who cleans up when a POD is deleted?

Obviously, someone who knows which interfaces are in the POD. So if something other than kubelet calls KNI to add various interfaces, kubelet would be unaware of them, and there is a problem.

uablrek avatar Feb 23 '24 11:02 uablrek