[feature-request] Direct Attachable NICs for VM-based containers
Background
Kata Containers is an open source container runtime, building lightweight virtual machines that seamlessly plug into the containers ecosystem. It aims to bring the speed of a container and the security of a virtual machine to its users.
As Kata Containers matures, how it interacts with the Kubernetes CNI and connects to the outside network has become increasingly important. This issue covers the current status of the Kata Containers networking model, its pros and cons, and a proposal to further improve it. We'd like to work with the kube-ovn community to implement an optimized network solution for VM-based containers like Kata Containers.
Status
A classic CNI deployment would result in a networking model like below:
Here a pod sits inside a network namespace and connects to the outside world via a veth pair. To work with this networking model, Kata Containers has implemented a TC-based networking model.
Inside the pod network namespace, a tap device tap0_kata is created, and Kata sets up TC mirror rules to copy packets between eth0 and tap0_kata. The eth0 device is a veth pair endpoint whose peer is a veth device attached to the host bridge. So the data flow looks like this:
As we can see, there are as many as five hops before a packet arriving on the host can reach the guest. These network stack traversals are costly, and the architecture needs to be simplified.
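Concretely, the TC mirroring described above boils down to something like the following (a sketch only; the pod netns name $POD_NETNS is a placeholder and the exact filters Kata installs may differ):
# inside the pod network namespace, create the tap device the guest will use
ip netns exec "$POD_NETNS" ip tuntap add dev tap0_kata mode tap
ip netns exec "$POD_NETNS" ip link set tap0_kata up
# redirect everything arriving on eth0 to tap0_kata, and vice versa
ip netns exec "$POD_NETNS" tc qdisc add dev eth0 ingress
ip netns exec "$POD_NETNS" tc filter add dev eth0 parent ffff: \
  protocol all u32 match u8 0 0 action mirred egress redirect dev tap0_kata
ip netns exec "$POD_NETNS" tc qdisc add dev tap0_kata ingress
ip netns exec "$POD_NETNS" tc filter add dev tap0_kata parent ffff: \
  protocol all u32 match u8 0 0 action mirred egress redirect dev eth0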
Proposal
We can see that all Kata needs is a tap device on the host, and it doesn't care how it is created (be it a tuntap, an OVS tap, an ipvtap, or a macvtap). So we can create a simpler architecture and use tap devices (or similar devices), rather than veth pairs, as the pod network setup entry point. Something like:
With this architecture, we can remove the need for the extra network namespace on the host and the veth pair used to connect through it. And since we don't care how the tap device is created, CNI plugins can still keep their implementation details hidden from us.
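For illustration, putting a tap device straight onto the OVS integration bridge could look roughly like this (device and iface-id names are made up; this is only a sketch of the idea, not how kube-ovn implements it):
# create a tap device on the host
ip tuntap add dev tap_pod1 mode tap
ip link set tap_pod1 up
# attach it directly to the integration bridge instead of going through a veth pair
ovs-vsctl add-port br-int tap_pod1 -- \
  set Interface tap_pod1 external_ids:iface-id=pod1.default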
A possible control flow for the direct attachable CNIs:
To make it work, kube-ovn will need to be notified that the CNI ADD command should create a direct attachable network device, and it should return the device's information back to the CRI runtime (e.g., containerd). The CRI runtime can then pass the NIC information to Kata Containers, where it will be handled further.
Please help to review/comment if the proposal is reasonable and doable. Thanks a lot!
Ref: Kata Containers corresponding issue https://github.com/kata-containers/kata-containers/issues/1922
After some investigation: the behavior where kubelet enters the pod netns and inspects the eth0 address is an implementation detail of dockershim. Unfortunately, most of our users still use Docker, so we have to adapt to it on the Kube-OVN side.
The steps will be like this:
- The Pod needs a new annotation to tell kube-ovn to create a tap device rather than the default veth pair. The annotation may look like ovn.kubernetes.io/pod_nic_type=tap, and it can become an installation option to set the default nic type to tap later.
- When CNI ADD is invoked, kube-ovn-cni will read the annotation above to decide the pod nic type. For a tap nic, it will create a tap device, link it to OVS, move it into the Pod netns, and set the ip/mac/route. For compatibility with dockershim, we also need to create a dummy eth0 with the same IP but with its link status down.
- kube-ovn-cni then returns a CNI response with the tap device name in the interface field: https://github.com/containernetworking/cni/blob/v0.8.1/pkg/types/current/types.go#L127
- Then containerd and Kata can use this response to set up their own network.
We know that for Kata the extra netns and the addresses on the tap device are not required, but for other CRIs, especially Docker, these steps are required.
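Roughly, those steps amount to the following (a sketch only; interface names, addresses, and the pod netns are placeholders, and the real logic lives in kube-ovn-cni):
# create the tap device and plug it into OVS
ip tuntap add dev tap_pod1 mode tap
ovs-vsctl add-port br-int tap_pod1 -- \
  set Interface tap_pod1 external_ids:iface-id=pod1.default
# move it into the Pod netns and set the ip/mac/route
ip link set tap_pod1 netns "$POD_NETNS"
ip netns exec "$POD_NETNS" ip link set tap_pod1 address 00:00:00:11:22:33
ip netns exec "$POD_NETNS" ip addr add 10.16.0.10/16 dev tap_pod1
ip netns exec "$POD_NETNS" ip link set tap_pod1 up
ip netns exec "$POD_NETNS" ip route add default via 10.16.0.1
# dummy eth0 with the same IP, left down, so dockershim can still read the address
ip netns exec "$POD_NETNS" ip link add eth0 type dummy
ip netns exec "$POD_NETNS" ip addr add 10.16.0.10/16 dev eth0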
@oilbeater Thanks a lot! I agree that it is better to keep netns and addresses on the tap device for compatibility with other container runtimes.
As for the pod annotation ovn.kubernetes.io/pod_nic_type=tap, can we make it something like ovn.kubernetes.io/pod_nic_direct_attachable=true, as we discussed during the hackathon? The idea is to make the interface general enough to allow VFIO or vhost-user based NICs to be usable in the same workflow.
While kube-ovn only implements tap-based NICs at the moment, we want the interface to be future-proof and allow more possibilities. And kube-ovn or other CNIs can choose to implement more NIC types in the future.
wdyt?
> As for the pod annotation ovn.kubernetes.io/pod_nic_type=tap, can we make it something like ovn.kubernetes.io/pod_nic_direct_attachable=true
@bergwolf we already use this annotation to support the veth and ovs internal port nic types. It's more natural to reuse this annotation, and we can use different annotation values to implement different interface types in the future.
@oilbeater Fair enough. We can make the entire annotation a config option for containerd, so that Kata can request different nic types via the runtime handler config. Something like runtime_cni_annotations = ["annotation ovn.kubernetes.io/pod_nic_type=tap"] for each runtime handler.
@bergwolf as we have discussed, when the tap device is moved into the netns, OVS loses its connection to it. That means we would have to leave the tap device in the host netns; however, that would break other CRIs' assumptions about the network.
Another way is to use an ovs internal port, which can be moved into a netns and has better performance than a veth pair. Can you help provide some guidance on how QEMU can be integrated with an ovs internal port, so that we can check whether this method can work?
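For reference, an ovs internal port is created and moved into a netns roughly like this (names and addresses are placeholders):
# create an internal port on the integration bridge
ovs-vsctl add-port br-int pod1_int -- \
  set Interface pod1_int type=internal external_ids:iface-id=pod1.default
# move it into the pod netns and configure it; OVS keeps forwarding traffic for it
ip link set pod1_int netns "$POD_NETNS"
ip netns exec "$POD_NETNS" ip addr add 10.16.0.10/16 dev pod1_int
ip netns exec "$POD_NETNS" ip link set pod1_int up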
@oilbeater What is special about an ovs internal port? QEMU works well with tap devices on the host. IIUC, an ovs internal port is still a tap device to its users. If so, it should JUST WORK (TM) ;)
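For example, pointing QEMU at a pre-existing host tap device is just something like this (a sketch; the image, MAC, and device names are illustrative):
qemu-system-x86_64 \
  -machine q35,accel=kvm -m 2048 \
  -netdev tap,id=net0,ifname=tap0_kata,script=no,downscript=no \
  -device virtio-net-pci,netdev=net0,mac=00:00:00:11:22:33 \
  -drive file=guest.img,format=qcow2 \
  -nographic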
Any progress?
I would be interested in this feature as well.
Does Kube-OVN provide the functionality to add a veth (or any other interface) to a Subnet? Thanks to https://github.com/kubevirt/macvtap-cni/pull/98 we would then be able to connect to KubeVirt.
I managed to get this working by adding a veth1 to the VMs via macvtap.
ip link add veth0 type veth peer name veth1
ip link set veth0 up
ip link set veth1 up
Then I added veth0 to kube-ovn with the following commands:
# first node
kubectl ko vsctl node1 add-port br-int veth0
kubectl ko vsctl node1 set Interface veth0 external_ids:iface-id=veth0.node1
kubectl ko nbctl lsp-add subnet1 veth0.node1
# second node
kubectl ko vsctl node2 add-port br-int veth0
kubectl ko vsctl node2 set Interface veth0 external_ids:iface-id=veth0.node2
kubectl ko nbctl lsp-add subnet1 veth0.node2