network-binding-plugin: add plugin for vhostuser interfaces.
What this PR does / why we need it:
This design proposal aims to implement a new network binding plugin to support vhostuser interfaces. This will allow a fast userspace datapath when used with a userspace dataplane such as OVS-DPDK or VPP. The proposal also takes into consideration socket-sharing issues between KubeVirt and dataplane pods.
Special notes for your reviewer:
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.
- [X] Design: A design document was considered and is present.
- [X] PR: The PR description is expressive enough and will help future contributors
- [X] Community: Announcement to kubevirt-dev was considered
Release note:
NONE
Hi @bgaussen. Thanks for your PR.
PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig-network /assign
/sig network
I only found this discussion now, so I am very late in offering our experience:
We are productively using a self-built vhost-user management:
- An admission webhook patches all pods with label `kubevirt.io=virt-launcher` to contain a `hostPath` volume (`/run/vpp`, in our case) and corresponding `volumeMounts`.
- A sidecar that patches the XML in a way very similar to this proposal. We add a new PCI subtree for the virtual devices, but I know that we are breaking KubeVirt expectations there. Please note that having `<reconnect enabled="yes"/>` is useful here.
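For reference, libvirt expresses this on the vhostuser `<source>` element. A minimal illustrative fragment (the path and mode below are assumptions for the example, not taken from our setup):

```xml
<interface type='vhostuser'>
  <source type='unix' path='/run/vpp/sock1.sock' mode='client'>
    <reconnect enabled='yes' timeout='10'/>
  </source>
  <model type='virtio'/>
</interface>
```

With `mode='client'`, the hypervisor reconnects to the dataplane's socket after the dataplane restarts.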
In our case, the driver creating the socket also writes a config.json to the same folder, which the sidecar reads to obtain the information it needs.
I will join the meeting on Wednesday.
Hi all,
I just committed modifications to this design proposal PR to focus on the device-plugin-based solution to facilitate socket sharing, as discussed during last week's meeting.
Regards,
Benoit.
Hi all,
A little update: we're currently implementing the Device Plugin for socket sharing. We're implementing the device-info spec to share device information between the Device Plugin and the CNI, and we'll rely on DownwardAPI support in KubeVirt 1.3 to get the network-status annotation in the network binding plugin (as suggested by @ormergi, thx 😁).
I'll update the Design Proposal ASAP.
Hi All,
We made great progress and implemented the Device Plugin (DP), the CNI and the Network Binding Plugin (NBP):
- The DP handles socketDirs where vhostuser sockets will be created, and injects the mount into the compute container requesting the socket device.
- The DP is compliant with the device-info spec, as pointed out by @ormergi, and creates device-plugin devinfo files.
- The CNI also implements the device-info spec. That allows it to get the allocated device spec, including the socketDir path, and to push the spec, through Multus, to the virt-launcher pod annotations (eventually available in the `kubevirt.io/network-info` annotation).
- The NBP retrieves the socketDir path from the annotation via the DownwardAPI and creates the vhostuser interface in the domain XML with the right path to the socket.
Example of this interface:
<interface type='vhostuser'>
<mac address='02:15:7e:00:00:05'/>
<source type='unix' path='/var/lib/sockets/socketDir04/pod6c270ef2f25' mode='server'/>
<target dev='pod6c270ef2f25'/>
<model type='virtio-non-transitional'/>
<driver name='vhost' queues='4' rx_queue_size='1024' tx_queue_size='1024'/>
<alias name='ua-net1'/>
<address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</interface>
where /var/lib/sockets/socketDir04 is a random directory managed by the DP.
So far so good... but...
When we live migrate the VM, a new destination pod is created, with a new socket request to the DP. The socketDir will obviously not be the same as the one on the source pod (e.g. /var/lib/sockets/socketDir09).
But from what I understand, the source domain XML is sent to the destination pod, and qemu tries to start with the source socketDir and fails:
compute {"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"ubuntu-vm1","namespace":"tests","pos":"live-migration-source.go:845","reason":"error encountered during Migrate
ToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: process exited while connecting to monitor: 2024-09-06T13:12:18.279191Z qemu-kvm: -chardev socket,id=charua-net1,path=/var/lib/sockets/
socketDir04/pod6c270ef2f25,server=on: Failed to bind socket to /var/lib/sockets/socketDir04/pod6c270ef2f25: No such file or directory')","timestamp":"2024-09-06T13:12:18.334616Z","uid":"33c51f90-1843-48ed-851f-4cb
11d75af41"}
Do you have any idea how to manage this situation? Is there any plan to implement something through the NBP live-migration methods that could alter the destination domain XML?
As a "temporary" workaround, we intend to use a mutating webhook to inject a shared empty-dir volume in virt-launcher pod spec, that will be shared between NBP and compute containers (mounted in /var/lib/vhost-sockets). NBP will use the info from the kubevirt.io/network-info annotation/downwardAPI to create a link in the shared volume (/var/lib/vhost-sockets/net1 -> /var/lib/sockets/socketDir09).
The vhostuser socket path in domainXML will be /var/lib/vhostsockets/net1/pod6c270ef2f25, insuring that it will be accessible from the compute container and be the same on source and destination pods in live migration case.
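The symlink step of this workaround can be sketched as follows; this is a minimal illustration only (function name and paths are made up for the example, not our actual NBP code):

```python
import os

def link_socket_dir(shared_dir, iface, socket_dir):
    """Create a stable symlink <shared_dir>/<iface> pointing at the
    per-pod socketDir allocated by the device plugin."""
    link = os.path.join(shared_dir, iface)
    if os.path.islink(link):
        os.remove(link)  # drop a stale link from a previous allocation
    os.symlink(socket_dir, link)
    return link
```

The domain XML then always references the socket through the stable link, so the path is identical on the source and destination pods even though the underlying socketDir differs.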
Do you think the NBP spec could be amended to add a shared volume between the compute and NBP sidecar containers?
Thanks,
Benoit.
Hi @bgaussen , well done!
It looks like Kubevirt calls sidecar hooks also at the migration target here
Could you please ensure the NBP sidecar executes at the target?
> Hi @bgaussen , well done!
> It looks like Kubevirt calls sidecar hooks also at the migration target here
> Could you please ensure the NBP sidecar executes at the target?
Yes, the sidecar hooks get executed, and from its logs the sidecar sent back the modified domain XML with the right socketDir on the destination pod.
But from what I understand here, the source domain is used for the MigrateToURI3 libvirt call, with little modification in PrepareDomainForMigration.
Hi All,
After some private discussion with @alicefr, it seems that the migration domain takes precedence over the one generated at pod creation. From what I understand, it could be difficult to fix this behavior, but I will open an issue about it.
In the meantime, for live migration to work, we need the socket paths to be the same on the source and target pods. Hence the symlink mechanism seems a good way to achieve that.
But we need the Network Binding Plugin to access the compute container mount namespace, with two options:

1. A new shared emptyDir volume between the `compute` and NBP sidecar containers. It requires being added to the NBP spec. Could be interesting if other NBPs require such a feature...
2. Enable the `shareProcessNamespace` pod feature on `virt-launcher`. The sidecar container can target one of the `compute` container processes, and access its mount namespace through `/proc/PID/root`, where the NBP can create the symlink. There is already this PR we can benefit from: kubevirt/kubevirt#11845
We already implemented and tested option 2, and it works as expected. The only drawback is that we need to target the virtlogd process. For the other ones (virt-launcher-monitor, virt-launcher, qemu*), /proc/PID/root is not accessible.
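For illustration, option 2 boils down to resolving a path inside another process's mount namespace. The sketch below is a simplified assumption of how a sidecar could do that (function names and paths are made up for the example), not our actual implementation:

```python
import os

def find_pid_by_comm(name):
    """Scan /proc for a visible process whose comm matches `name`.
    Requires shareProcessNamespace so compute PIDs are visible."""
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open("/proc/%s/comm" % entry) as f:
                    if f.read().strip() == name:
                        return int(entry)
            except OSError:
                pass  # process may have exited, or be inaccessible
    return None

def path_in_mount_ns(pid, path):
    """Address an absolute `path` as seen in `pid`'s mount namespace."""
    return "/proc/%d/root%s" % (pid, path)

# e.g. pid = find_pid_by_comm("virtlogd")
#      os.symlink(socket_dir, path_in_mount_ns(pid, "/var/lib/vhost-sockets/net1"))
```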
We can discuss this in a community meeting and/or mailing list.
Benoit.
@bgaussen Hi Benoit, just to let you know your most recent commit was not signed. I know you're still working on this so whenever you next squash or rebase please ensure that all commits are signed off to avoid any issues further down the line. Thanks!
> In the meantime, for live migration to work, we need the socket paths to be the same on the source and target pods. Hence the symlink mechanism seems a good way to achieve that.
I think that shareProcessNamespace is problematic. It flattens the process NS for the whole pod and in practice drops the expected isolation between containers in the pod.
It feels to me like a hack with major implications which are not easy to assess.
I think there is another option which could be explored: using libvirt hooks. Libvirt has an option to mutate the domain configuration on the target through script files, which could answer the need here. In the past, at least in other projects (e.g. oVirt), this direction encountered opposition, but I do not recall the exact reasons.
Libvirt hooks are documented here: https://wiki.archlinux.org/title/Libvirt#Hooks https://libvirt.org/hooks.html
> I think that `shareProcessNamespace` is problematic. It flattens the process NS for the whole pod and in practice drops the expected isolation between containers in the pod. It feels to me like a hack with major implications which are not easy to assess.
I understand this drawback of shareProcessNamespace, but as discussed with @alicefr, if it is to be implemented in PR #11845, we could benefit from it.
If not, we need another way.
> I think there is another option which could be explored: using libvirt hooks. Libvirt has an option to mutate the domain configuration on the target through script files, which could answer the need here. In the past, at least in other projects (e.g. oVirt), this direction encountered opposition, but I do not recall the exact reasons.
> Libvirt hooks are documented here: https://wiki.archlinux.org/title/Libvirt#Hooks https://libvirt.org/hooks.html
I briefly looked at the libvirt qemu hooks; indeed, we could mutate the domain when it is received, before it is started. But we need to figure out how to add the scripts to the compute container...
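As a rough sketch of that direction: a `/etc/libvirt/hooks/qemu` script is invoked with the domain name and operation as arguments, and for the migrate operation it receives the incoming domain XML on stdin and may emit a modified XML on stdout. The rewrite below is a minimal illustration (the stable target directory is an assumption), not a tested hook:

```python
import sys
import xml.etree.ElementTree as ET

def rewrite_vhostuser_paths(domain_xml, new_dir):
    """Point every vhostuser socket at new_dir, keeping the file name."""
    root = ET.fromstring(domain_xml)
    for src in root.findall(".//interface[@type='vhostuser']/source"):
        name = src.get("path", "").rsplit("/", 1)[-1]
        src.set("path", "%s/%s" % (new_dir, name))
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Hook args: (domain name, operation, sub-operation, ...).
    # On migrate, the incoming domain XML is on stdin; the modified
    # XML must go to stdout.
    if len(sys.argv) > 2 and sys.argv[2] == "migrate":
        sys.stdout.write(
            rewrite_vhostuser_paths(sys.stdin.read(),
                                    "/var/lib/vhost-sockets/net1"))
```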
Another way would be to add a dedicated shared dir between the compute container and the hook sidecar binding plugin where we can create the links...
Benoit.
> Another way would be to add a dedicated shared dir between the compute container and the hook sidecar binding plugin where we can create the links...
Thank you for helping me understand this well in the unconference meeting. This one seems a good candidate to me as well, especially since it is so simple. It can easily be implemented with a mutating webhook, though I do agree it is better if we provide it as a formal integration option between the compute and other sidecars.
Was this option challenged by some risk? @alicefr, having a well-defined area where sidecars and the compute can share data seems a nice addition to the integration "protocol". What is your opinion on this option? (Sorry if I missed a previous conversation about this.)
> Was this option challenged by some risk? @alicefr, having a well-defined area where sidecars and the compute can share data seems a nice addition to the integration "protocol". What is your opinion on this option? (Sorry if I missed a previous conversation about this.)

There is already an option to use a PVC with the sidecars; we could modify it to allow an emptyDir as well.
> Was this option challenged by some risk? @alicefr, having a well-defined area where sidecars and the compute can share data seems a nice addition to the integration "protocol". What is your opinion on this option? (Sorry if I missed a previous conversation about this.)

> There is already an option to use a PVC with the sidecars; we could modify it to allow an emptyDir as well.
@alicefr, @EdDev,
Following the unconference meeting and the discussion we had, we implemented the shared emptyDir logic with a mutating webhook.
The binding plugin creates a link (with an immutable path and socket name) in this emptyDir, pointing to the real socket, and uses this link in the domain XML vhostuser interface.
This solves the issue observed with live migration.
This seems indeed to be the best solution, as there is already the basis to enable a new mount of this emptyDir for sidecar containers.
How long is the traffic drop during live migration?
> How long is the traffic drop during live migration?
In our tests, we lose two ping packets during live migration, so around a 2-second drop.
Hi, just wondered what is needed to push this through?
@bgaussen Is this waiting for reviews of updates or is still in progress?
> @bgaussen Is this waiting for reviews of updates or is still in progress?
@aburdenthehand, I updated the design proposal with the latest discussions we had. Waiting for reviews from @alicefr, @EdDev, @fabiand.
did this die?
> did this die?
I hope not! Waiting for community feedback, and on how to proceed further.
By the way,
> This design proposal aims to implement a new network binding plugin to support vhostuser interfaces. This will allow a fast userspace datapath when used with a userspace dataplane such as OVS-DPDK or VPP.
Now that passt supports vhost-user operation, its integration in KubeVirt is in progress (one step here), and that will be another form of fast user-mode data path.
Perhaps this pull request (I didn't thoroughly go through all the commits, though) should mention this peculiarity ("hardware-attached"? "hardware direct"?) to avoid confusion.
> How long is the traffic drop during live migration?
Also by the way, what we do in passt now is to save, migrate, and restore TCP queues using the TCP_REPAIR socket option. I guess one could do something similar with DPDK, eventually (migrating pending queue slots). I'm mentioning this because, as a step to enable passt migration, vhost-user-based migration of virtio-net back-ends was implemented in QEMU (https://patchew.org/QEMU/[email protected]/). A direct back-end could probably do something similar.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Pull requests that are marked with lgtm should receive a review
from an approver within 1 week.
After that period the bot marks them with the label needs-approver-review.
/label needs-approver-review
Hi @bgaussen Thanks for the feature and design.
Can you please check the approach we took for the requested shared volume? https://github.com/kubevirt/kubevirt/pull/14623
The PR refactors the current shared mount so that each container has a dedicated subPath; in the new sidecar volumeMount you will be able to create a folder and your socket in it. Thanks
Hi @oshoval ,
Thank you for the VEP and its implementation!
> Can you please check the approach we took for the requested shared volume? kubevirt/kubevirt#14623
> The PR refactors the current shared mount so that each container has a dedicated subPath; in the new sidecar volumeMount you will be able to create a folder and your socket in it. Thanks
This looks good to me. Thank you for having made it generic for any binding plugin. Adapting our vhostuser binding implementation, getting the CONTAINER_NAME env and creating the sub directory in the hooks socket shared dir should be quite straightforward.
Hi Benoit
https://github.com/kubevirt/kubevirt/pull/14768 was merged
Just need to get the sidecar's mount path plus the new env var CONTAINER_NAME, as you said, to determine the path on the compute container, and to create a new folder on the sidecar that will host the new data-plane socket.
All the info should be in the PR; please let us know if anything is missing or required.
Please note that the path of a Unix socket is limited to 108 bytes (https://linux.die.net/man/7/unix). We have enough extra bytes, and I believe it is known; just in case.
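On the 108-byte limit: `sun_path` in `struct sockaddr_un` is 108 bytes including the terminating NUL on Linux, so effectively 107 characters. A quick sanity check (the helper name is made up, and the example path is the one from this thread):

```python
SUN_PATH_MAX = 108  # bytes, including the trailing NUL on Linux

def fits_sun_path(path):
    """True if `path` fits in sockaddr_un.sun_path (NUL included)."""
    return len(path.encode()) < SUN_PATH_MAX

# Path built from the shared emptyDir and a stable link name.
print(fits_sun_path("/var/lib/vhost-sockets/net1/pod6c270ef2f25"))
```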
Thank you
Thank you @oshoval for your great work.
We'll implement the new env CONTAINER_NAME + dir ASAP in our binding plugin and keep you updated.