
network-binding-plugin: add plugin for vhostuser interfaces.

Open bgaussen opened this issue 1 year ago • 25 comments

What this PR does / why we need it:

This design proposal aims at implementing a new network binding plugin to support vhostuser interfaces. This will allow a fast userspace datapath when used with a userspace dataplane such as OVS-DPDK or VPP. The proposal takes into consideration the socket-sharing issues between the KubeVirt and dataplane pods.

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.

  • [X] Design: A design document was considered and is present.
  • [X] PR: The PR description is expressive enough and will help future contributors
  • [X] Community: Announcement to kubevirt-dev was considered

Release note:

NONE

bgaussen avatar May 22 '24 12:05 bgaussen

Hi @bgaussen. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubevirt-bot avatar May 22 '24 12:05 kubevirt-bot

/sig-network /assign

EdDev avatar May 22 '24 14:05 EdDev

/sig network

EdDev avatar Jun 02 '24 09:06 EdDev

I only found this discussion now, so I am very late in offering our experience:

We are productively using a self-built vhost-user management stack:

  1. An admission webhook patches all pods with label kubevirt.io=virt-launcher to contain a hostPath volume (/run/vpp, in our case) and corresponding volumeMounts
  2. A sidecar that patches the XML in a way very similar to this proposal. We add a new PCI subtree for the virtual devices, but I know that we are breaking KubeVirt expectations there. Please note that having <reconnect enabled="yes"/> is useful here.

In our case, the driver creating the socket also creates a config.json in the same folder that is read by the sidecar to communicate information to the sidecar.

I will join the meeting on Wednesday.

toelke avatar Jun 17 '24 10:06 toelke

Hi all,

I just committed modifications to this design proposal PR to focus on the device-plugin-based solution to facilitate socket sharing, as discussed during last week's meeting.

Regards,

Benoit.

bgaussen avatar Jun 26 '24 13:06 bgaussen

Hi all,

A little update: we're currently implementing the Device Plugin for socket sharing. We're implementing the device-info spec to share device information between the Device Plugin and the CNI, and we'll rely on DownwardAPI support in KubeVirt 1.3 to get the network-status annotation in the network binding plugin (as suggested by @ormergi, thx 😁).

I'll update the Design Proposal ASAP.

bgaussen avatar Jul 31 '24 12:07 bgaussen

Hi All,

We made great progress and implemented Device Plugin (DP), CNI and Network Binding Plugin (NBP):

  • The DP handles the socketDirs where vhostuser sockets will be created, and injects the mount into the compute container requesting the socket device.
  • The DP is compliant with the device-info spec, as pointed out by @ormergi, and creates the device plugin devinfo files.
  • The CNI also implements the device-info spec. That allows it to get the allocated device spec, including the socketDir path, and enables the CNI to push the spec, through Multus, to the virt-launcher pod annotations (eventually available in the kubevirt.io/network-info annotation).
  • The NBP retrieves the socketDir path from the annotation via the DownwardAPI and creates the vhostuser interface in the domain XML with the right path to the socket.

Example of this interface:

    <interface type='vhostuser'>
      <mac address='02:15:7e:00:00:05'/>
      <source type='unix' path='/var/lib/sockets/socketDir04/pod6c270ef2f25' mode='server'/>
      <target dev='pod6c270ef2f25'/>
      <model type='virtio-non-transitional'/>
      <driver name='vhost' queues='4' rx_queue_size='1024' tx_queue_size='1024'/>
      <alias name='ua-net1'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </interface>

where /var/lib/sockets/socketDir04 is a random directory managed by the DP.
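For reference, the network-info annotation the NBP consumes carries device-info entries shaped like the device-info spec. A minimal sketch of extracting the socket path (the exact kubevirt.io/network-info layout and key names here are assumptions, not the authoritative KubeVirt format; Python is used purely for illustration):

```python
import json

# Illustrative payload shaped like the k8snetworkplumbingwg device-info spec;
# the exact kubevirt.io/network-info layout is an assumption.
network_info = json.dumps({
    "interfaces": [
        {
            "network": "net1",
            "deviceInfo": {
                "type": "vhost-user",
                "version": "1.1.0",
                "vhost-user": {
                    "mode": "server",
                    "path": "/var/lib/sockets/socketDir04/pod6c270ef2f25",
                },
            },
        }
    ]
})

def socket_path(annotation: str, network: str) -> str:
    """Return the vhost-user socket path for the given network name."""
    for iface in json.loads(annotation)["interfaces"]:
        if iface["network"] == network and iface["deviceInfo"]["type"] == "vhost-user":
            return iface["deviceInfo"]["vhost-user"]["path"]
    raise KeyError(network)

print(socket_path(network_info, "net1"))
```

The path recovered this way is what ends up in the `<source type='unix' path='…'/>` element above.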

So far so good... but...

When we live migrate the VM, a new destination pod is created, with a new socket request to the DP. The socketDir will obviously not be the same as the one on the source pod (e.g. /var/lib/sockets/socketDir09).

But from what I understand, the source domain XML is sent to the destination pod, and qemu tries to start with the source socketDir and fails:

    compute {"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"ubuntu-vm1","namespace":"tests","pos":"live-migration-source.go:845","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: process exited while connecting to monitor: 2024-09-06T13:12:18.279191Z qemu-kvm: -chardev socket,id=charua-net1,path=/var/lib/sockets/socketDir04/pod6c270ef2f25,server=on: Failed to bind socket to /var/lib/sockets/socketDir04/pod6c270ef2f25: No such file or directory')","timestamp":"2024-09-06T13:12:18.334616Z","uid":"33c51f90-1843-48ed-851f-4cb11d75af41"}

Do you have any idea on how to manage this situation? Is there any plan to implement something through the NBP live-migration methods that could alter the destination domain XML?

As a "temporary" workaround, we intend to use a mutating webhook to inject a shared empty-dir volume in the virt-launcher pod spec, shared between the NBP and compute containers (mounted in /var/lib/vhost-sockets). The NBP will use the info from the kubevirt.io/network-info annotation/DownwardAPI to create a link in the shared volume (/var/lib/vhost-sockets/net1 -> /var/lib/sockets/socketDir09). The vhostuser socket path in the domain XML will be /var/lib/vhost-sockets/net1/pod6c270ef2f25, ensuring that it will be accessible from the compute container and be the same on source and destination pods in the live migration case.
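The link step could be sketched as follows (paths and names taken from the example above; Python is used purely for illustration, the actual plugin would do the equivalent in its own language):

```python
import os
import tempfile

def link_socket(shared_dir: str, net_name: str, socket_dir: str) -> str:
    """Create a stable symlink <shared_dir>/<net_name> -> <socket_dir> so the
    domain XML can reference a path that is identical on source and target pods."""
    link = os.path.join(shared_dir, net_name)
    if os.path.islink(link):
        os.remove(link)  # refresh on re-run (e.g. a new socketDir after migration)
    os.symlink(socket_dir, link)
    return link

# Demo with temporary directories standing in for /var/lib/vhost-sockets
# and the DP-managed /var/lib/sockets/socketDirNN.
shared = tempfile.mkdtemp()
real = tempfile.mkdtemp()
open(os.path.join(real, "pod6c270ef2f25"), "w").close()  # fake socket file

link = link_socket(shared, "net1", real)
stable_path = os.path.join(link, "pod6c270ef2f25")
print(os.path.exists(stable_path))  # the socket is reachable via the stable path
```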

Do you think the NBP spec could be amended to add a shared volume between the compute and NBP sidecars?

Thanks,

Benoit.

bgaussen avatar Sep 06 '24 15:09 bgaussen

Hi @bgaussen , well done!

It looks like Kubevirt calls sidecar hooks also at the migration target here

Could you please ensure the NBP sidecar executes at the target?

ormergi avatar Sep 08 '24 09:09 ormergi

Hi @bgaussen , well done!

It looks like Kubevirt calls sidecar hooks also at the migration target here

Could you please ensure the NBP sidecar executes at the target?

Yes the sidecar hooks get executed, and from its logs it sent back the modified domainXML with the right socketDir on destination pod.

But from what I understand here, the source domain is used for MigrateToURI3 libvirt call, with little modification in PrepareDomainForMigration.

bgaussen avatar Sep 08 '24 19:09 bgaussen

Hi All,

After some private discussion with @alicefr, it seems that the migration domain takes precedence over the one generated at pod creation. From what I understand, it could be difficult to fix this behavior, but I will open an issue for it.

In the meantime, for live migration to work, we need the socket paths to be the same on the source and target pods. Hence the symlink mechanism seems a good way to achieve that. But we need the Network Binding Plugin to access the compute container mount namespace, with two options:

  1. A new shared empty-dir volume between the compute and NBP sidecar containers. It would need to be added to the NBP spec. Could be interesting if other NBPs require such a feature...

  2. Enable the shareProcessNamespace pod feature on virt-launcher. The sidecar container can target one of the compute container processes and access its mount namespace through /proc/PID/root, where the NBP can create the symlink. There is already this PR we can benefit from: kubevirt/kubevirt#11845

We already implemented and tested option 2, and it works as expected. The only drawback is that we need to target the virtlogd process. For the other ones (virt-launcher-monitor, virt-launcher, qemu*), /proc/PID/root is not accessible.
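The procfs path translation used in option 2 is mechanical; a minimal sketch (the PID discovery of virtlogd is out of scope here and the paths are illustrative):

```python
import os

def compute_ns_path(pid: int, path_in_container: str) -> str:
    """Translate a path inside the compute container's mount namespace into a
    path the sidecar can use via procfs. Requires a shared PID namespace and
    an accessible /proc/<pid>/root (e.g. the virtlogd process)."""
    return os.path.join(f"/proc/{pid}", "root", path_in_container.lstrip("/"))

print(compute_ns_path(42, "/var/lib/vhost-sockets/net1"))
```

The sidecar would then create its symlink under the returned path, landing it inside the compute container's filesystem view.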

We can discuss this in a community meeting and/or mailing list.

Benoit.

bgaussen avatar Sep 20 '24 07:09 bgaussen

@bgaussen Hi Benoit, just to let you know your most recent commit was not signed. I know you're still working on this so whenever you next squash or rebase please ensure that all commits are signed off to avoid any issues further down the line. Thanks!

aburdenthehand avatar Oct 15 '24 11:10 aburdenthehand

In the meantime, for live migration to work, we need the sockets paths be the same on the source and target pods. Hence the symlinks mechanism seems a good way to achieve that.

I think that shareProcessNamespace is problematic. It flattens the process NS across the whole pod and in practice drops an expected isolation of containers in the pod. It feels to me like a hack with major implications which are not easy to assess.

I think there is another option which could be explored: using libvirt hooks. Libvirt does have an option to mutate the domain configuration on the target through script files, which could answer the need here. In the past, at least in other projects (e.g. oVirt), this direction encountered opposition, but I do not recall the exact reasons.

Libvirt hooks are documented here: https://wiki.archlinux.org/title/Libvirt#Hooks https://libvirt.org/hooks.html

EdDev avatar Oct 28 '24 05:10 EdDev

I think that shareProcessNamespace is problematic. It flattens the process NS across the whole pod and in practice drops an expected isolation of containers in the pod. It feels to me like a hack with major implications which are not easy to assess.

I understand this drawback of shareProcessNamespace, but as discussed with @alicefr, if it is to be implemented in PR #11845, we could benefit from it. If not, we need another way.

I think there is another option which could be explored: using libvirt hooks. Libvirt does have an option to mutate the domain configuration on the target through script files, which could answer the need here. In the past, at least in other projects (e.g. oVirt), this direction encountered opposition, but I do not recall the exact reasons.

Libvirt hooks are documented here: https://wiki.archlinux.org/title/Libvirt#Hooks https://libvirt.org/hooks.html

I briefly looked at the libvirt qemu hooks; indeed we could mutate the domain when it's received, before it's started. But we need to figure out how to add the scripts in the compute container...
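As documented, libvirt invokes /etc/libvirt/hooks/qemu with a "migrate" operation on the incoming side and lets the script rewrite the domain XML by reading stdin and writing to stdout. A minimal sketch of such a hook (the source/target socketDir values are hard-coded assumptions here; a real script would have to discover them):

```python
#!/usr/bin/env python3
# Sketch of /etc/libvirt/hooks/qemu: on incoming migration libvirt runs
#   qemu <guest_name> migrate begin -
# feeding the incoming domain XML on stdin; XML written to stdout replaces it.
import sys
import xml.etree.ElementTree as ET

OLD_DIR = "/var/lib/sockets/socketDir04"  # assumed source-pod socketDir
NEW_DIR = "/var/lib/sockets/socketDir09"  # assumed target-pod socketDir

def rewrite(xml_text: str) -> str:
    """Repoint every vhostuser <source path=...> from OLD_DIR to NEW_DIR."""
    root = ET.fromstring(xml_text)
    for iface in root.iter("interface"):
        if iface.get("type") != "vhostuser":
            continue
        src = iface.find("source")
        if src is not None and src.get("path", "").startswith(OLD_DIR):
            src.set("path", src.get("path").replace(OLD_DIR, NEW_DIR, 1))
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Only act on the migrate operation; pass-through otherwise.
    if len(sys.argv) >= 3 and sys.argv[2] == "migrate":
        sys.stdout.write(rewrite(sys.stdin.read()))
```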

Another way would be to add a dedicated shared dir between the compute container and the hook sidecar binding plugin, where we can create the links...

Benoit.

bgaussen avatar Oct 28 '24 12:10 bgaussen

Another way would be to add a dedicated shared dir between the compute container and the hook sidecar binding plugin, where we can create the links...

Thank you for helping me understand this well in the unconference meeting. This one seems to me a good candidate as well, especially since it is so simple. First, it can be easily implemented with a mutating webhook. I do agree it is better if we provide it as a formal integration option between the compute and other sidecars.

Was this option challenged by some risk? @alicefr, having a well-defined area where sidecars and the compute can share data seems a nice addition to the integration "protocol". What is your opinion on this option? (sorry if I missed a previous conversation about this)

EdDev avatar Oct 31 '24 12:10 EdDev

Was this option challenged by some risk? @alicefr, having a well-defined area where sidecars and the compute can share data seems a nice addition to the integration "protocol". What is your opinion on this option? (sorry if I missed a previous conversation about this)

There is already an option with a PVC and the sidecars; we could modify it to allow an empty dir as well.

alicefr avatar Nov 04 '24 08:11 alicefr

Was this option challenged by some risk? @alicefr, having a well-defined area where sidecars and the compute can share data seems a nice addition to the integration "protocol". What is your opinion on this option? (sorry if I missed a previous conversation about this)

There is already an option with a PVC and the sidecars; we could modify it to allow an empty dir as well.

@alicefr, @EdDev,

Following the unconference meeting and the discussion we had, we implemented the shared empty-dir logic with a mutating webhook.

The binding plugin creates the link (immutable path and socket name) in this empty-dir, pointing to the real socket, and uses this link in the domain XML vhostuser interface.

This solves the issue observed with Live Migration.

This indeed seems to be the best solution, as there is already the basis to enable a new mount for this empty-dir for sidecar containers.
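The webhook mutation described above could be expressed as a JSONPatch returned in the admission response; a minimal sketch (volume name, mount path, and container indices are illustrative assumptions, not the actual implementation):

```python
import json

def shared_socket_patch(compute_idx: int, sidecar_idx: int) -> str:
    """Build a JSONPatch that adds an emptyDir volume shared by the compute
    and NBP sidecar containers of a virt-launcher pod."""
    ops = [
        {"op": "add", "path": "/spec/volumes/-",
         "value": {"name": "vhost-sockets", "emptyDir": {}}},
    ]
    for idx in (compute_idx, sidecar_idx):
        ops.append({"op": "add",
                    "path": f"/spec/containers/{idx}/volumeMounts/-",
                    "value": {"name": "vhost-sockets",
                              "mountPath": "/var/lib/vhost-sockets"}})
    return json.dumps(ops)

print(shared_socket_patch(0, 1))
```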

bgaussen avatar Dec 05 '24 09:12 bgaussen

How long is the drop when using Live Migration?

mw-0 avatar Jan 02 '25 12:01 mw-0

How long is the drop when using Live Migration?

In our tests, we lose 2 ping packets during live migration, so around a 2-second drop.

bgaussen avatar Jan 06 '25 10:01 bgaussen

Hi, Just wondered what was needed to push this through?

mw-0 avatar Jan 14 '25 09:01 mw-0

@bgaussen Is this waiting for reviews of updates or is still in progress?

aburdenthehand avatar Jan 15 '25 14:01 aburdenthehand

@bgaussen Is this waiting for reviews of updates or is still in progress?

@aburdenthehand, I updated the design proposal with the latest discussions we had. Waiting for reviews from @alicefr, @EdDev, @fabiand.

bgaussen avatar Jan 17 '25 15:01 bgaussen

did this die?

mw-0 avatar Feb 27 '25 15:02 mw-0

did this die?

Hope it is not! Waiting for community feedback, and on how to proceed further.

bgaussen avatar Mar 03 '25 10:03 bgaussen

By the way,

This design proposal aims at implementing a new network binding plugin to support vhostuser interfaces. This will allow fast userspace datapath when used with a userspace dataplane like OVS-DPDK or VPP.

Now that passt supports vhost-user operation, the integration in KubeVirt is in progress (one step here), and that will be another form of fast user-mode data path.

Perhaps this pull request (I didn't thoroughly go through all the commits, though) should mention this peculiarity ("hardware-attached"? "hardware direct"?) to avoid confusion.

How long is drop using Live Migration?

Also by the way, what we do in passt now is to save, migrate, and restore TCP queues using the TCP_REPAIR socket option. I guess one could do something similar with DPDK, eventually (migrating pending queue slots). I'm mentioning this because, as a step to enable passt migration, vhost-user-based migration of virtio-net back-ends was implemented in QEMU (https://patchew.org/QEMU/[email protected]/). A direct back-end could probably do something similar.

sbrivio-rh avatar Mar 10 '25 14:03 sbrivio-rh

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign vladikr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-bot avatar Apr 17 '25 16:04 kubevirt-bot

Pull requests that are marked with lgtm should receive a review from an approver within 1 week.

After that period the bot marks them with the label needs-approver-review.

/label needs-approver-review

kubevirt-bot avatar Apr 24 '25 18:04 kubevirt-bot

Hi @bgaussen Thanks for the feature and design.

Can you please check the approach we took for the requested shared volume? https://github.com/kubevirt/kubevirt/pull/14623

The PR refactors the current shared mount so that each container has a dedicated subPath; in the new sidecar volumeMount you will then be able to create a folder and your socket in it. Thanks!

oshoval avatar May 07 '25 07:05 oshoval

Hi @oshoval ,

Thank you for the VEP and its implementation!

Can you please check the approach we took for the requested shared volume? kubevirt/kubevirt#14623

The PR refactors the current shared mount so that each container has a dedicated subPath; in the new sidecar volumeMount you will then be able to create a folder and your socket in it. Thanks!

This looks good to me. Thank you for having made it generic for any binding plugin. Adapting our vhostuser binding implementation, getting the CONTAINER_NAME env var, and creating the subdirectory in the hooks socket shared dir should be quite straightforward.
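The adaptation described here is small; a minimal sketch (the shared mount path and env var value are illustrative, only the CONTAINER_NAME variable itself comes from the referenced PR):

```python
import os
import tempfile

def sidecar_socket_dir(shared_root: str) -> str:
    """Create a per-sidecar directory under the hooks shared mount, named
    after the CONTAINER_NAME env var injected by virt-launcher."""
    name = os.environ["CONTAINER_NAME"]
    path = os.path.join(shared_root, name)
    os.makedirs(path, exist_ok=True)
    return path

# Demo with a temp dir standing in for the shared hook-socket mount.
os.environ["CONTAINER_NAME"] = "hook-sidecar-0"  # assumed value for the demo
d = sidecar_socket_dir(tempfile.mkdtemp())
print(os.path.isdir(d))
```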

bgaussen avatar May 07 '25 20:05 bgaussen

Hi Benoit

https://github.com/kubevirt/kubevirt/pull/14768 was merged. You just need to get the sidecar's mount path plus the new CONTAINER_NAME env var, as you said, to determine the path in the compute container, and to create a new folder on the sidecar side that will host the new dataplane socket. All the info should be in the PR; please let us know if anything is missing or required.

Please note that the path of a Unix socket is limited to 108 bytes (https://linux.die.net/man/7/unix). We have enough extra bytes, and I believe it is known; just mentioning it in case.
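That limit is easy to guard against at plugin startup; a small sketch (SUN_PATH_MAX reflects the Linux sockaddr_un sun_path size, which includes the terminating NUL):

```python
# sun_path in struct sockaddr_un is 108 bytes on Linux, including the
# terminating NUL, so a usable path must encode to fewer than 108 bytes.
SUN_PATH_MAX = 108

def socket_path_fits(path: str) -> bool:
    """Return True if the socket path fits in sun_path with room for the NUL."""
    return len(path.encode()) < SUN_PATH_MAX

print(socket_path_fits("/var/lib/vhost-sockets/net1/pod6c270ef2f25"))
```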

Thank you

oshoval avatar May 29 '25 06:05 oshoval

Thank you @oshoval for your great work.

We'll implement the new CONTAINER_NAME env var + directory ASAP in our binding plugin and keep you updated.

bgaussen avatar Jun 02 '25 08:06 bgaussen