cloud-hypervisor
pvmemcontrol: control guest physical memory properties
I'm working on memory passthrough for lightweight VMs. We've come up with an approach that is guest-driven and proactively keeps the VM slim. Pvmemcontrol is the name of the device/driver pair that communicates between the guest and the VMM to control the host backing of guest memory.
Yuanchu Xie [email protected] Pasha Tatashin [email protected] @soleen
Pvmemcontrol provides a way for the guest to control its physical memory properties, and enables optimizations and security features. For example, the guest can tell the host which parts of a hugepage may be left unbacked, or that sensitive data must not be swapped out.
Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT, as well as some other properties of the host memory mappings that back the guest's memory. This relies on the KVM_CAP_SYNC_MMU capability. When this capability is available, changes in the host backing of a memory region are automatically reflected in the guest. For example, an mmap() or madvise() that affects the region is made visible immediately.
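To illustrate the host side of this, here is a minimal sketch of how a VMM might apply a guest request to the mapping that backs guest memory. It assumes a single contiguous RAM region and a DONTNEED-style request. `struct demo_region` and `demo_host_dontneed()` are hypothetical names for illustration and are not taken from the cloud-hypervisor code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MADV_* on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical description of one guest RAM region as the VMM maps it. */
struct demo_region {
    uint8_t *host_base;   /* host virtual address of the mapping */
    uint64_t guest_base;  /* guest physical address it backs */
    size_t len;           /* size of the region in bytes */
};

/*
 * Drop the host backing for guest physical range [gpa, gpa + len).
 * Returns 0 on success or a negative errno value.
 */
int demo_host_dontneed(const struct demo_region *r, uint64_t gpa, uint64_t len)
{
    assert(r != NULL && r->host_base != NULL);

    /* Reject ranges that fall outside the region, without overflowing. */
    if (gpa < r->guest_base || len > r->len ||
        gpa - r->guest_base > r->len - len)
        return -EINVAL;

    /* Translate the guest physical address to a host virtual address. */
    void *hva = r->host_base + (gpa - r->guest_base);
    if (madvise(hva, len, MADV_DONTNEED) != 0)
        return -errno;
    return 0;
}
```

With KVM_CAP_SYNC_MMU, the guest would see the affected pages as freshly zeroed on its next access, without the VMM having to update the guest mappings itself.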
There are two components of the implementation: the guest Linux driver and the Virtual Machine Monitor (VMM) device. A guest-allocated shared buffer is negotiated per-cpu through a few PCI MMIO registers, and the VMM device assigns a unique command to each per-cpu buffer. The guest writes its pvmemcontrol request into the per-cpu buffer, then writes the corresponding command into the command register, which calls into the VMM device to perform the request.
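The handshake described above can be sketched as a small userspace simulation. This is not the driver or device code: the request layout, the `demo_*` names and the way the "device" validates requests are all assumptions made for illustration, and the MMIO trap is modeled as a plain function call.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_CPUS 4

/* Hypothetical request layout; the real one is defined by the guest driver. */
struct demo_req {
    uint64_t func_code;
    uint64_t addr;
    uint64_t length;
    uint64_t ret_errno;   /* filled in by the device */
};

/* Stand-in for the VMM device: maps each command to one per-cpu buffer. */
struct demo_device {
    struct demo_req *buf_for_cmd[DEMO_MAX_CPUS];
    uint32_t next_cmd;
};

/* Negotiation: the guest registers a buffer, the device returns a unique command. */
uint32_t demo_register_buf(struct demo_device *dev, struct demo_req *buf)
{
    assert(dev->next_cmd < DEMO_MAX_CPUS);
    uint32_t cmd = dev->next_cmd++;
    dev->buf_for_cmd[cmd] = buf;
    return cmd;
}

/* Models the write to the command register: the device handles the
 * request synchronously, so the result is ready when the write returns. */
void demo_write_command(struct demo_device *dev, uint32_t cmd)
{
    struct demo_req *req = dev->buf_for_cmd[cmd];
    req->ret_errno = (req->length == 0) ? EINVAL : 0;
}

/* Guest side: fill the per-cpu buffer, then write the command. */
uint64_t demo_guest_call(struct demo_device *dev, uint32_t cmd,
                         struct demo_req *buf, uint64_t func,
                         uint64_t addr, uint64_t length)
{
    buf->func_code = func;
    buf->addr = addr;
    buf->length = length;
    demo_write_command(dev, cmd);
    return buf->ret_errno;   /* no kick and no polling loop needed */
}
```

Because each CPU owns its buffer and command, two CPUs can issue requests without sharing a queue or taking a lock in this model.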
The synchronous per-cpu shared buffer approach avoids the kick and busy-waiting that the guest would have to do with a virtio virtqueue transport.
User API

From userland, the pvmemcontrol guest driver is controlled via an ioctl(2) call. It requires CAP_SYS_ADMIN.
ioctl(fd, PVMEMCONTROL_IOCTL, struct pvmemcontrol_buf *buf);
Guest userland applications can tag VMAs and guest hugepages, or advise the host on how to handle sensitive guest pages.
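A hedged sketch of what a guest userland caller could look like follows. The buffer layout, the request number and the function code below are placeholders (hence the `demo_`/`DEMO_` prefixes), and the device node path is passed in by the caller because it is not stated here. A real program should take these definitions from the driver's UAPI header.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* ASSUMPTION: placeholder layout, request number and function code. */
struct demo_pvmemcontrol_buf {
    uint64_t func_code;
    uint64_t addr;
    uint64_t length;
    uint64_t arg;
};
#define DEMO_PVMEMCONTROL_IOCTL _IOWR('p', 0, struct demo_pvmemcontrol_buf)
#define DEMO_FUNC_DONTNEED 1

/* Build a request covering [addr, addr + len) in the caller's address space. */
void demo_fill_request(struct demo_pvmemcontrol_buf *buf, uint64_t func,
                       void *addr, size_t len)
{
    assert(buf != NULL);
    memset(buf, 0, sizeof(*buf));
    buf->func_code = func;
    buf->addr = (uint64_t)(uintptr_t)addr;
    buf->length = len;
}

/* Open the device node, issue one request, close it again.
 * Returns 0 on success or a negative errno value. */
int demo_pvmemcontrol_call(const char *dev_path,
                           struct demo_pvmemcontrol_buf *buf)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -errno;

    int err = 0;
    if (ioctl(fd, DEMO_PVMEMCONTROL_IOCTL, buf) < 0)
        err = -errno;
    close(fd);
    return err;
}
```

The caller needs CAP_SYS_ADMIN in the guest. Without it, or without the device present, the call is expected to fail with a negative errno value.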
Supported function codes and their use cases:

PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT. The guest can reduce struct page and page table lookup overhead by using hugepages that are backed by smaller pages on the host. These commands allow partial freeing of private guest hugepages to save memory. They also allow kernel memory, such as kernel stacks and task_structs, to be paravirtualized if we expose kernel APIs.
PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not want to share its backing pages. The same goes for PVMEMCONTROL_DONTDUMP, which keeps sensitive pages out of a dump. MLOCK/UNLOCK advises the host that sensitive information must not be swapped out.
PVMEMCONTROL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages, stack guard pages can be handled in the host and memory can be saved in the hugepage.
PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging how guest memory is being mapped on the host.
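One way to picture the function codes above is as a thin dispatch onto host memory-management calls. The sketch below is an assumption inferred from the code names, not the actual device implementation. The `demo_func` values are illustrative, only a subset of the codes is shown, and `hva` is assumed to be an already validated host virtual address.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() and MADV_* on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* ASSUMPTION: illustrative codes; real values live in the driver's header. */
enum demo_func {
    DEMO_DONTNEED,
    DEMO_UNMERGEABLE,
    DEMO_DONTDUMP,
    DEMO_MLOCK,
    DEMO_MUNLOCK,
    DEMO_MPROTECT_NONE,
    DEMO_MPROTECT_R,
    DEMO_MPROTECT_W,
    DEMO_MPROTECT_RW,
};

/* Apply one request to host range [hva, hva + len).
 * Returns 0 on success or a negative errno value. */
int demo_apply(enum demo_func func, void *hva, size_t len)
{
    int rc;

    assert(hva != NULL);
    switch (func) {
    case DEMO_DONTNEED:      rc = madvise(hva, len, MADV_DONTNEED); break;
    case DEMO_UNMERGEABLE:   rc = madvise(hva, len, MADV_UNMERGEABLE); break;
    case DEMO_DONTDUMP:      rc = madvise(hva, len, MADV_DONTDUMP); break;
    case DEMO_MLOCK:         rc = mlock(hva, len); break;
    case DEMO_MUNLOCK:       rc = munlock(hva, len); break;
    case DEMO_MPROTECT_NONE: rc = mprotect(hva, len, PROT_NONE); break;
    case DEMO_MPROTECT_R:    rc = mprotect(hva, len, PROT_READ); break;
    case DEMO_MPROTECT_W:    rc = mprotect(hva, len, PROT_WRITE); break;
    case DEMO_MPROTECT_RW:   rc = mprotect(hva, len, PROT_READ | PROT_WRITE); break;
    default:                 return -EINVAL;  /* unknown function code */
    }
    return rc == 0 ? 0 : -errno;
}
```

Some of these calls can fail for reasons outside the request itself, for example mlock() when RLIMIT_MEMLOCK is exceeded, so a device built this way would need to report the host errno back to the guest.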
Sample program making use of PVMEMCONTROL_DONTNEED: https://github.com/Dummyc0m/pvmemcontrol-user
Previously posted RFC to cloud-hypervisor: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318
LKML posting of Linux guest driver: https://lore.kernel.org/lkml/[email protected]/
If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.
By default the device is not enabled, and I would say this is roughly in the same ballpark as virtio-balloon reporting free pages for the host to madvise away. Would you say that the device should be feature gated?
Refreshed the kernel patches to resolve sparse warnings: https://lore.kernel.org/linux-mm/[email protected]/
A few comments:
- I think this should be gated by a flag and be disabled by default, because kernel code is not yet upstreamed.
- I think you should remove the reference to the prototype in your commit message.
- The device is really simple, and the code is self-contained, so I don't worry about it being overly buggy or anything. I can only speak for myself, but I'm happy to merge experimental code like this to nurture innovation.
I know there is a chicken-and-egg problem. Kernel wants to have some users before merging new code, while user space programs are hesitant to take in new code because kernel code can still change. Having the feature merged but disabled by default seems like a good way forward.
Lastly, I know it is not possible to test this right now, but if we merge this, please plan to add a test case when the kernel changes are merged.
Thanks Liu Wei, I agree on all three remarks, plus testing when the kernel changes are merged. Let me make the changes.
Seems like I missed a few things. Let me actually add the pre-commit hooks to my local setup so I don't forget to run some of the checks every time.
@novakovic please don't push to the existing branch like that. The top commit you pushed is not signed off. It looks like you're making a minor change in numbering. Your patch should be folded into the existing one.
Thank you so much for the pointer Wei, I will be folding this change in.
Changelog:
- Folded @novakovic's change
- Incorporated Wei's review comments
- Rebased on top of main
- Re-tested
I have a small patch to add a new build test. I can post that once this is merged.