
VM with GPU passthrough can't start while memory request < limit

Open caohuilong opened this issue 3 years ago • 3 comments

What happened: The VM with GPU passthrough can't start when the memory request is less than the limit. The VirtualMachine spec looks like this:

spec:
  runStrategy: RerunOnFailure
  template:
    metadata:
      labels:
        kubevirt.io/domain: centos-gpu
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          gpus:
          - deviceName: nvidia.com/TU102_GEFORCE_RTX_2080_TI
            name: gpu1
        machine:
          type: q35
        resources:
          limits:
            memory: 6Gi
          requests:
            memory: 4Gi
      volumes:
      - containerDisk:
          image: xxxx/centos7:v1
          imagePullPolicy: IfNotPresent
        name: containerdisk

The logs of the virt-launcher pod look like this:

compute {"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetectControllers:455","subcomponent":"libvirt","thread":"36","timestamp":"2022-09-14T03:29:32.672000Z"}
compute {"component":"virt-launcher","level":"error","msg":"Unable to read from monitor: Connection reset by peer","pos":"qemuMonitorIORead:494","subcomponent":"libvirt","thread":"216","timestamp":"2022-09-14T03:29:33.804000Z"} 
compute {"component":"virt-launcher","level":"error","msg":"internal error: qemu unexpectedly closed the monitor: 2022-09-14T03:29:33.740944Z qemu-kvm: -device vfio-pci,host=0000:88:00.0,id=ua-gpu-gpu1,bus=pci.5,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory","pos":"qemuProcessReportLogError:2046","subcomponent":"libvirt","thread":"216","timestamp":"2022-09-14T03:29:33.805000Z"}
compute parsing time "2022-09-14T03:29:33.773415Z qemu-kvm" as "2006-01-02 15:04:05.999-0700": cannot parse "T03:29:33.773415Z qemu-kvm" as " "                                             
compute {"component":"virt-launcher","level":"info","msg":"Reaped pid 215 with status 256","pos":"virt-launcher.go:550","timestamp":"2022-09-14T03:29:33.821028Z"}                      
compute {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"centos-gpu","namespace":"default","pos":"manager.go:875","reason":"virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-09-14T03:29:33.740944Z qemu-kvm: -device vfio-pci,host=0000:88:00.0,id=ua-gpu-gpu1,bus=pci.5,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory\n2022-09-14T03:29:33.773415Z qemu-kvm: -device vfio-pci,host=0000:88:00.0,id=ua-gpu-gpu1,bus=pci.5,addr=0x0: vfio 0000:88:00.0: failed to setup container for group 124: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x5642f2946b30, 0x100000000, 0x6e700000, 0x7f7d81800000) = -12 (Cannot allocate memory)')","timestamp":"2022-09-14T03:29:34.008228Z","uid":"10bbe548-8e28-466b-8f8d-46f99d6a4a65"} 
compute {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"centos-gpu","namespace":"default","pos":"server.go:184","reason":"virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-09-14T03:29:33.740944Z qemu-kvm: -device vfio-pci,host=0000:88:00.0,id=ua-gpu-gpu1,bus=pci.5,addr=0x0: VFIO_MAP_DMA failed: Cannot allocate memory\n2022-09-14T03:29:33.773415Z qemu-kvm: -device vfio-pci,host=0000:88:00.0,id=ua-gpu-gpu1,bus=pci.5,addr=0x0: vfio 0000:88:00.0: failed to setup container for group 124: memory listener initialization failed: Region pc.ram: vfio_dma_map(0x5642f2946b30, 0x100000000, 0x6e700000, 0x7f7d81800000) = -12 (Cannot allocate memory)')","timestamp":"2022-09-14T03:29:34.008342Z","uid":"10bbe548-8e28-466b-8f8d-46f99d6a4a65"}

And the kernel log has this:

# dmesg
...
[415942.683580] vfio_pin_pages_remote: RLIMIT_MEMLOCK (5555355648) exceeded 
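The RLIMIT_MEMLOCK value that VFIO checks can be read for any process from procfs. A minimal sketch (using `/proc/self/limits` for illustration; inside the virt-launcher pod you would substitute the QEMU process PID, e.g. from `pgrep qemu-kvm`):

```shell
# Inspect a process's effective locked-memory limit via procfs.
# /proc/self/limits is used here only for illustration; for the failing VM,
# run this against the qemu-kvm PID inside the virt-launcher pod instead.
grep "Max locked memory" /proc/self/limits
```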

What you expected to happen: The VM with GPU passthrough should start normally regardless of whether the memory request equals the limit, just like a VM without GPU passthrough.

How to reproduce it (as minimally and precisely as possible): Use the VM definition above.

Additional context: The VM with GPU passthrough runs normally when the memory request == limit.

Environment:

  • KubeVirt version (use virtctl version): v0.50.0
  • Kubernetes version (use kubectl version): v1.23.4
  • VM or VMI specifications: in the above
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): debian 11
  • Kernel (e.g. uname -a): 5.10.0-16-amd64
  • Install tools: N/A
  • Others: N/A

caohuilong avatar Sep 15 '22 03:09 caohuilong

@booxter Hello, can you look into this problem? I have noticed that the MEMLOCK limit is adjusted for VMs with VFIO devices attached, but I don't know how to fix this.

caohuilong avatar Sep 15 '22 03:09 caohuilong

@caohuilong The way I have been getting around this on KubeVirt 0.54 has been adding SYS_RESOURCE to the virt-launcher pod's securityContext: https://man7.org/linux/man-pages/man7/capabilities.7.html#:~:text=on%20other%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20devices.-,CAP_SYS_RESOURCE,-*%20Use%20reserved%20space
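For reference, the capability addition described above would have roughly the shape below. This is an illustrative fragment only, not a manifest you apply directly: virt-launcher pods are generated by KubeVirt, so in practice the change has to reach the pod through whatever mechanism you use to mutate generated pods.

```yaml
# Illustrative fragment: shows the shape of the securityContext change on the
# virt-launcher pod's compute container, not a directly applicable manifest.
spec:
  containers:
  - name: compute
    securityContext:
      capabilities:
        add:
        - SYS_RESOURCE   # CAP_SYS_RESOURCE allows raising RLIMIT_MEMLOCK
```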

tlehman avatar Sep 18 '22 14:09 tlehman

It looks like this is related to https://github.com/kubevirt/kubevirt/pull/8367 which was merged last month.

To determine whether the limits are the root cause of this issue, can you please re-test after removing only the limits block from this config?
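Concretely, that re-test would use the same VirtualMachine spec with the resources block reduced to:

```yaml
# Same spec as above, with the limits block removed for the re-test.
resources:
  requests:
    memory: 4Gi
```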

usrbinkat avatar Sep 21 '22 14:09 usrbinkat

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Dec 20 '22 14:12 kubevirt-bot

I have the same problem with KubeVirt v0.55.0.

# dmesg
[  +0.001779] vfio_pin_pages_remote: RLIMIT_MEMLOCK (65536) exceeded

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1542940
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1542940
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
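Note that the units differ between the two outputs: `dmesg` reports RLIMIT_MEMLOCK in bytes, while `ulimit -l` reports KiB, so the 65536 in the kernel log is the same limit as the "max locked memory ... 64" line above:

```shell
# RLIMIT_MEMLOCK in the dmesg message is in bytes; ulimit -l prints KiB.
bytes=65536
echo "$(( bytes / 1024 )) KiB"   # 64 KiB, matching "max locked memory (-l) 64"
```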

wllenyj avatar Dec 29 '22 16:12 wllenyj

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot avatar Jan 28 '23 16:01 kubevirt-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot avatar Feb 27 '23 17:02 kubevirt-bot

@kubevirt-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kubevirt-bot avatar Feb 27 '23 17:02 kubevirt-bot