i915-sriov-dkms

Kernel dynamic memory is not released again

Open makoONE opened this issue 1 year ago • 13 comments

Since moving to PVE 8.1 with kernel 6.5, I have noticed that kernel dynamic memory is not released again. Whenever I start a VM that has the GPU assigned and then shut it down, the host's memory usage stays at about the same value as if the VM were still running. A check with smem shows that the memory is no longer attributed to any process but to kernel dynamic memory. With PVE 8.0 and kernel 6.2 I never saw this behavior.

Is anyone else here affected, or does anyone know a solution?

makoONE avatar Feb 02 '24 12:02 makoONE

Just adding that I've experienced the same kernel memory leak with kernel 6.5. For me it only happens when the host uses the iGPU for graphics output (my use case is a regular-use desktop PC with a Windows VM, not a Proxmox server) while virtual functions are enabled – even when no VM is using any virtual functions.

I never planned on passing through my dedicated NVIDIA card to the VM, so I just made sure the host uses the NVIDIA GPU always and never accesses the iGPU. Then virtual functions work fine without leaking kernel memory. I've since moved to kernel 6.6 but I don't know if the issue still persists since I'm happy with my current setup.

brussig-tud avatar Mar 06 '24 09:03 brussig-tud

@makoONE @brussig-tud Jeez, I think I've been struggling with the same issue for the past few weeks. I initially thought it was an LXC problem, so I made a very elaborate post here: https://discuss.linuxcontainers.org/t/lxc-container-in-proxmox-using-90-of-memory-with-all-processed-killed/19389/4

Could you guys read my post and tell me how to check whether kernel dynamic memory is allocated on my system too? (Which command do I use for this?) That way I can figure out whether my problem is the same one you are having.

And if it is, do you have any idea how to disable SR-IOV temporarily? Do I just uninstall the DKMS module, or do I need to undo all the steps?

devedse avatar Mar 19 '24 00:03 devedse

@devedse I'm not super knowledgeable about containers and containerizing things. But if I read your post correctly, you have SR-IOV enabled using this driver, but you don't actually use any virtual functions since you're not passing them on to VMs. Instead, you only use the SR-IOV-enabled GPU from the host OS, since containers technically still run on the host.

So yeah, it very much sounds like you're facing the same issue. You can check your kernel dynamic memory usage using the smem utility:

sudo smem -twk

I don't have any output saved from when I tried, but my "kernel dynamic memory" value once reached 27GB after just running a normal KDE desktop on the iGPU for about 2 hours with this module enabled.

Just dkms remove'ing the module will be enough to disable virtual functions temporarily. I did not have to do anything else to get rid of the memory leak, which pretty much proves that the i915-sriov driver is the culprit. You can always just dkms install it again later on if you need virtual functions back.
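
A minimal sketch of that remove/install round trip, written as a dry run that only prints the commands so they can be reviewed first. The helper name `dkms_toggle_cmd` is made up for illustration, and the module name and version you pass in are assumptions: `dkms status` shows what is actually registered on your system.

```shell
#!/bin/sh
# Print (do not run) the dkms command for temporarily removing or
# re-adding a module. Module name and version are placeholders here;
# take the real values from the output of `dkms status`.
dkms_toggle_cmd() {
    action=$1
    module=$2
    version=$3
    case "$action" in
        remove)
            # --all removes the module for every kernel it was built for
            printf 'dkms remove -m %s -v %s --all\n' "$module" "$version"
            ;;
        install)
            printf 'dkms install -m %s -v %s\n' "$module" "$version"
            ;;
        *)
            echo "usage: dkms_toggle_cmd remove|install MODULE VERSION" >&2
            return 1
            ;;
    esac
}
```

For example, `dkms_toggle_cmd remove i915-sriov-dkms "$VERSION"` prints the removal command for whatever version `dkms status` reported; run the printed line as root and reboot so the stock i915 driver is loaded.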

brussig-tud avatar Mar 19 '24 13:03 brussig-tud

@brussig-tud , that's exactly the answer I was looking for.

So I don't need to remove this from grub:

intel_iommu=on i915.enable_guc=3 i915.max_vfs=7

And also don't need to remove this file:

/etc/sysfs.conf

?

devedse avatar Mar 19 '24 13:03 devedse

@devedse The "vanilla" i915 driver will ignore the max_vfs kernel boot parameter, and the sysfs entry will just silently fail if the driver does not provide the endpoints, so yeah, you can leave them in place.

I don't remember whether just not creating VFs via sysfs was enough to fix the memory leak, or if you also had to set max_vfs=0, or if you had to completely disable GuC scheduling altogether (which should also cause this driver to not leak memory). You can try narrowing it down further like this, but removing the DKMS module will surely prove or disprove the hypothesis that this driver is causing your memory leak and you can leave the other things there in case you need them later.
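
While narrowing this down, it can help to confirm which i915 parameters the running kernel was actually booted with. The following is only a sketch; the helper name `i915_cmdline_params` is hypothetical.

```shell
#!/bin/sh
# Read a kernel command line on stdin and print only the i915.* options,
# one per line. Returns non-zero if no i915 option is present.
i915_cmdline_params() {
    tr ' ' '\n' | grep '^i915\.'
}
```

Typical use would be `i915_cmdline_params < /proc/cmdline`. With the grub line quoted earlier in this thread, it prints `i915.enable_guc=3` and `i915.max_vfs=7`.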

brussig-tud avatar Mar 19 '24 13:03 brussig-tud

@brussig-tud , Thanks for the explanation.

To keep things further on topic, do you know any place to more casually discuss this stuff further? IRC/Discord? I'm curious what you all use SRIOV for.

Edit: Here's the output of smem -twk:

root@proxmox1:~# smem -twk
Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory         10.8G       6.5G       4.3G 
userspace memory              14.6G     774.0M      13.9G 
free memory                    5.7G       5.7G          0 
----------------------------------------------------------
                              31.1G      12.9G      18.2G

So indeed I also seem to be using quite some kernel dynamic memory.
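
To compare this value before starting a VM and after shutting it down, the "kernel dynamic memory" row can be pulled out of the smem output with a small helper. This is a sketch that assumes the column layout shown above; `kernel_dynamic_used` is a hypothetical name.

```shell
#!/bin/sh
# Read `smem -twk` output on stdin and print the "Used" column of the
# "kernel dynamic memory" row (the 4th whitespace-separated field).
kernel_dynamic_used() {
    awk '/^kernel dynamic memory/ { print $4 }'
}
```

Usage would be `smem -twk | kernel_dynamic_used`, run as root. If the value keeps growing across VM start/stop cycles and never drops back, that matches the leak described in this issue.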

devedse avatar Mar 19 '24 14:03 devedse

@devedse "do you know any place to more casually discuss this stuff further? IRC/Discord?" No idea, sorry... As for me, I just need a VM with a working virtualized GPU for cross-platform graphics development. But I don't want to pass through my whole NVIDIA GPU, and passing through the full iGPU usually doesn't work for Windows guests, whereas mapping a virtual function to the VM works really well.

In general, I think SR-IOV is mainly used on NICs as a sort of high-performance ethernet bridge for VMs.

brussig-tud avatar Mar 19 '24 16:03 brussig-tud

@brussig-tud , I just removed the DKMS module and rebooted the system. Now the whole /dev/dri folder seems to be missing though. Am I missing the normal drivers or something else needed to get the Intel N100 working again?

I played around a bit and I found out that reverting to kernel 6.2 seems to solve the issue. Does the 6.5 kernel not actually have an i915 driver included?

devedse avatar Mar 19 '24 21:03 devedse

@devedse I have actually no experience with Proxmox whatsoever, but that seems very unlikely to me (after all every other Debian-based distro usually packages the i915 driver for every officially available kernel version). You can try to modprobe i915 on the 6.5 kernel and see if it tells you something.

Edit: you should definitely check what driver is being assigned to the iGPU using lspci -nnk.
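
The relevant line can be filtered out of the lspci output with a few lines of awk. A sketch only: `gpu_driver_in_use` is a hypothetical helper, and it assumes the iGPU is listed as a "VGA compatible controller" or "Display controller".

```shell
#!/bin/sh
# Read `lspci -nnk` output on stdin and print the kernel driver bound to
# each GPU-class device. Device header lines start with a hex bus
# address; the indented detail lines below them belong to that device.
gpu_driver_in_use() {
    awk '
        /^[0-9a-f]/ { gpu = ($0 ~ /VGA compatible controller|Display controller/) }
        gpu && /Kernel driver in use:/ { print $NF }
    '
}
```

Usage would be `lspci -nnk | gpu_driver_in_use`. An empty result means no driver is bound to the GPU, which would be consistent with a missing /dev/dri; with virtual functions enabled, expect one line per function.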

If everything else fails, keeping the i915-sriov DKMS driver with num_vfs=0 (and potentially with GuC scheduling disabled, i.e. enable_guc=2) might also get rid of the memory leak. If you want fully accelerated hardware media encoding you need HuC firmware loading, so don't use enable_guc=1 or lower, which would be the default if you omit the kernel parameter.

brussig-tud avatar Mar 20 '24 09:03 brussig-tud

Apparently the problem was that the "i915.ko" file seemed to be missing in the modules folder.

I had to reinstall the kernel by doing the following:

dpkg --search /usr/lib/modules/<kernel version directory>

apt-get --reinstall install proxmox-kernel-6.5.13-1-pve-signed

That fixed my issues
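
A quick way to check for this condition before reinstalling is to look for the module file under the kernel's modules directory. This is a sketch: `has_i915_module` is a hypothetical helper, and the default path assumes the usual /lib/modules layout for the running kernel.

```shell
#!/bin/sh
# Succeed if an i915 kernel module file exists below the given modules
# directory (default: the directory of the running kernel). The pattern
# also matches compressed modules such as i915.ko.zst or i915.ko.xz.
has_i915_module() {
    moddir=${1:-/lib/modules/$(uname -r)}
    [ -n "$(find "$moddir" -name 'i915.ko*' 2>/dev/null | head -n 1)" ]
}
```

For example, `has_i915_module || echo "i915.ko is missing for $(uname -r)"` reports the situation described above, in which case reinstalling the kernel package restores the file.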

devedse avatar Mar 20 '24 14:03 devedse

Yes, I also encountered this issue, but after I rolled back the PVE kernel to 6.2.16-20-PVE, memory usage was normal and SR-IOV still worked.

gfgjs avatar Apr 08 '24 21:04 gfgjs

Could this be linked to this issue https://patchwork.kernel.org/project/intel-gfx/patch/BYAPR03MB4168C6D020B750EAF8021731ADE22@BYAPR03MB4168.namprd03.prod.outlook.com/ ?

azerty9971 avatar May 19 '24 02:05 azerty9971

[image] Proxmox 8.2.5 with kernel 6.8.12-2-pve

paulzzh avatar Sep 21 '24 03:09 paulzzh

Can someone test that PR #204 fixes this issue, thanks.

bbaa-bbaa avatar Oct 03 '24 11:10 bbaa-bbaa

No leak in arch 6.11.1 after replacing drivers/gpu/drm/i915/gem/i915_gem_shmem.c with the original 6.11.1-arch1-1 version.

vit0sd avatar Oct 03 '24 16:10 vit0sd

"Can someone test that PR #204 fixes this issue, thanks."

Proxmox 8.2.7 kernel 6.8.12-2-pve Tested for 12 hours. No leak with pr #204 .

paulzzh avatar Oct 03 '24 17:10 paulzzh