open-gpu-kernel-modules

NO SHARED MEMORY FOR YEARS [NVIDIA_UVM] - BASIC FEATURE

Open bioluks opened this issue 1 year ago • 2 comments

NVIDIA Open GPU Kernel Modules Version

550.90.07 (latest)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Multiple Setups (10+), for now on Arch

Kernel Release

multiple ones, right now on "6.9.3-hardened1-1-hardened"

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [ ] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce GTX 1050 Ti

Describe the bug

This is ignored everywhere by NVIDIA employees and devs. Since 2016 we have no solution (I'm sure it was like this even before 2016). Do we need viral tweets and Reddit posts here and there bashing the company so they listen to us at all?

NVIDIA_UVM is not working even when loaded, checked via lsmod, also on a 30 series RTX. "nvidia-modprobe" does nothing. There is no dmesg to show since everything loads successfully. If the VRAM is full there is no backup option (no shared RAM like in Windows systems). We have high end graphics cards with very low VRAM, and it's slowly starting to become a fact they were produced this way on purpose.
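
For anyone who wants to repeat the lsmod check in a script, here is a minimal sketch. It assumes the usual /proc/modules column layout (name, size, reference count, dependents, state, address); the function name `module_status` is made up for this example. Note that a "Live" module only shows that nvidia_uvm is loaded; it says nothing about whether graphics allocations can spill into system RAM.

```python
from typing import Optional


def module_status(modules_text: str, name: str = "nvidia_uvm") -> Optional[dict]:
    """Return size, refcount and state for `name` from /proc/modules-style text.

    Returns None when the module is not listed at all.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        # Assumed columns: name, size, refcount, dependents, state, address
        if len(fields) >= 5 and fields[0] == name:
            # The refcount column can be "-" on kernels built without module unloading
            refcount = int(fields[2]) if fields[2].isdigit() else None
            return {"size": int(fields[1]), "refcount": refcount, "state": fields[4]}
    return None


if __name__ == "__main__":
    try:
        with open("/proc/modules") as fh:
            print(module_status(fh.read()))
    except OSError as exc:
        print(f"could not read /proc/modules: {exc}")
```

A reference count of 0 while a game is running would be consistent with the module being loaded but not used by that application.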

I'm obviously annoyed. It's 2024. All other known GPU brands (AMD, Intel) don't have this issue; shared memory works just fine. It's a basic feature that should just work, just like in Windows. The NVIDIA driver still has the most annoying issues on Linux; we know you don't care about Linux users. Wayland issues, late Optimus support on laptops, etc., you name it. If you hate open source this much, don't publish the driver at all and stop further updates. From now on I will vote with my wallet (I know this won't change anything). The internet is begging you for bug fixes, and your not caring just shows that you all think we have no alternative out there. For anyone here looking for fixes (there are none at the moment), check out:

  1. NVIDIA Forum Post from 2016 about this very issue
  2. Same issue on a 2023 NVIDIA Forum post with details
  3. Someone also raised this issue in the Discussions, but again, dead silence.

No error logs are needed at this point, it's known shared memory (nvidia_uvm - unified shared memory) simply does not work.

If you don't want to buy an expensive GPU from NVIDIA, your only bet is to use Windows so your games/apps do not crash when your VRAM is full. The nvidia_uvm you see in lsmod acts like a placeholder for an empty file. Buy an AMD or Intel GPU for now. Like Linus said, this is the worst company they had to deal with.

So the question is: when will this feature, which you advertise as working, actually start to work?

To Reproduce

Just install the latest proprietary driver and, for once, test the driver yourself as a dev. NVIDIA_UVM does not work, and if it does work for you, you used hidden parameters not known to us. As mentioned below, the nvidia-bug-report.sh script does not work, no matter which parameter is passed.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.sh is not working no matter what I do, tried the safe mode parameter, reboot etc. Of course ran as root. You have bigger issues if this is even hanging. Since I do not know if a bot/AI manages these issues I will upload an empty log.gz file. nvidia-bug-report.log.gz

More Info

You know the problem better than me. Please check the links I posted. Important forum posts like these should at least get an answer.

bioluks avatar Jun 14 '24 13:06 bioluks

I bought multiple NVIDIA cards for a business, and they are in the rubbish bin now. I have to use Windows, or I have to use AMD or Intel cards instead of these rubbish cards. They don't have this feature and they won't any time soon. They don't care about non-profit development. DON'T BUY NVIDIA.

cngkyt avatar Jun 26 '24 21:06 cngkyt

This issue is still getting ignored, like I said before. I wonder how long NVIDIA will dodge enabling a feature we should have had for years. It seems the AI wave made them ignore everything else. We are not even getting answers here.

I don't think anyone will buy the "lacking manpower/budget/time" argument anymore looking at the NVIDIA profits for the last 6 months.

We won't be running Windows servers, there are always alternatives.

bioluks avatar Jul 15 '24 13:07 bioluks

Encountering the same issue as well. I am using KDE 6.1 with Wayland, and launching any 3d game that uses a lot of VRAM causes this issue, works fine on Windows 10.

Error log:

Sep 16 09:11:37 laptop-misha brave[12123]: src/gbm_drv_common.c:131: GBM-DRV error (get_bytes_per_component): Unknown or not supported format: 808530000
Sep 16 09:11:37 laptop-misha brave[12123]: src/gbm_drv_common.c:131: GBM-DRV error (get_bytes_per_component): Unknown or not supported format: 808530000
Sep 16 09:11:37 laptop-misha brave[12123]: src/gbm_drv_common.c:131: GBM-DRV error (get_bytes_per_component): Unknown or not supported format: 808530000
Sep 16 09:11:37 laptop-misha brave[12123]: src/gbm_drv_common.c:131: GBM-DRV error (get_bytes_per_component): Unknown or not supported format: 808530000
Sep 16 09:11:37 laptop-misha brave[12123]: src/gbm_drv_common.c:131: GBM-DRV error (get_bytes_per_component): Unknown or not supported format: 808530000
Sep 16 09:11:37 laptop-misha kwin_wayland[11127]: kf.windowsystem: static bool KX11Extras::mapViewport() may only be used on X11
Sep 16 09:11:38 laptop-misha wpa_supplicant[1041]: wlp2s0: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=-81 noise=9999 txrate=103200
Sep 16 09:11:38 laptop-misha kwin_wayland[11127]: kwin_scene_opengl: 0x501: GL_INVALID_VALUE error generated. <levels>, <width> and <height> must be 1 or greater.
Sep 16 09:11:38 laptop-misha kwin_wayland[11127]: kwin_scene_opengl: Invalid framebuffer status:  "GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT"
Sep 16 09:11:38 laptop-misha kwin_wayland[11127]: kwin_scene_opengl: 0x502: GL_INVALID_OPERATION error generated. Framebuffer name must be generated before being bound.
Sep 16 09:11:38 laptop-misha kernel: [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
Sep 16 09:11:38 laptop-misha kernel: [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
Sep 16 09:11:38 laptop-misha kernel: [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
Sep 16 09:11:38 laptop-misha kernel: [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
Sep 16 09:11:38 laptop-misha kernel: [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
Sep 16 09:11:38 laptop-misha kernel: [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
Sep 16 09:11:38 laptop-misha kwin_wayland_wrapper[11191]: src/nv_gbm.c:123: GBM-DRV error (nv_gbm_bo_create): DRM_IOCTL_NVIDIA_GEM_ALLOC_NVKMS_MEMORY failed (ret=-1)
Sep 16 09:11:38 laptop-misha kwin_wayland_wrapper[11191]: src/nv_gbm.c:123: GBM-DRV error (nv_gbm_bo_create): DRM_IOCTL_NVIDIA_GEM_ALLOC_NVKMS_MEMORY failed (ret=-1)
Sep 16 09:11:38 laptop-misha kwin_wayland_wrapper[11191]: src/nv_gbm.c:123: GBM-DRV error (nv_gbm_bo_create): DRM_IOCTL_NVIDIA_GEM_ALLOC_NVKMS_MEMORY failed (ret=-1)
Sep 16 09:11:38 laptop-misha kwin_wayland_wrapper[11191]: src/nv_gbm.c:123: GBM-DRV error (nv_gbm_bo_create): DRM_IOCTL_NVIDIA_GEM_ALLOC_NVKMS_MEMORY failed (ret=-1)
Sep 16 09:11:38 laptop-misha kwin_wayland_wrapper[11191]: src/nv_gbm.c:123: GBM-DRV error (nv_gbm_bo_create): DRM_IOCTL_NVIDIA_GEM_ALLOC_NVKMS_MEMORY failed (ret=-1)
Sep 16 09:11:38 laptop-misha kwin_wayland_wrapper[11191]: src/nv_gbm.c:123: GBM-DRV error (nv_gbm_bo_create): DRM_IOCTL_NVIDIA_GEM_ALLOC_NVKMS_MEMORY failed (ret=-1)
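
To tell the allocation failures in a journal dump like the one above apart from the unrelated noise, a small tally per log source can help. This is only a sketch: the regular expression assumes the syslog-style prefix seen in this log (month, day, time, host, process with optional PID), the two search strings are taken from the excerpt above, and the function name `allocation_failures` is made up for illustration.

```python
import re
from collections import Counter

# Assumed prefix: "Sep 16 09:11:38 host process[pid]: message" (the PID is optional)
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+([^\s\[:]+)(?:\[\d+\])?:\s(.*)$")

# Message fragments copied from the log excerpt above
FAILURE_MARKERS = (
    "Failed to allocate NVKMS memory",
    "GEM_ALLOC_NVKMS_MEMORY failed",
)


def allocation_failures(log_text: str) -> Counter:
    """Count NVKMS/GEM allocation failure lines per log source (process name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed prefix
        source, message = match.groups()
        if any(marker in message for marker in FAILURE_MARKERS):
            counts[source] += 1
    return counts
```

Feeding it the excerpt above would attribute the failures to `kernel` and `kwin_wayland_wrapper`, while the `brave` format warnings and the `wpa_supplicant` line are ignored.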

It is ridiculous how NVIDIA is treating its customers. I will no longer buy anything with the word "NVIDIA" in it or recommend this company to others. I highly doubt spending a little money to improve their Linux drivers would cause them to no longer be the "most valuable company in the world". However, who knows, maybe Microsoft is paying off NVIDIA to make their Linux drivers worse.

MishaProductions avatar Sep 16 '24 13:09 MishaProductions

Can't play any multiplayer games because my computer randomly crashes and locks up when my $1500 GPU runs out of VRAM. I wish the 64 GB of system memory that's always empty were usable by the NVIDIA driver so this stopped happening. Definitely learned my lesson.

Hellzbellz123 avatar Sep 26 '24 20:09 Hellzbellz123

Yeah, I learned my lesson too, never buying anything with the "nvidia" logo ever again.

MishaProductions avatar Sep 26 '24 20:09 MishaProductions

FWIW: nvidia_uvm does provide shared memory - for CUDA, that is:

NVIDIA Unified Memory kernel module (/lib/modules/`uname -r`/kernel/drivers/video/nvidia-uvm.ko); this kernel module provides functionality for sharing memory between the CPU and GPU in CUDA programs. It is generally loaded into the kernel when a CUDA program is started, and is used by the CUDA driver on supported platforms.

I don't think Nvidia claims otherwise anywhere? I guess from their perspective the Linux GPU compute market is generally more important than the desktop one, and I do wonder if NVK and family will be the better option for non-CUDA use cases soon enough.

v1993 avatar Sep 27 '24 15:09 v1993

This doesn't appear to be a kernel issue, but one with the NVIDIA drivers specifically. After switching to a computer with an AMD GPU, the issue is gone. When I used my laptop with an NVIDIA GPU, Nouveau worked pretty well for me and had better performance compared to the closed source junk, but USB-C DSC is unsupported.

MishaProductions avatar Sep 28 '24 20:09 MishaProductions

I think it's time to close this now (yes, yes I know).

As mentioned many times before, this is a repo for kernel modules, monitored by developers working on the kernel modules, and the only issues that belong here are bug reports relating to kernel modules. This is a feature request, rather than a bug report, and one that has no kernel component. nvidia_uvm.ko is not relevant here.

The proper place to make these requests is the forums, where the overall end user sentiment is collected and sent up the management chain to someone who has the power to prioritize work on a feature. The developers on this repo cannot do this.

mtijanic avatar Sep 30 '24 11:09 mtijanic

In my opinion, this bug report was closed in error.

This report is for a kernel component and it's not a feature request, it's people reporting that the new kernel module is lacking standard DRM functionality (namely GTT support). It would be similar to a bug report if the open kernel module failed to set the GPU clock speed beyond its initial performance state (user rightfully expects it to work).

NVIDIA used to have this working in the proprietary drivers for GeForce TurboCache support about 12 years ago and any long term buyer of NVIDIA's products expects this to still work on Linux.

martynhare avatar Nov 20 '24 02:11 martynhare

I confirm this is a major issue not resolved. Nvidia, stop playing the fool! Nvidia, stop playing the fool! Nvidia, stop playing the fool!

If you don't like open source, if you like Windows only, close this repo and make your big money!

linuxiaobai avatar Feb 04 '25 03:02 linuxiaobai

The proper place to make these requests is the forums, where the overall end user sentiment is collected and sent up the management chain to someone who has the power to prioritize work on a feature. The developers on this repo cannot do this.

https://forums.developer.nvidia.com/t/non-existent-shared-vram-on-nvidia-linux-drivers/260304 You mean the forums where the developers ghost literally everyone? You could maybe pass them the word, or share with us some internal information that could lead us to believe that, at least, someone at this trillion dollar company is actually working on it?

awsms avatar Feb 07 '25 10:02 awsms

I don't think anyone will help at this point.

Having shared memory support on Linux would probably mean you could run your own LLMs locally. Make no mistake: they do not want this. OpenAI builds datacenters for you all where they are deploying all the enterprise GPUs from NVIDIA as we speak.

At least they could've delivered the advertised feature instead of pushing an alpha state driver (that does not even support Wayland out of the box in 2025 LET THAT SINK IN).

Hate to say it, but vote with your wallet. I do not think anyone in NVIDIA will work towards finishing their alpha state driver in the next 5 years.

Now I understand Linus.

bioluks avatar Feb 09 '25 02:02 bioluks

This has literally nothing to do with suppressing local LLMs. CUDA has had shared memory support for a good while; it's specifically graphics applications that are getting the short end of the stick. And really, this wouldn't even make much sense, with most desktops, and thus probably most people who'd like to try out a local LLM, running Windows. Also, keep in mind that the datacenters in question are likely running identical or nearly identical drivers to what is released to the general public too. The cloud is just powerful computers elsewhere, at the end of the day.

While both the lack of this feature and Nvidia's (effective lack of) response are baffling, I doubt there's some kind of grand conspiracy behind all of this other than it not being seen as a high-priority issue, with enterprise customers either having enough VRAM for everything or using CUDA and its memory sharing features.

P.S.: hate to break it to you, but I doubt desktop Linux users are going to make much of a dent in Nvidia's income at this point, even if they all never buy green side products again. That said, switching away for the sake of a better experience sounds sensible.

v1993 avatar Feb 09 '25 11:02 v1993

LLMs went viral only recently; obviously this is not the reason this issue is not fixed. AFAIK we have had this issue since 2016. There are multiple reasons why they are keeping the VRAM low. Stop focusing on this one only. And does calling out a possibility make things a conspiracy now?

I know the average Linux user won't be able to hurt NVIDIA. Stock prices exploded. We saw what happened with the latest 50 series (instant sellout). There is a brand loyalty that shouldn't exist in the first place, and this was a warning to remind people of that.

They have enough script kiddies having access to the source code, and I'm sure they could make it happen in a few weeks if they wanted. Like you said this is a priority issue where we are all the way down in this list.

I'll make the switch for a better experience.

bioluks avatar Feb 09 '25 12:02 bioluks

if you want a forum thread focusing on end-user impact of this, see https://forums.developer.nvidia.com/t/wayland-applications-freezing-sporadically-suspected-vram-issues/329684

PandorasFox avatar Apr 13 '25 03:04 PandorasFox

AI and a lot of forums suggest using export CUDA_MANAGED_FORCE_DEVICE_ALLOC=, but it does not work on Linux.

Finally, I found that NVIDIA CANCELED the SUPPORT after CUDA 8.0. Now, there seems to be NO WAY to enable shared/unified GPU + CPU/main memory on Linux servers and desktop computers.

Now I have found this issue! I am not the only one affected!

So, Strongly Agree with Linus's word: NVIDIA F*ck y**!

Byron-Ding avatar Apr 24 '25 07:04 Byron-Ding