DRM GTT/Shared memory broken in Nvidia Open drivers on Linux
NVIDIA Open GPU Kernel Modules Version
565.77-1 (but this affects every version)
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Arch Linux
Kernel Release
6.12.6-arch1-1 (but this affects every Linux kernel version, at least all of the 6.x series)
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 4060 Laptop GPU (AD107-B)
Describe the bug
Just as described in #663, #618, https://forums.developer.nvidia.com/t/vram-allocation-issues/239678 and https://forums.developer.nvidia.com/t/non-existent-shared-vram-on-nvidia-linux-drivers/260304
The standard DRM GTT functionality is broken in the nvidia-open modules, which makes it impossible to use shared memory on Linux with NVIDIA GPUs. This is not a minor missing feature but a major functional bug that affects every Linux user with an NVIDIA GPU. #663 was closed in error, as described by @martynhare in #663#issuecomment-2487194834, so it is rather strange to keep ignoring it when it causes many games, Xorg, Wayland, PyTorch and plenty of other AI workloads to crash and complain even though there is more than enough RAM available for them.
To Reproduce
Just use the latest release of the nvidia-open module; the problem is present there.
nvidia-uvm won't help at all, and it's hard to find anything still using UVM in 2024.
Many AI tools don't support UVM at all, or have a UVM branch that has been unmaintained for years.
As for games, nvidia-uvm is only for CUDA. Some games can use DXVK, which supports falling back to system RAM, but that is not a general solution and does not fix the underlying problem at all.
Bug Incidence
Always
nvidia-bug-report.log.gz
An nvidia-bug-report.log.gz won't help here at all; this is a widespread bug present in every version of nvidia-open in any environment.
Since a bot will close issues that lack an nvidia-bug-report.log.gz, I'll upload a dummy one.
nvidia-bug-report.log.gz
More Info
Please fix this bug; it has existed for years and has caused pain for plenty of Linux users who own an NVIDIA GPU.
I confirm this is a major issue that has not been resolved. Nvidia, stop playing the fool!
If you don't like open source, if you only like Windows, then close this repo and make your big money!
It is worth noting that this also cripples the performance of their low-VRAM mobile and embedded GPU lines the most. Mobile and embedded GPU performance under Linux becomes unusable compared to every other offering.
If Nvidia wants to do proper embedded/edge computing with platforms like the Jetson Orin Nano, they should have fixed this issue ages ago, as it is severely crippling those product lines. I can't see this as anything other than an honestly pretty baffling decision on their part. AMD, and to a lesser extent Intel and ARM, are about to eat their lunch in that market. If Nvidia wants to maintain its dominance in the AI market, they need to implement this feature, and soon. Otherwise they will get undercut and eventually toppled on the integration side of the market by competitors.
So, to echo @linuxiaobai's sentiment in a more productive manner: Nvidia, what are you doing???
Can confirm that this is still an issue on 570.124.04. The driver claims HMM support and should be able to spill into system RAM, yet it does not function: in certain games, once VRAM is filled the GPU can't use system RAM, so performance drops to a 5-9 fps slideshow.
EDIT: Just in case it is needed: when you run glxinfo -B on a dedicated AMD/Intel GPU, you will see that the "total available memory" it reports actually includes system RAM, with the dedicated VRAM listed above it; on NVIDIA, both lines show only the VRAM, and VRAM is the only memory that gets used.
Just to clarify: the feature being discussed here that works on Windows is VRAM being paged out to system memory so that something else can be allocated into VRAM? (which may also involve swapping back pages that were previously paged out to system memory)
It's not a feature that extends the VRAM allocation capacity: if a workload needs more than 8GB of VRAM to succeed, say 12GB (a larger LLM model, for example), additional system memory will not let an 8GB card handle it.
The way the feature works is that if VRAM is allocated to another workload that is not actively being processed, that allocation can be moved from VRAM to system RAM when the VRAM is needed for another active workload. This is a distinction worth noting 😅 (a system with 128GB RAM does not suddenly let a GPU with 4GB VRAM run LLM models that would require 32GB VRAM)
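To make that distinction concrete, this is the sort of minimal check I have in mind (a rough sketch only; the file name and the 70% figure are arbitrary choices of mine, and I haven't verified it across drivers). Run two copies of it at the same time: my understanding is that on Windows the WDDM memory manager can page the idle instance's allocation out to system RAM so both succeed, whereas a single allocation larger than physical VRAM is typically refused unless CUDA's system-memory fallback is in play, and on the current Linux driver the second instance tends to fail outright once physical VRAM is exhausted.

```cuda
// vram_hold.cu -- rough sketch, not an official reproduction.
// Allocates ~70% of the GPU's total VRAM with plain cudaMalloc, touches it,
// then holds the allocation for 60 seconds so a second copy of the program
// can be started alongside it.
#include <cstdio>
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    size_t want = totalB / 10 * 7;               // ~70% of total VRAM
    void* p = nullptr;
    cudaError_t err = cudaMalloc(&p, want);
    if (err != cudaSuccess) {
        std::printf("cudaMalloc(%zu MiB) failed: %s\n",
                    want >> 20, cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(p, 0, want);                      // touch it so it is actually backed
    cudaDeviceSynchronize();
    std::printf("holding %zu MiB of device memory for 60 s...\n", want >> 20);
    std::this_thread::sleep_for(std::chrono::seconds(60));
    cudaFree(p);
    return 0;
}
```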
@polarathene Actually, yes, GTT does do exactly what you say it doesn't do. Source: I do it myself on my systems. It does allow loading a model larger than the VRAM and running it exclusively on the card. Doing so results in a lot of page swapping between RAM and VRAM, which is not particularly efficient, but it does allow exactly what you say it doesn't. That is, unless your application somehow bypasses virtual memory management and allocates contiguous blocks of memory instead of using pages like regular applications do.
NVIDIA's HMM, however, does not currently do that. It is just a related memory-management idea/technology that could be extended to function like GTT. So it is no surprise that it is not working despite HMM being enabled, unless they change the way HMM functions.
HMM requires additional application-layer support to work (CUDA stuff), whereas GTT is implemented almost entirely in the kernel/driver, meaning applications do not need to specifically support it for it to function.
TL;DR: Yes, you can use GTT to load a 32GB model onto a 4GB card if you have enough system RAM. It effectively allows using system RAM as VRAM, bar very specific exceptions.
NVIDIA's current HMM implementation does not allow you to do that, although you could with sufficient application-level support for it. I don't, however, have a good example of that working.
The reason it is baffling that this has not been implemented yet is precisely because it is practically the magic bullet for running out of VRAM, the competition's drivers implemented it ages ago, that solution is open source and available for everybody in the world to read in AMD's implementation, and there are additional advantages to being able to do this that are not immediately obvious.
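For reference, the application-level path I mean would look something like the sketch below (rough and unverified on my side; as said, I don't have a polished working example, and whether the oversubscription actually succeeds depends on the GPU generation, OS and driver). The point is only that the application has to opt in through managed memory; nothing here helps an unmodified Vulkan/OpenGL program the way GTT does on amdgpu.

```cuda
// managed_oversub.cu -- rough sketch of the "application-level support" path.
// Deliberately asks for 1.5x the card's total VRAM via cudaMallocManaged and
// touches it from a kernel, letting the driver migrate pages on demand.
// Expectation (unverified): may work on Linux with Pascal-or-newer GPUs where
// managed-memory oversubscription is supported; not on Windows/WDDM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(unsigned char* buf, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 1;
}

int main() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    size_t want = totalB + totalB / 2;            // 1.5x total VRAM
    unsigned char* buf = nullptr;
    cudaError_t err = cudaMallocManaged(&buf, want);
    std::printf("cudaMallocManaged(%zu MiB): %s\n", want >> 20, cudaGetErrorString(err));
    if (err != cudaSuccess) return 1;
    fill<<<(unsigned)((want + 255) / 256), 256>>>(buf, want);
    err = cudaDeviceSynchronize();                // surfaces migration/launch errors
    std::printf("kernel + sync: %s\n", cudaGetErrorString(err));
    cudaFree(buf);
    return 0;
}
```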
> Actually, yes, GTT does do exactly what you say it doesn't do. Source: I do it myself on my systems. It does allow loading a model larger than the VRAM and running it exclusively on the card.

Can you clarify, please: are you actually allocating more memory than the VRAM capacity allows in order to perform a computation?
Or is your workload able to operate on a slice of that, as with LLM models where the model is split across system and GPU memory and the actual constraint is the layer width? If that layer width does not fit into GPU memory, it should fail.
> Whereas GTT is implemented almost entirely in the kernel/driver, meaning applications do not need to specifically support it for it to function.

Right... so, like I said, allocations are permitted as long as the workload can operate within VRAM, paging out anything not actively required.
> Yes, you can use GTT to load a 32GB model onto a 4GB card if you have enough system RAM. It effectively allows using system RAM as VRAM, bar very specific exceptions.

Do you have an example I can run locally? When I have tried to load models too large for my VRAM capacity, despite having plenty of system RAM, the allocation would fail.
I'd put one together myself for a reproduction, but I'm a bit tight on time atm 😓
I have Windows and have tried running some projects in WSL2 with Docker where I observed system memory being used when my VRAM was lacking, but I also recall when that would fail despite having sufficient RAM spare.
As this issue is comparing existing Windows support vs Linux, I shouldn't have encountered the allocation failure if what you're stating is true 🤷 This isn't my area of expertise, so I'm happy to be corrected, but I'm pretty sure allocation is constrained by the VRAM capacity for the kind of workload you can process on the GPU.
> I have Windows and have tried running some projects in WSL2 with Docker where I observed system memory being used when my VRAM was lacking, but I also recall when that would fail despite having sufficient RAM spare.

That is to be expected because of your specific scenario.
On Native Windows with consumer drivers: For non-CUDA workloads the total limit across all processes combined is 100% of VRAM plus up to 50% of system RAM, after which the allocation will fail. This is managed by Windows, not the NVIDIA driver. For CUDA workloads (on normal consumer NVIDIA cards), the drivers need to be configured to allow system RAM use from within NVIDIA Control Panel and then the theoretical limit is the same, but in practice may fail sooner because Windows 11 happily prioritises visible foreground processes over invisible background ones and NVIDIA’s consumer drivers are at the mercy of WDDM.
On WSL2 with consumer drivers on desktop Windows 11: Less than 100% VRAM will be available to Linux as well as less than 50% system RAM usage due to paravirtualization (PV) overheads and host resource use. NVIDIA employees had to write a special driver for WSL2 to make everything work and due to NDAs, relevant employees in the forums have been unable to explain why you cannot use anywhere near as much system RAM even in situations where a separate GPU is used for host VRAM. But it is expected and is not an inherent Linux limitation.
I hope this helps explain why you’re seeing what you’re seeing @polarathene
NVIDIA only needs to implement basic support for utilising system RAM on native Linux in this open kernel module, not WSL2.
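If anyone wants to see what budget their own setup actually reports to applications, a quick probe of the CUDA runtime is a reasonable starting point (a rough sketch; it only prints what cudaMemGetInfo and cudaGetDeviceProperties return for device 0 and proves nothing about paging behaviour, but it makes the Windows-versus-Linux comparison above easy to repeat):

```cuda
// memprobe.cu -- rough sketch: print the memory figures the CUDA runtime reports.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);              // driver's view of device memory
    std::printf("%s: total %zu MiB, free %zu MiB\n",
                prop.name, totalB >> 20, freeB >> 20);
    return 0;
}
```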
> I hope this helps explain why you’re seeing what you’re seeing @polarathene

Sorry, no, that didn't help me understand what I've observed. (EDIT: See my follow-up comment; I've been discussing the NVIDIA feature support from a different perspective, Windows+CUDA, rather than the Linux+ROCm/GTT angle that others on this issue seem to be relating to.)
> On Native Windows with consumer drivers: For non-CUDA workloads the total limit across all processes combined is 100% of VRAM plus up to 50% of system RAM, after which the allocation will fail. This is managed by Windows, not the NVIDIA driver.

I am aware of this one, but I wasn't aware of the lack of involvement from NVIDIA's driver.

> For CUDA workloads (on normal consumer NVIDIA cards), the drivers need to be configured to allow system RAM use from within NVIDIA Control Panel and then the theoretical limit is the same, but in practice may fail sooner because Windows 11 happily prioritises visible foreground processes over invisible background ones and NVIDIA’s consumer drivers are at the mercy of WDDM.

That is probably because I was thinking of the CUDA case when making my statements above, as this is where I have heard complaints about support lacking on Linux.
> On WSL2 with consumer drivers on desktop Windows 11: Less than 100% VRAM will be available to Linux as well as less than 50% system RAM usage due to paravirtualization (PV) overheads and host resource use. NVIDIA employees had to write a special driver for WSL2 to make everything work

I am aware that WSL2 instances have a 50% system RAM limitation of their own (at least by default), but I've not looked into whether CUDA allocations end up in host memory or are counted as used inside the WSL2 instance.
I know that just reading an LLM model to load it will fill up the file/buffer cache in WSL2, and that the Windows host will consider that allocated memory despite it being disposable (I'm pretty sure I've had OOM events on the host before due to this, especially since on low disk space it can exhaust the disk as Windows pages to disk). I can flush that from within WSL2 to free it up.
This doesn't change my observations though: despite having 16GB for WSL2, I cannot seem to use any models that would not fit into VRAM when processing, such as the layer width IIRC.
If I have 8GB VRAM, say 2GB of that is already used by the host, and the 32GB of system memory is mostly free, then AFAIK running a CUDA workload that requires 10GB of VRAM to process will fail, unless it doesn't need all of that allocated in VRAM at once to perform a computation, such that it can page through system memory. IIRC that allows making smaller allocations like 5GB + 5GB, but a single 10GB allocation would fail; 7-8GB would be the limit.
If I can find the time, I'll look into writing a basic CUDA program to verify that, but if anyone else has a reproduction I could use instead, that'd be great. I am aware of software like llama.cpp which can offload layers to the CPU, and it does that implicitly if not all layers are explicitly forced into VRAM.
So far, without a reproduction for clarity, I can only go by my own observations. If someone in this thread is more knowledgeable and certain I'm mistaken, by all means assist with a basic reproduction where a CUDA program allocates more than the VRAM capacity for a computation.
> due to NDAs, relevant employees in the forums have been unable to explain why you cannot use anywhere near as much system RAM even in situations where a separate GPU is used for host VRAM. But it is expected and is not an inherent Linux limitation.

FWIW: I'm not making any statements about Linux limitations; I'm fully with you all on Linux gaining parity with this feature that is available on Windows. I prefer to use Linux myself, so the reduced flexibility from lacking this feature is discouraging.
> NVIDIA only needs to implement basic support for utilising system RAM on native Linux in this open kernel module, not WSL2.

I'm not sure there's much specific to WSL2 there; I still need to install and use the CUDA version that comes from the Windows host, and upgrading CUDA requires stopping WSL2 before CUDA can be used again, at which point nvidia-smi will report the new driver.
So for the most part, AFAIK, the WSL2 support is mostly a bridge to the Windows host support, similar to Mesa's Venus driver for Vulkan or the native context for AMD (allowing Linux guests to use the Linux host's AMD driver/hardware without the performance loss).
How is this related to getting GTT support for the Nvidia driver on Linux? Debugging your specific application on Windows or WSL and explaining the specifics of how to use GTT/CUDA/HMM/AI is not the point of this issue @polarathene. Please have this conversation somewhere where it doesn't muddy the waters about this important problem with the Nvidia drivers for Linux. I suggest opening a separate issue where appropriate and perhaps linking back to this one if you think that's important, for example a GitHub issue against the actual application that doesn't allow you to properly utilize the GTT/shared memory capability.
> How is this related to getting GTT support for the Nvidia driver on Linux?

Perhaps I'm mistaken. (EDIT: It appears so; I'll hide my messages to minimize noise.)
I was under the impression that this issue was about the allocation behaviour discussed above that's currently exclusive to Windows, and that it was requesting the same feature for Linux.
My comments have been seeking clarity on the expected behaviour, because if this is about how CUDA leverages system RAM on Windows when VRAM is insufficient, then from my own observations and understanding it seemed to be misunderstood here.
The original issue description specifically notes:
> causes many games, Xorg, Wayland, PyTorch and plenty of other AI workloads to crash and complain even though there is more than enough RAM available for them

You also stated:
> If Nvidia wants to maintain its dominance in the AI market, they need to implement this feature, and soon.

You then respond to me with:
> Actually, yes, GTT does do exactly what you say it doesn't do. Source: I do it myself on my systems. It does allow loading a model larger than the VRAM and running it exclusively on the card. Doing so results in a lot of page swapping between RAM and VRAM, which is not particularly efficient, but it does allow exactly what you say it doesn't.

> Yes, you can use GTT to load a 32GB model onto a 4GB card if you have enough system RAM. It effectively allows using system RAM as VRAM, bar very specific exceptions.

This seems well aligned with the topic I've been engaging with above, so it is clearly related. The only difference is that my experience conflicts with what you're stating works.
> Debugging your specific application on Windows or WSL and explaining the specifics of how to use GTT/CUDA/HMM/AI is not the point of this issue @polarathene. Please have this conversation somewhere where it doesn't muddy the waters about this important problem with the Nvidia drivers for Linux.

Without a minimal CUDA reproduction that works on Windows with the feature you want to see on Linux, how are you so confident that your understanding of this feature is correct versus my own observations?
If you're actually writing CUDA software, you'd be able to confirm this. From the sound of it, you're like me, with minimal experience developing for GPU compute, where your experience is user-facing.
All I've done in this thread is try to clarify that there may be a misunderstanding of the functionality you've described as working, while my own observations were dismissed without any actual resource/evidence I can use to verify your claims.
When I can find the time to spare, I'll look into creating a reproduction example to reference (a rough sketch of what I have in mind follows the list below). Allocating an array that's 8GB in size, on a GPU with 4GB VRAM, on a Windows host with 32GB RAM, should be sufficient? According to you that would succeed, while I would expect it to fail.
- If it were instead 4 separate arrays of 2GB each, I would understand that to work, with this feature swapping to system RAM and, when the program needs to operate on one of them, swapping that 2GB back into VRAM.
- This is specifically about the "system fallback policy" on Windows that CUDA supports (for `cudaMalloc()` AFAIK), not other API calls like `cudaMallocHost()` which explicitly allocate host memory (that also has a 50% limit).
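Something along these lines is what I have in mind for that reproduction (a rough, untested sketch; the 8 GiB figure is just the example size from above, and the file name is arbitrary):

```cuda
// alloc_test.cu -- rough, untested sketch of the reproduction described above.
// Attempts a single 8 GiB cudaMalloc on a card with less VRAM than that, then
// touches the buffer. Whether this succeeds presumably depends on the OS/driver
// policy being discussed (e.g. the Windows "sysmem fallback" setting for CUDA).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t want = 8ull << 30;               // 8 GiB, deliberately > VRAM
    unsigned char* buf = nullptr;
    cudaError_t err = cudaMalloc(&buf, want);
    if (err != cudaSuccess) {
        std::printf("cudaMalloc(8 GiB) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaMemset(buf, 0xAB, want);            // force the allocation to be backed
    cudaDeviceSynchronize();
    std::printf("allocation and touch finished: %s\n", cudaGetErrorString(err));
    cudaFree(buf);
    return 0;
}
```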
Apologies, I've reviewed the discussion again, and while I thought this was referencing the Windows-specific support NVIDIA has with CUDA (to achieve similarly transparent usage of system memory as a fallback when VRAM is insufficient)... I see the request is more focused on parity with the existing Linux GPU-vendor driver support for this functionality, rather than whatever the NVIDIA Windows driver is doing 🤷 (which is apparently equivalent to DRM GTT on Linux).
Thus, if I demonstrate a CUDA program failing to allocate an array larger than the VRAM capacity, it would not be relevant to you, I take it?
I would assume a similar limitation exists with AMD/Intel GPUs even with GTT support, but I cannot comment on that. I'll bow out of the discussion then, as I don't have anything useful to contribute (unless I run Linux as the host with an AMD/Intel GPU to attempt a similar allocation to the one I described).
It looks like the 580 drivers on Linux (proprietary and open kernel module) now have a very limited form of HMM support which only some applications can make use of. But this ticket still shouldn't be closed as it's not true shared memory support.
The technical requirements to make use of this support appear to be as follows (a small probe sketch follows the list):
- Having an NVIDIA card with Resizeable BAR support (at least 3000 series?)
- Your CPU needs to be at least Intel 10th Gen or Ryzen 5000 series to use this
- Resizeable BAR must be enabled (and legacy BIOS option ROM support disabled)
- Stars and planets must align, as sometimes doing this doesn't work because Linux (Sometimes ReBAR needs to be disabled in UEFI to enable it on Linux... weird!)
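A quick way to check what a given driver/GPU combination advertises here is to query the related CUDA device attributes (a rough sketch; the attribute names come from the CUDA runtime API, and a value of 1 only tells you what the driver reports, not that offloading actually works in practice):

```cuda
// hmm_probe.cu -- rough sketch: print the device attributes related to
// pageable/managed memory access for device 0.
#include <cstdio>
#include <cuda_runtime.h>

static void report(const char* name, cudaDeviceAttr attr) {
    int value = 0;
    cudaDeviceGetAttribute(&value, attr, 0);
    std::printf("%-50s %d\n", name, value);
}

int main() {
    report("cudaDevAttrConcurrentManagedAccess", cudaDevAttrConcurrentManagedAccess);
    report("cudaDevAttrPageableMemoryAccess", cudaDevAttrPageableMemoryAccess);
    report("cudaDevAttrPageableMemoryAccessUsesHostPageTables",
           cudaDevAttrPageableMemoryAccessUsesHostPageTables);
    return 0;
}
```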
The 580 drivers appear to have three big limitations:
- They will still allow the framebuffer memory to become so full applications can't launch
- No single application can have access to more VRAM than the GPU has in total
- Compute processes cannot be transparently offloaded from GPU VRAM to system RAM
That third limitation means any video game executed in Proton can only use GPU VRAM and nothing else, as they all show up as using CUDA in nvidia-smi. It also means Chromium, Firefox, Steam's Electron UI and other browser-engine processes with accelerated video decoding will still steal precious VRAM from Proton games.
However, we should nonetheless congratulate NVIDIA on finally getting some rudimentary support in place.
Hopefully by the time the 600 series drivers release, they'll have support for reliably offloading to system RAM for all scenarios, even if system RAM cannot supplement beyond what the card offers in actual VRAM. That would be a reasonable interim goal.
Ideally, they will implement proper DRM GTT for their open drivers, and leave the old implementation for their proprietary module (which can't use GPL symbols, among other licence limits).