nvidia-smi on WSL side causes NVRM driver non-paged pool memory leak
Windows Version
Microsoft Windows [Version 10.0.22631.3737]
WSL Version
WSL version: 2.4.13.0
Are you using WSL 1 or WSL 2?
- [x] WSL 2
- [ ] WSL 1
Kernel Version
Custom: Linux version 6.11.0-WSL2-STABLE+
Distro Version
Ubuntu 22.04
Other Software
Nvidia RTX 3090, driver version 566.03
Repro Steps
- Open PoolMon. Note the non-paged pool usage of NVRM (the Nvidia driver). I started at 23MB allocated.
- (Control run) Run `nvidia-smi -q -x` a lot of times on the Windows side, using a command like `1..1000 | ForEach-Object { Start-Process -NoNewWindow -FilePath "nvidia-smi.exe" -ArgumentList "-q", "-x" }`. Note the memory usage of NVRM in PoolMon.
- (Test run) Run the same command 1000 times on the WSL side (I used `for i in {1..1000}; do nvidia-smi -q -x & done`). Check the memory consumption of NVRM in PoolMon. A sequential variant is sketched after this list.
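For a more controlled run than the backgrounded bash loop, the same test can be done sequentially. This is a minimal sketch of my own (not part of the original repro), assuming `nvidia-smi` is on the PATH; the 1000-iteration count matches the runs above:

```python
# Sequential repro sketch: run nvidia-smi N times, letting each process exit
# before starting the next, so the PoolMon reading is not skewed by many
# concurrent processes.
import subprocess

BINARY = "nvidia-smi"  # use "nvidia-smi.exe" for the Windows-side control run
ITERATIONS = 1000

for i in range(ITERATIONS):
    subprocess.run([BINARY, "-q", "-x"], stdout=subprocess.DEVNULL, check=True)
    if (i + 1) % 100 == 0:
        print(f"{i + 1}/{ITERATIONS} runs complete; check the NVRM tag in PoolMon")
```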
Expected Behavior
I would expect that running nvidia-smi any number of times should not result in more memory being permanently allocated.
Actual Behavior
On the Windows side, NVRM's non-paged pool consumption ended up peaking at 96MB and then went back down to 23MB, which is as I would expect.
On the WSL side, memory consumption ended up going up to 36MB and staying there. This is problematic for anything relying on nvidia-smi for monitoring GPU activity, and there may be other things causing leaks with a common root cause.
I have observed similar memory leaks for over a year on my setup, across different WSL versions, drivers, and kernels, though this is the first time I have had a minimal reproducible example. I have often ended up with as much as 20 GB of memory used by the Nvidia driver, alongside general system instability.
Diagnostic Logs
No response
Logs are required for review from WSL team
If this is a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise, please attach logs by following the instructions below; your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.
How to collect WSL logs
Download and execute collect-wsl-logs.ps1 in an administrative PowerShell prompt:
Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1
The script will output the path of the log file once done.
If this is a networking issue, please use collect-networking-logs.ps1, following the instructions here
Once completed, please upload the output files to this GitHub issue.
Click here for more info on logging. If you choose to email these logs instead of attaching them to the bug, please send them to [email protected] with the number of the GitHub issue in the subject, and in the message a link to your comment in the GitHub issue, then reply with '/emailed-logs'.
Diagnostic information
.wslconfig found
Detected appx version: 2.4.13.0
Another possible instance of this bug happens whenever I initialize a CUDA context using PyTorch. Simply opening an interactive Python shell and running:
>>> import torch
>>> x = torch.randn((1000,1000), device='cuda')
>>> exit()
appears to increase the difference between non-paged pool allocs and frees by 13. `nvidia-smi -q -x` appears to increase it by 43 each time. I ran these tests on WSL 2.5.4 with the standard kernel for that version.
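A loop that repeats the same context create/teardown in separate child processes may make the drift easier to spot in PoolMon. This is a rough sketch of my own, assuming a CUDA-enabled PyTorch install inside WSL; the iteration count is arbitrary:

```python
# Rough sketch: each child process initializes one CUDA context via PyTorch
# and then exits, so any per-context alloc/free imbalance under the NVRM pool
# tag should accumulate across iterations and become visible in PoolMon.
import subprocess
import sys

SNIPPET = "import torch; torch.randn((1000, 1000), device='cuda')"
ITERATIONS = 100  # arbitrary; raise it if the drift is too small to see

for i in range(ITERATIONS):
    subprocess.run([sys.executable, "-c", SNIPPET], check=True)
    if (i + 1) % 10 == 0:
        print(f"{i + 1}/{ITERATIONS} CUDA contexts created and torn down")
```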
I'm having this issue as well, running a Nosana node inside WSL. The non-paged pool reaches >3.5 GB on my 16 GB system within 3-4 days and the machine becomes practically unusable, even after shutting WSL down.
I'm running a 3070 on Ubuntu 22.04 and have updated drivers multiple times; I've had this issue since at least August 2024.