WSL2 CUDA Does Not Respect `CUDA - Sysmem Fallback Policy`
Windows Version
Microsoft Windows [Version 10.0.22635.3061]
WSL Version
2.1.0.0
Are you using WSL 1 or WSL 2?
- [X] WSL 2
- [ ] WSL 1
Kernel Version
5.15.137.3-1
Distro Version
Ubuntu 22.04
Other Software
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.36                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   38C    P8              16W / 450W |   1528MiB / 24564MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        38      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
torch 2.3.0.dev20231227+cu121
Repro Steps
With a recent NVIDIA GPU driver installed, open the NVIDIA driver settings on Windows, disable `CUDA - Sysmem Fallback Policy` globally, and restart the computer.
In a WSL terminal, execute the following command to allocate a tensor that should exceed the total GPU memory available (30 GiB required vs. 24 GiB available).
python3 -c 'import torch; x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)'
Instead of throwing an OOM exception, the command executes successfully.
However, the expected behaviour is to throw an OOM exception so that applications like Stable Diffusion WebUI can detect that the available GPU memory is inadequate and choose a memory-efficient way to do inference. The current behaviour silently falls back to the much slower CPU (system) memory, which makes inference really slow.
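For reference, the pattern such applications rely on looks roughly like the sketch below; allocate_activations is just a hypothetical helper, not actual WebUI code. Because of the silent fallback, the except branch is never reached on WSL2.

import torch

def allocate_activations(shape):
    # Hypothetical helper: try the GPU first and switch to a memory-efficient
    # path only if the allocator reports an out-of-memory condition.
    try:
        return torch.empty(shape, dtype=torch.uint8, device="cuda")
    except torch.cuda.OutOfMemoryError:
        print("OOM detected, falling back to a memory-efficient path")
        return torch.empty(shape, dtype=torch.uint8, device="cpu")

x = allocate_activations((30, 1024, 1024, 1024))  # ~30 GiB, more than 24 GiB of VRAM
print(x.device)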
The total available GPU memory reported by torch also looks weird:
python3 -c 'import torch; print(torch.cuda.get_device_properties(0).total_memory)'
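For example, the following sketch (assuming the 24 GiB RTX 4090 from the nvidia-smi output above) compares what torch reports against the physical VRAM; torch.cuda.mem_get_info simply wraps cudaMemGetInfo:

import torch

props_total = torch.cuda.get_device_properties(0).total_memory
free, total = torch.cuda.mem_get_info(0)
print(f"get_device_properties total: {props_total / 2**30:.1f} GiB")
print(f"cudaMemGetInfo total:        {total / 2**30:.1f} GiB (free: {free / 2**30:.1f} GiB)")
# Both totals should be close to the 24564 MiB shown by nvidia-smi; a much larger
# figure would suggest shared system memory is being counted as device memory.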
Expected Behavior
Executing the following command throws an OOM exception
python3 -c 'import torch; x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)'
Actual Behavior
The command completes without raising an exception, which is not expected:
python3 -c 'import torch; x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)'
Diagnostic Logs
No response
@chengzeyi try running your command multiple times. In my case the OOM triggered on the 2nd run:
(phi-2-env) root@texas:/mnt/e/ai/phi-2# python
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)
cuda:0
>>> x.device
device(type='cuda', index=0)
>>> x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 GiB. GPU 0 has a total capacty of 15.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 30.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
nvidia-smi output:
Sun Jan 21 10:04:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.36                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               On  | 00000000:61:00.0  On |                  Off |
| 41%   33C    P8              11W / 140W |    515MiB / 16376MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        35      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
@elsaco Yeah, that's the bug: the first run should already trigger an OOM, not a later one. In your case 30 GiB x 2 is too large even for the fallback system memory, so it only fails on the second run.
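If you want to watch where each allocation lands, here is a rough sketch (assuming your 16 GiB A4000; the numbers reported while the fallback is active may look odd):

import torch

def report(tag):
    free, total = torch.cuda.mem_get_info(0)  # wraps cudaMemGetInfo
    print(f"{tag}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

report("before")
x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda")
report("after 1st 30 GiB")  # succeeds only because it spilled into system memory
try:
    y = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda")
    report("after 2nd 30 GiB")
except torch.cuda.OutOfMemoryError:
    print("2nd allocation hit OOM: 60 GiB exceeds VRAM plus the shared sysmem budget")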
This is what I'm using to get around this issue:
import torch
# Cap PyTorch's CUDA allocations at 100% of the device's reported memory so that
# oversized requests raise an OOM instead of silently spilling into sysmem.
torch.cuda.set_per_process_memory_fraction(1.0, 0)
Memory allocation now behaves like it does on bare-metal Linux, but on WSL2. Of course, this only works for PyTorch, but I'd rather have that than nothing at all. I hope the team works on this soon.
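For anyone who wants to sanity-check the workaround, here's a minimal sketch (device index 0 and an intentionally oversized 30 GiB allocation assumed):

import torch

# Same call as above: cap allocations at the device's reported memory.
torch.cuda.set_per_process_memory_fraction(1.0, 0)

try:
    x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda")
    print("allocation succeeded, so the sysmem fallback is still being used")
except torch.cuda.OutOfMemoryError:
    print("OOM raised as expected; an app can now pick a memory-efficient path")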
Just ran into this issue today and was briefly confused as to why my processing speed dropped like a stone when it looked like everything was fitting into VRAM and I knew I had sysmem fallback off.
Thanks for the workaround, but I too hope to see action on this.
It is still a problem to this day. I could use the workaround as well, but is there a way to fix it without that?
To the best of my knowledge, no.
File it with the other WSL2 issues I've been hoping for years to see a patch for, including better 9P file sharing speed with Windows.