WSL2 CUDA Does Not Respect `CUDA - Sysmem Fallback Policy`

Open chengzeyi opened this issue 1 year ago • 7 comments

Windows Version

Microsoft Windows [Version 10.0.22635.3061]

WSL Version

2.1.0.0

Are you using WSL 1 or WSL 2?

  • [X] WSL 2
  • [ ] WSL 1

Kernel Version

5.15.137.3-1

Distro Version

Ubuntu 22.04

Other Software

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.36                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   38C    P8              16W / 450W |   1528MiB / 24564MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        38      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+
torch                     2.3.0.dev20231227+cu121

Repro Steps

With a recent NVIDIA GPU driver installed, disable `CUDA - Sysmem Fallback Policy` globally in the Windows NVIDIA driver settings and restart the computer.

[screenshot: CUDA - Sysmem Fallback Policy disabled in the NVIDIA driver settings]

In a WSL terminal, execute the following command to allocate a tensor that exceeds the total available GPU memory (30 GiB required vs. 24 GiB available).

python3 -c 'import torch; x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)'

Instead of throwing an OOM exception, the command executes successfully.

[screenshot: the allocation completes and prints the tensor's device]

However, the expected behaviour is to throw an OOM exception so that applications like StableDiffusion WebUI can detect that the available GPU memory is insufficient and choose a memory-efficient way to do inference. The current behaviour silently falls back to the much slower system memory, which makes inference extremely slow.
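
For context, here is a minimal sketch of the detection pattern such applications rely on; the helper name and the fallback message are made up for illustration, and it assumes PyTorch 1.13+ where torch.cuda.OutOfMemoryError exists:

import torch

def allocate_activations(num_gib):
    # Try to grab num_gib GiB of GPU memory as a single tensor.
    return torch.ones((num_gib, 1024, 1024, 1024), dtype=torch.uint8, device="cuda")

try:
    x = allocate_activations(30)  # should raise on a 24 GiB card
except torch.cuda.OutOfMemoryError:
    print("not enough VRAM, switching to a memory-efficient inference path")
else:
    print("allocation succeeded on", x.device)

With sysmem fallback silently absorbing the allocation, the except branch is never taken.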

The total available GPU memory reported by torch also looks weird.

python3 -c 'import torch; print(torch.cuda.get_device_properties(0).total_memory)'

[screenshot: total GPU memory reported by torch]
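
A small sketch for cross-checking the figures torch is working with against the physical 24 GiB of the RTX 4090; whether an inflated number here means the fallback pool is being counted is an assumption, not something the driver documents:

import torch

gib = 1024 ** 3
props = torch.cuda.get_device_properties(0)
free_b, total_b = torch.cuda.mem_get_info(0)  # wraps cudaMemGetInfo

print(f"device               : {props.name}")
print(f"torch total_memory   : {props.total_memory / gib:.1f} GiB")
print(f"cudaMemGetInfo total : {total_b / gib:.1f} GiB")
print(f"cudaMemGetInfo free  : {free_b / gib:.1f} GiB")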

Expected Behavior

Executing the following command throws an OOM exception

python3 -c 'import torch; x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)'

Actual Behavior

The command executes without error, which is not expected.

python3 -c 'import torch; x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)'

Diagnostic Logs

No response

chengzeyi avatar Jan 20 '24 03:01 chengzeyi

@chengzeyi try running your command multiple times. In my case the OOM triggered on the 2nd run:

(phi-2-env) root@texas:/mnt/e/ai/phi-2# python
Python 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)
cuda:0
>>> x.device
device(type='cuda', index=0)
>>> x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda"); print(x.device)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 GiB. GPU 0 has a total capacty of 15.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 30.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

nvidia-smi output:

Sun Jan 21 10:04:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.36                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               On  | 00000000:61:00.0  On |                  Off |
| 41%   33C    P8              11W / 140W |    515MiB / 16376MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        35      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

elsaco avatar Jan 21 '24 18:01 elsaco

@elsaco Yeah, that's the bug: the first run should trigger an OOM, not a later one. In your case 30 GiB × 2 is too large even for the fallback sysmem, so it only fails on the second run.

chengzeyi avatar Jan 22 '24 13:01 chengzeyi

this is what I'm using to get around this issue

import torch
torch.cuda.set_per_process_memory_fraction(1.0, 0)

Memory allocation now behaves like it does on bare-metal Linux, but on WSL2. Of course, this only works for PyTorch, but I'd rather have it than nothing at all. I hope the team works on this soon.
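
A minimal sketch of how the workaround can be checked against the repro from this issue; it assumes that 1.0 of the total torch reports roughly matches the physical VRAM, which may not hold if the reported total is already inflated:

import torch

# Cap this process at 100% of the device memory torch reports for GPU 0,
# so allocations beyond that raise OOM instead of spilling into shared system memory.
torch.cuda.set_per_process_memory_fraction(1.0, 0)

try:
    x = torch.ones((30, 1024, 1024, 1024), dtype=torch.uint8, device="cuda")
except torch.cuda.OutOfMemoryError:
    print("OOM raised, as on bare-metal Linux")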

Pipyakas avatar Mar 18 '24 07:03 Pipyakas

Just ran into this issue today and was briefly confused about why my processing speed dropped like a stone when everything looked like it fit into VRAM and I knew I had sysmem fallback off.

Thanks for the workaround, but I too hope to see action on this.

strawberrymelonpanda avatar May 13 '24 10:05 strawberrymelonpanda

It is still a problem to this day. I could use the workaround as well, but is there a way to fix it without that?

endoedgar avatar May 15 '25 04:05 endoedgar

To the best of my knowledge, no.

File it with the other WSL2 issues I've hoped for years to see patched, including better 9P file speed with Windows.

strawberrymelonpanda avatar May 15 '25 13:05 strawberrymelonpanda