
[Bug]: When running on Linux on laptop with both an Nvidia and integrated AMD GPU, I can't get SD to use Nvidia GPU.

Open Krobix opened this issue 2 years ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

When running the command ./webui.sh --listen --port 5000 --medvram --no-half --device-id 0 (using --device-id 0 because there is only one CUDA device): it runs up until loading a model, at which point it hangs indefinitely, and the NVIDIA GPU is never accessed. I have tried prepending prime-run (the command typically used to run programs on the NVIDIA GPU), but as far as I can tell nothing changes. With CPU only, it ran just fine.

Steps to reproduce the problem

  1. Install SD Web UI on a Linux laptop with two GPUs, following the given instructions
  2. Download models
  3. Try to run it using various commands

What should have happened?

When I run the command, I should be able to load the SD model on my NVIDIA GPU instead of the integrated AMD GPU.

Commit where the problem happens

c12d7ddd725c485682c1caa025627c9ee936d743

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

./webui.sh --listen --port 5000 --medvram --no-half --device-id 0

Additional information, context and logs

I looked at dmesg to see if the NVIDIA GPU was even being accessed:

[ 7715.405625] audit: type=1104 audit(1674178721.569:149): pid=4150 uid=1000 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[ 7894.619787] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 7894.619793] amdgpu: Failed to evict process queues
[ 7894.619795] amdgpu: Failed to quiesce KFD
[ 7894.659328] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 7894.659331] amdgpu: Resetting wave fronts (cpsch) on dev 00000000d322993a
[ 7894.659337] amdgpu: Didn't find vmid for pasid 0x800a
[ 7902.316836] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 7902.316846] amdgpu: Failed to evict process queues
[ 7902.316848] amdgpu: Failed to evict queues of pasid 0x800a
[ 8079.187035] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8079.187040] amdgpu: Resetting wave fronts (cpsch) on dev 00000000d322993a
[ 8079.187049] amdgpu: Didn't find vmid for pasid 0x800a
[ 8177.501951] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8177.501958] amdgpu: Failed to evict process queues
[ 8177.501960] amdgpu: Failed to quiesce KFD
[ 8177.544779] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8177.544785] amdgpu: Resetting wave fronts (cpsch) on dev 00000000d322993a
[ 8177.544794] amdgpu: Didn't find vmid for pasid 0x800a
[ 8184.316016] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8184.316022] amdgpu: Failed to evict process queues
[ 8184.316023] amdgpu: Failed to evict queues of pasid 0x800a
[ 8243.716307] audit: type=1100 audit(1674179249.881:150): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:authentication grantors=pam_faillock,pam_permit,pam_faillock acct="brendon" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
[ 8243.716847] audit: type=1101 audit(1674179249.884:151): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="brendon" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
[ 8243.718506] audit: type=1110 audit(1674179249.884:152): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
[ 8243.718879] audit: type=1105 audit(1674179249.884:153): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'

I only found references to amdgpu, and none to the Nvidia GPU.

I am using proprietary NVIDIA drivers, version 525.78.01, and CUDA Version 12.0, on Arch Linux.

My discrete graphics card is an NVIDIA GTX 1650 Mobile.

Krobix avatar Jan 20 '23 02:01 Krobix

Did you run "export CUDA_VISIBLE_DEVICES" ?

https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on
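For reference, the linked suggestion amounts to exporting the variable before launching the script; a minimal sketch (the python3 check line is just an illustration, not part of webui.sh):

```shell
#!/bin/sh
# Restrict this process (and every child it spawns) to CUDA device 0.
export CUDA_VISIBLE_DEVICES=0

# Any child process now sees only device 0; a quick sanity check:
python3 -c 'import os; print(os.environ.get("CUDA_VISIBLE_DEVICES"))'
```

Note this only controls which CUDA devices are visible; it cannot help if the ROCm build of Torch was installed in the first place.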

TrongleTag avatar Jan 20 '23 15:01 TrongleTag

Yes, I tried that. It still errors:

  File "/home/brendon/stable-diffusion-webui/webui.py", line 75, in initialize
    modules.sd_models.load_model()
  File "/home/brendon/stable-diffusion-webui/modules/sd_models.py", line 385, in load_model
    load_model_weights(sd_model, checkpoint_info)
  File "/home/brendon/stable-diffusion-webui/modules/sd_models.py", line 276, in load_model_weights
    model.logvar = model.logvar.to(devices.device)  # fix for training
RuntimeError: HIP error: hipErrorInvalidDevice
HIP kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.

I assume this is because there's only one CUDA device (the integrated GPU isn't a CUDA device).
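The HIP error is a strong hint that the ROCm build of Torch was installed rather than the CUDA build. The local version suffix of the installed wheel (e.g. 1.13.1+cu117 vs 1.13.1+rocm5.2, exposed via torch.__version__) tells you which build pip actually fetched. A hypothetical helper to classify it, sketched without importing torch so it runs anywhere:

```python
def torch_build_flavor(version: str) -> str:
    """Classify a torch version string such as '1.13.1+cu117' or
    '1.13.1+rocm5.2' by its local version suffix (hypothetical helper)."""
    _, _, local = version.partition("+")
    if local.startswith("cu"):
        return "cuda"
    if local.startswith("rocm"):
        return "rocm"
    return "cpu"

# In a live environment you would pass torch.__version__:
#   import torch; print(torch_build_flavor(torch.__version__))
print(torch_build_flavor("1.13.1+rocm5.2"))  # prints "rocm"
```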

Krobix avatar Jan 20 '23 22:01 Krobix

Not this exact problem, but I have both an NVIDIA 3060 and an integrated AMD GPU, and the AMD card is what gets detected. If I run lspci 2>/dev/null | grep VGA | grep "AMD" from webui.sh I get 05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 1638 (rev c5), so the wrong dependencies were installed.

I manually checked and installed all dependencies from webui.sh and launch.py, copied them into venv/lib where necessary, and modified both scripts for Python compatibility (3.8.10 doesn't work; 3.10.9 works).

Also check compatibility at https://pytorch.org/get-started/locally/: the last supported CUDA version is 11.7, with driver 515.

dzhankhaev avatar Jan 21 '23 10:01 dzhankhaev

The last supported CUDA version was 11.7? So does CUDA 12.0 not work?

Krobix avatar Jan 22 '23 02:01 Krobix

I also have the same problem and get almost the same output as @dzhankhaev when I run lspci 2>/dev/null | grep VGA | grep "AMD" (it's a bit different in my case: my output is 04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir (rev c6)). If I don't change anything, it also installs the torch version for AMD GPUs instead of the one for NVIDIA GPUs.

I was, however, able to bypass these problems with some dirty edits to webui.sh: changing the "Renoir" in line 111 and the "AMD" in line 119 to something else, so the webui script doesn't detect these things.

I have an NVIDIA RTX 2060 GPU and also an integrated AMD-GPU.

ChaoticHuman avatar Jan 26 '23 02:01 ChaoticHuman

This is caused by the undocumented webui.sh script detecting the presence of an AMD GPU and installing the ROCm build of Torch instead of the CUDA build, because it points pip at the wrong Torch repository.

The following patch fixes this. You will need to reinstall Torch (rm -rf venv):

diff --git a/webui.sh b/webui.sh
index 8cdad22..b8ea66c 100755
--- a/webui.sh
+++ b/webui.sh
@@ -118,7 +118,8 @@ case "$gpu_info" in
 esac
 if echo "$gpu_info" | grep -q "AMD" && [[ -z "${TORCH_COMMAND}" ]]
 then
-    export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2"
+    #export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2"
+    export TORCH_COMMAND="pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117"
 fi  
 
 for preq in "${GIT}" "${python_cmd}"
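Note the [[ -z "${TORCH_COMMAND}" ]] guard in the hunk above: the ROCm override only fires when TORCH_COMMAND is unset, so exporting it before launch also side-steps the detection, without patching the script. A sketch of that guard logic, with gpu_info hardcoded to a sample lspci line for illustration:

```shell
#!/bin/sh
# Sample value standing in for webui.sh's GPU detection (placeholder).
gpu_info="04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir (rev c6)"

# Pre-set TORCH_COMMAND to the CUDA wheels before the detection runs.
TORCH_COMMAND="pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117"
export TORCH_COMMAND

# The detection block from webui.sh: it only overrides an EMPTY TORCH_COMMAND.
if echo "$gpu_info" | grep -q "AMD" && [ -z "${TORCH_COMMAND}" ]
then
    TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2"
fi
echo "$TORCH_COMMAND"
```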

There are two things that ought to be done to fix this:

  1. Properly document what the 7 different launch scripts in the repository root directory are each for and what they do, or better, consolidate them. Even a consolidated script needs documentation of what it will do on first launch, since that initial setup could turn out to be wrong later.
  2. Don't assume that the presence of an AMD GPU means it is the only GPU, or the only GPU that will ever be present. eGPUs exist, and people can change their hardware. Detect which GPUs are present during setup and do something sensible. If different GPUs are present later than were detected before, display a message offering to install new Torch packages, if that's possible for those devices.
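Point 2 could be sketched as a preference order rather than first-match-wins. Here detect_torch_repo is a hypothetical helper and the lspci lines are canned samples, not real detection:

```shell
#!/bin/sh
# Hypothetical helper: choose a Torch wheel index, preferring NVIDIA over
# AMD when both GPUs appear in the lspci output, instead of first-match-wins.
detect_torch_repo() {
    gpus="$1"
    if echo "$gpus" | grep -q "NVIDIA"; then
        echo "https://download.pytorch.org/whl/cu117"
    elif echo "$gpus" | grep -q "AMD"; then
        echo "https://download.pytorch.org/whl/rocm5.2"
    else
        echo ""  # fall back to the default wheels
    fi
}

# Canned sample resembling 'lspci | grep VGA' on a hybrid laptop.
sample="04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir (rev c6)
01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile]"
detect_torch_repo "$sample"
```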

fish-face avatar Feb 02 '23 14:02 fish-face

@fish-face got it spot-on!

If you have multiple GPUs and want to use the NVIDIA GPU, you can just comment out the entire if block above, and it will use the default TORCH_COMMAND from launch.py.

andrempiva avatar Mar 05 '23 17:03 andrempiva

Same here: an NVIDIA GPU on an AMD-based system that has an integrated AMD GPU, like all(?) recent AMD CPUs.

If this detection should stay (instead of just… notifying the user and expecting an explicit command-line argument to switch the behavior, or something?), maybe it should check for "no NVIDIA && AMD present" instead of just "AMD present"?

Or switch to some smarter way of detecting AMD GPUs? E.g. something like glxinfo | grep "OpenGL vendor string"? Or maybe look for /dev/nvidia* (no idea what AMD uses there)?
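The /dev/nvidia* idea can be sketched as a small function. The proprietary NVIDIA driver creates device nodes such as /dev/nvidia0 and /dev/nvidiactl; the helper below is hypothetical and parameterized on a directory so it can be exercised without a GPU:

```shell
#!/bin/sh
# Hypothetical check: the proprietary NVIDIA driver creates /dev/nvidia0,
# /dev/nvidiactl, etc.; their presence is a cheap signal the driver is loaded.
has_nvidia_dev() {
    dir="${1:-/dev}"
    set -- "$dir"/nvidia*
    [ -e "$1" ]   # true only if the glob matched at least one node
}

if has_nvidia_dev /dev; then
    echo "NVIDIA device nodes present"
else
    echo "no NVIDIA device nodes"
fi
```

This only tells you the driver is loaded, not which Torch build is appropriate, so it would complement rather than replace the lspci check.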

mormegil-cz avatar May 04 '23 15:05 mormegil-cz

It's really unfortunate that this is still broken even after a patch was proposed. @AUTOMATIC1111

bghira avatar May 23 '23 04:05 bghira