stable-diffusion-webui
[Bug]: When running on Linux on laptop with both an Nvidia and integrated AMD GPU, I can't get SD to use Nvidia GPU.
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
When running the command:
./webui.sh --listen --port 5000 --medvram --no-half --device-id 0 (using --device-id 0 here because there is only one CUDA device):
This runs up until loading a model, at which point it hangs indefinitely, and the NVIDIA GPU is never accessed.
I have tried adding prime-run (the command which would typically be used to run some programs on NVIDIA GPU) to the beginning of the command, but as far as I can tell, nothing changes.
When I used CPU only, it ran just fine.
Steps to reproduce the problem
- Install SD Web UI on a Linux laptop with two GPUs, following the given instructions
- Download Models
- Try to run it using various commands
What should have happened?
When I run the command, I should be able to load the SD model on my NVIDIA GPU instead of the integrated AMD GPU.
Commit where the problem happens
c12d7ddd725c485682c1caa025627c9ee936d743
What platforms do you use to access UI ?
Linux
What browsers do you use to access the UI ?
Mozilla Firefox
Command Line Arguments
./webui.sh --listen --port 5000 --medvram --no-half --device-id 0
Additional information, context and logs
I looked at dmesg to see if NVIDIA GPU was even being accessed:
[ 7715.405625] audit: type=1104 audit(1674178721.569:149): pid=4150 uid=1000 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
[ 7894.619787] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 7894.619793] amdgpu: Failed to evict process queues
[ 7894.619795] amdgpu: Failed to quiesce KFD
[ 7894.659328] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 7894.659331] amdgpu: Resetting wave fronts (cpsch) on dev 00000000d322993a
[ 7894.659337] amdgpu: Didn't find vmid for pasid 0x800a
[ 7902.316836] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 7902.316846] amdgpu: Failed to evict process queues
[ 7902.316848] amdgpu: Failed to evict queues of pasid 0x800a
[ 8079.187035] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8079.187040] amdgpu: Resetting wave fronts (cpsch) on dev 00000000d322993a
[ 8079.187049] amdgpu: Didn't find vmid for pasid 0x800a
[ 8177.501951] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8177.501958] amdgpu: Failed to evict process queues
[ 8177.501960] amdgpu: Failed to quiesce KFD
[ 8177.544779] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8177.544785] amdgpu: Resetting wave fronts (cpsch) on dev 00000000d322993a
[ 8177.544794] amdgpu: Didn't find vmid for pasid 0x800a
[ 8184.316016] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 8184.316022] amdgpu: Failed to evict process queues
[ 8184.316023] amdgpu: Failed to evict queues of pasid 0x800a
[ 8243.716307] audit: type=1100 audit(1674179249.881:150): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:authentication grantors=pam_faillock,pam_permit,pam_faillock acct="brendon" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
[ 8243.716847] audit: type=1101 audit(1674179249.884:151): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="brendon" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
[ 8243.718506] audit: type=1110 audit(1674179249.884:152): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
[ 8243.718879] audit: type=1105 audit(1674179249.884:153): pid=12740 uid=1000 auid=1000 ses=1 msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
I only found references to amdgpu, and none to the Nvidia GPU.
I am using proprietary NVIDIA drivers, version 525.78.01, and CUDA Version 12.0, on Arch Linux.
My Discrete Graphics Card is An NVIDIA GTX 1650 Mobile.
Did you run "export CUDA_VISIBLE_DEVICES" ?
https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on
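For reference, the linked suggestion amounts to something like the following (device index 0 is an assumption here, since it is the only CUDA device on this machine):

```shell
# Restrict which devices the CUDA runtime exposes before launching:
export CUDA_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
# then: ./webui.sh --listen --port 5000 --medvram --no-half
```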
Yes, I tried that. It still errors:
File "/home/brendon/stable-diffusion-webui/webui.py", line 75, in initialize
modules.sd_models.load_model()
File "/home/brendon/stable-diffusion-webui/modules/sd_models.py", line 385, in load_model
load_model_weights(sd_model, checkpoint_info)
File "/home/brendon/stable-diffusion-webui/modules/sd_models.py", line 276, in load_model_weights
model.logvar = model.logvar.to(devices.device) # fix for training
RuntimeError: HIP error: hipErrorInvalidDevice
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
I assume this is because there's only 1 CUDA device (as the integrated GPU isn't a CUDA device)
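A HIP error on an NVIDIA-only setup usually means the ROCm build of torch got installed rather than the CUDA build. One way to confirm is the wheel's local version suffix; the version string below is a sample value, and on a real install you would read it with venv/bin/python -c 'import torch; print(torch.__version__)':

```shell
# Sample value; a ROCm wheel reports e.g. "1.13.1+rocm5.2", a CUDA wheel
# "1.13.1+cu117", and a CPU-only wheel has no such suffix.
ver="1.13.1+rocm5.2"
case "$ver" in
  *+rocm*) echo "ROCm build installed (HIP errors expected on NVIDIA)" ;;
  *+cu*)   echo "CUDA build installed" ;;
  *)       echo "CPU-only build installed" ;;
esac
```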
Not this exact problem, but I have both an NVIDIA 3060 and an integrated AMD GPU, and the AMD video card is the one detected.
If I run the check from webui.sh:
lspci 2>/dev/null | grep VGA | grep "AMD"
I get:
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 1638 (rev c5)
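The detection can be reproduced outside the script. This is a sketch: the lspci lines are sample output for a hybrid laptop (device names assumed), and the grep pattern mirrors the one in webui.sh:

```shell
# Two adapters present: NVIDIA dGPU plus AMD iGPU (sample lspci output).
gpu_info='01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile]
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 1638 (rev c5)'

# webui.sh's AMD check matches even though an NVIDIA GPU is also present,
# so the ROCm TORCH_COMMAND gets selected:
if echo "$gpu_info" | grep -q "AMD"; then
    echo "AMD branch taken"
fi
```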
The wrong dependencies were installed. I manually checked and installed all dependencies from webui.sh and launch.py, copied them into venv/lib where necessary, and modified both scripts for Python compatibility (3.8.10 doesn't work; 3.10.9 works).
Also check compatibility at https://pytorch.org/get-started/locally/: the last supported CUDA version is 11.7, with driver 515.
The last supported CUDA version was 11.7? So does CUDA 12.0 not work?
I also have the same problem and get almost the same output as @dzhankhaev when I run lspci 2>/dev/null | grep VGA | grep "AMD" (it's slightly different in my case: 04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir (rev c6)). Without any changes, it also installs the torch version for AMD GPUs instead of the one for NVIDIA GPUs.
I was, however, able to bypass these problems with some dirty edits to webui.sh: changing the "Renoir" in line 111 and the "AMD" in line 119 to something else, so the script doesn't detect these things.
I have an NVIDIA RTX 2060 GPU and also an integrated AMD-GPU.
This is caused by the undocumented webui.sh script detecting the presence of an AMD GPU and installing the wrong build of Torch from the ROCm repository.
The following patch fixes this. You will need to reinstall Torch (rm -rf venv):
diff --git a/webui.sh b/webui.sh
index 8cdad22..b8ea66c 100755
--- a/webui.sh
+++ b/webui.sh
@@ -118,7 +118,8 @@ case "$gpu_info" in
esac
if echo "$gpu_info" | grep -q "AMD" && [[ -z "${TORCH_COMMAND}" ]]
then
- export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2"
+ #export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2"
+ export TORCH_COMMAND="pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117"
fi
for preq in "${GIT}" "${python_cmd}"
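Since the script only sets the ROCm command when TORCH_COMMAND is empty (the -z guard visible in the context above), exporting it beforehand should have the same effect without patching webui.sh. A sketch; the version pins just mirror the patch:

```shell
# Pre-set TORCH_COMMAND so webui.sh's AMD branch leaves it alone
# (the script only assigns it when the variable is empty):
export TORCH_COMMAND="pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117"

# Reproduce the guard: the user's value survives.
if [ -z "${TORCH_COMMAND}" ]; then
    echo "script default would be used"
else
    echo "user TORCH_COMMAND kept"
fi
# then: rm -rf venv && ./webui.sh --listen --port 5000 --medvram --no-half
```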
There are two things that ought to be done to fix this:
- Properly document what each of the 7 launch scripts in the repository root is for and what it does, or better, consolidate them. Even a consolidated script needs documentation explaining what it will do on first launch, since that initial setup can turn out to be wrong later.
- Don't assume that the presence of an AMD GPU means it is the only GPU, or the only GPU that will ever be present: eGPUs exist, and people change their hardware. Detect which GPUs are present during setup and do something sensible. If the GPUs present later differ from what was detected before, display a message offering to install new Torch packages, if that's possible for those devices.
@fish-face got it spot-on!
If you have multiple GPUs and want to use the NVIDIA GPU, you can just comment out the entire if block above, and it will use the default TORCH_COMMAND from launch.py.
Same here: an NVIDIA card in an AMD-based system with an integrated AMD GPU, like all(?) recent AMD CPUs.
If this should stay (instead of just… notifying the user and expecting an explicit command-line attribute to switch the behavior or something?), maybe it should check “no NVIDIA && AMD present” instead of just “AMD present”?
Or switch to some smarter way to detect AMD GPUs? E.g. something like glxinfo | grep "OpenGL vendor string"? Or maybe look for some /dev/nvidia* (no idea what AMD uses there)?
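A sketch of the "no NVIDIA && AMD present" idea above (untested against webui.sh itself; the lspci lines are assumed sample output for a hybrid RTX 2060 + Renoir machine):

```shell
# Sample lspci output with both vendors present.
gpu_info='01:00.0 VGA compatible controller: NVIDIA Corporation TU106M [GeForce RTX 2060 Mobile]
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir (rev c6)'

# Only fall back to ROCm when no NVIDIA adapter is found:
if echo "$gpu_info" | grep -q "NVIDIA"; then
    echo "CUDA wheels"
elif echo "$gpu_info" | grep -q "AMD"; then
    echo "ROCm wheels"
fi
```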
It's really unfortunate that this is still broken even after a patch was proposed @AUTOMATIC1111