
[Feature Request]: ZLUDA support

Open zcq06 opened this issue 4 months ago • 64 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What would your feature do?

SD.Next has preliminary support for using ZLUDA on AMD GPUs, and the performance is very high, roughly three times that of ROCm. Could the approach used to adapt SD.Next be applied here for a preliminary adaptation now?

Proposed workflow

  • https://github.com/lshqqytiger/ZLUDA
  • https://github.com/vladmandic/automatic/wiki/ZLUDA
  • https://github.com/vladmandic/automatic/commit/cc4438651f69973e055b63071bf1d0e3cc558eea
  • https://github.com/vosen/ZLUDA

Additional information

No response

zcq06 avatar Feb 17 '24 15:02 zcq06

This would be great, as SD.Next doesn't support using ControlNet in txt2img when using ZLUDA and the diffusers backend.

mrbudgen avatar Feb 19 '24 12:02 mrbudgen

This would be nice. I was able to make it somewhat work with SD.Next, but I really don't like that GUI; I just can't use it effectively. I don't need it here desperately, but I would really love to be able to compare AMD and Nvidia GPUs using the exact same workflow in a usable UI.

Edit: also, some features I need for my workflow didn't work in SD.Next for me. At least for now, img2img didn't work.

Also, I would appreciate not having to use multiple forks.

Coremar2 avatar Feb 20 '24 14:02 Coremar2

It's kinda easy; you can just snip in

torch.backends.cudnn.enabled = False                 # ZLUDA does not support cuDNN yet
torch.backends.cuda.enable_flash_sdp(False)          # flash attention is unsupported
torch.backends.cuda.enable_math_sdp(True)            # fall back to the plain math SDP kernel
torch.backends.cuda.enable_mem_efficient_sdp(False)  # memory-efficient attention is unsupported

somewhere around https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L57

wfjsw avatar Feb 21 '24 03:02 wfjsw

It's kinda easy; you can just snip in

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

somewhere around https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L57

You are right, following the ZLUDA installation guide by SD.Next and using your solution for disabling cuDNN makes it work. It is really rather simple to make it work with mainline A1111. Much simpler than ONNX and Olive anyway, LOL. Thanks for your answer. Now I only need to figure out why it randomly causes only around 50-60% GPU load. (It went at about 4 it/s for around half of the generation and jumped to 9 it/s for the rest, etc., but it looked like it was caused by Windows or the driver, so probably not related to SD; will investigate further.)

Coremar2 avatar Feb 21 '24 12:02 Coremar2

It's kinda easy; you can just snip in

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

somewhere around https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L57

You are right, following the ZLUDA installation guide by SD.Next and using your solution for disabling cuDNN makes it work. It is really rather simple to make it work with mainline A1111. Much simpler than ONNX and Olive anyway, LOL. Thanks for your answer. Now I only need to figure out why it randomly causes only around 50-60% GPU load. (It went at about 4 it/s for around half of the generation and jumped to 9 it/s for the rest, etc., but it looked like it was caused by Windows or the driver, so probably not related to SD; will investigate further.)

Can you explain a little how you made it work? I'm trying with a new install of A1111 with no luck. Thanks!

celulari avatar Feb 21 '24 18:02 celulari

You are right, following the ZLUDA installation guide by SD.Next and using your solution for disabling cuDNN makes it work. It is really rather simple to make it work with mainline A1111. Much simpler than ONNX and Olive anyway, LOL. Thanks for your answer. Now I only need to figure out why it randomly causes only around 50-60% GPU load. (It went at about 4 it/s for around half of the generation and jumped to 9 it/s for the rest, etc., but it looked like it was caused by Windows or the driver, so probably not related to SD; will investigate further.)

Can you explain a little how you made it work? I'm trying with a new install of A1111 with no luck. Thanks!

I mostly followed this guide: https://github.com/vladmandic/automatic/wiki/ZLUDA. It is for SD.Next, so you can either use SD.Next or skip these parts:

"Install or checkout dev" - I installed mainline automatic1111 instead (don't forget to start it at least once)

"Install CUDA Torch" - should already be present

"Compilation, Settings, and First Generation" - you first need to disable cuDNN (it is not yet supported) by adding those lines from wfjsw to the file mentioned above. After that, just start A1111 and generate. (It might take longer to start generating the first time.)

I hope I explained it well enough. Sorry, I'm writing on my phone now and running out of time to write :D

Coremar2 avatar Feb 21 '24 20:02 Coremar2

It's kinda easy; you can just snip in

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

somewhere around https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L57

I'm trying to add those lines to initialize.py, but I get the following error: NameError: name 'torch' is not defined

I've tried adding them to different parts of the file with no success.

celulari avatar Feb 21 '24 23:02 celulari

You can add it below https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L14 instead.
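
For reference, a minimal sketch of how the top of modules/initialize.py ends up looking; the four lines just need to come after the import torch line, and exact line numbers vary between versions:

import torch  # already near the top of the file; the snippet must come after this

# ZLUDA compatibility: disable the CUDA backends that ZLUDA does not implement
torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)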

wfjsw avatar Feb 21 '24 23:02 wfjsw

You can add it below https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L14 instead.

Thanks! That made it work! Not as fast as SD.Next with the diffusers backend, but much more consistent generation.

celulari avatar Feb 21 '24 23:02 celulari

btw you should use --opt-sdp-attention if you are not using it
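
For example, in webui-user.bat (assuming the stock launcher; keep any other flags you already use):

set COMMANDLINE_ARGS=--opt-sdp-attention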

wfjsw avatar Feb 21 '24 23:02 wfjsw

btw you should use --opt-sdp-attention if you are not using it

It's working faster now, thanks.

celulari avatar Feb 22 '24 03:02 celulari

It's kinda easy; you can just snip in

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

somewhere around https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L57

I'm trying to add those lines to initialize.py, but I get the following error: NameError: name 'torch' is not defined

I've tried adding them to different parts of the file with no success.

Hey, I managed to get it working on my RX 7800 XT with SD.Next without needing to add any of those lines of code. I just set the environment variable DISABLE_ADDMM_CUDA_LT=1 in the .bat file I use to launch it, as per the recommendations of the ZLUDA dev for PyTorch compatibility: https://github.com/lshqqytiger/ZLUDA?tab=readme-ov-file#pytorch
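
That is, add this near the top of the .bat file you launch it with, before the line that actually starts the UI (a sketch; the file name and any other settings depend on your setup):

set DISABLE_ADDMM_CUDA_LT=1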

james-banks avatar Feb 23 '24 01:02 james-banks

Hi and thanks everybody,

You can add it below https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/initialize.py#L14 instead.

I'm stuck here. I can start SD, but every time I try to generate something I get this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I'm not sure if it's related or not. SD.Next is working fine with ZLUDA, but I need the automatic1111 webui for some reason.

MooDaCow avatar Feb 28 '24 14:02 MooDaCow

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

That means the code is not correctly applied.

wfjsw avatar Mar 01 '24 06:03 wfjsw

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

williammc2 avatar Mar 01 '24 14:03 williammc2

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

Remove the --use-directml
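
With that removed, the line would read, e.g.:

set COMMANDLINE_ARGS=--opt-sdp-attention --no-half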

celulari avatar Mar 01 '24 15:03 celulari

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

Remove the --use-directml

Ty @celulari. Now I'm getting this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I put the code at line 14, below import torch, in the initialize.py file:

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

My ZLUDA is OK, I can access it in cmd, the PATH is OK.

williammc2 avatar Mar 01 '24 15:03 williammc2

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

Remove the --use-directml

Ty @celulari. Now I'm getting this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I put the code at line 14, below import torch, in the initialize.py file:

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

My ZLUDA is OK, I can access it in cmd, the PATH is OK.

Have you tried my suggestion from a few comments above? https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/14946#issuecomment-1960602242

james-banks avatar Mar 01 '24 16:03 james-banks

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

Remove the --use-directml

Ty @celulari. Now I'm getting this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I put the code at line 14, below import torch, in the initialize.py file:

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

My ZLUDA is OK, I can access it in cmd, the PATH is OK.

Have you tried my suggestion from a few comments above? #14946 (comment)

Yes, same error. I'm using a new installation.

williammc2 avatar Mar 01 '24 16:03 williammc2

My initialize.py looks like this:

[screenshot of initialize.py]

I'm using an RX 6800.

celulari avatar Mar 01 '24 17:03 celulari

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

Remove the --use-directml

Ty @celulari. Now I'm getting this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I put the code at line 14, below import torch, in the initialize.py file:

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

My ZLUDA is OK, I can access it in cmd, the PATH is OK.

What's the output of:

.\venv\Scripts\activate
python
import torch
ten1 = torch.randn((2, 4,), device="cuda")  # random 2x4 matrix on the GPU
ten2 = torch.randn((4, 8,), device="cuda")  # random 4x8 matrix on the GPU
torch.mm(ten1, ten2)  # matrix multiply; this exercises the cuBLAS path ZLUDA has to provide
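
If ZLUDA is wired up correctly, this should print something like tensor([[...]], device='cuda:0') (a 2x8 tensor; the exact values vary since the inputs are random). Getting the CUBLAS_STATUS_NOT_SUPPORTED error from this bare matmul would suggest the cuBLAS DLL replacement hasn't taken effect.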

celulari avatar Mar 01 '24 17:03 celulari

I am on SD.Next currently and having the same exact problem with CUBLAS (as that is the topic where this problem came up). The arguments set DISABLE_ADDMM_CUDA_LT=1 and --use-zluda are both being passed, and I ensured that the HIP device is visible via AMD_LOG_LEVEL=3 (it also fully compiled the model before throwing this error in the first place, based on zluda.db).

What's the output of:

The error is exactly this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

DimkaTsv avatar Mar 01 '24 19:03 DimkaTsv

Should my command line be set COMMANDLINE_ARGS=--opt-sdp-attention --no-half --use-directml? SD is not using the GPU, only the CPU...

Remove the --use-directml

Ty @celulari. Now I'm getting this error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I put the code at line 14, below import torch, in the initialize.py file:

torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)

My ZLUDA is OK, I can access it in cmd, the PATH is OK.

What's the output of:

.\venv\Scripts\activate
python
import torch
ten1 = torch.randn((2, 4,), device="cuda")
ten2 = torch.randn((4, 8,), device="cuda")
torch.mm(ten1, ten2)
[screenshot of the command output]

williammc2 avatar Mar 01 '24 20:03 williammc2

Did you correctly replace the cublas dll?

wfjsw avatar Mar 01 '24 20:03 wfjsw

Did you correctly replace the cublas dll?

I assume not (not that the ZLUDA guide mentioned this specific step). I haven't replaced anything yet, TBH. I wanted to, but couldn't find WHERE specifically it should be placed/replaced. It should be in the PyTorch directory, shouldn't it? Where specifically is that PyTorch runtime? In .../venv/Lib/site-packages I only see pytorch_lightning and pytorch_lightning-1.9.4.dist-info

...

Oh, I see... It is literally torch, not pytorch. There are two libraries, cublasLt64_11.dll (531.3 MB) and cublas64_11.dll (86.6 MB)... I don't think these are the ones I need to replace, are they? The ZLUDA cublas.dll is only 166 KB.

DimkaTsv avatar Mar 01 '24 20:03 DimkaTsv

Did you correctly replace the cublas dll?

Yes, but I'm getting this error:

OSError: [WinError 126] The specified module could not be found. Error loading "D:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\lib\cublas64_11.dll" or one of its dependencies.

williammc2 avatar Mar 01 '24 21:03 williammc2

Did you correctly replace the cublas dll?

I assume not (not that the ZLUDA guide mentioned this specific step). I haven't replaced anything yet, TBH. I wanted to, but couldn't find WHERE specifically it should be placed/replaced. It should be in the PyTorch directory, shouldn't it? Where specifically is that PyTorch runtime? In .../venv/Lib/site-packages I only see pytorch_lightning and pytorch_lightning-1.9.4.dist-info

...

Oh, I see... It is literally torch, not pytorch. There are two libraries, cublasLt64_11.dll (531.3 MB) and cublas64_11.dll (86.6 MB)... I don't think these are the ones I need to replace, are they? The ZLUDA cublas.dll is only 166 KB.

The latest version of the guide doesn't give the explicit steps because the SD.Next installer will do it for you with --use-zluda.

These are the steps:

  1. Ensure you are using torch 2.2.0+cu118:
.\venv\Scripts\activate
pip uninstall torch torchvision torch-directml -y
pip install torch==2.2.0 torchvision --index-url https://download.pytorch.org/whl/cu118
  2. Rename these 3 .dll files from the ZLUDA .zip archive:
  • cublas.dll -> cublas64_11.dll
  • cusparse.dll -> cusparse64_11.dll
  • nvrtc.dll -> nvrtc64_112_0.dll
  3. Replace all of these in the aforementioned sd_install_folder\venv\lib\site-packages\torch\lib folder, and you should be good to go.
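
If you prefer to script steps 2 and 3, here is a minimal Python sketch; the ZLUDA extraction path C:\zluda and the webui install path are just example assumptions:

import shutil
from pathlib import Path

zluda = Path(r"C:\zluda")  # example: where the ZLUDA zip was extracted
torch_lib = Path(r"D:\stable-diffusion\stable-diffusion-webui\venv\Lib\site-packages\torch\lib")  # example install path

# copy each ZLUDA dll over the CUDA 11 dll that torch tries to load
for src, dst in [("cublas.dll", "cublas64_11.dll"),
                 ("cusparse.dll", "cusparse64_11.dll"),
                 ("nvrtc.dll", "nvrtc64_112_0.dll")]:
    shutil.copyfile(zluda / src, torch_lib / dst)
    print(f"replaced {dst}")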

james-banks avatar Mar 01 '24 21:03 james-banks

Did you correctly replace the cublas dll?

I assume not (not that the ZLUDA guide mentioned this specific step). I haven't replaced anything yet, TBH. I wanted to, but couldn't find WHERE specifically it should be placed/replaced. It should be in the PyTorch directory, shouldn't it? Where specifically is that PyTorch runtime? In .../venv/Lib/site-packages I only see pytorch_lightning and pytorch_lightning-1.9.4.dist-info ... Oh, I see... It is literally torch, not pytorch. There are two libraries, cublasLt64_11.dll (531.3 MB) and cublas64_11.dll (86.6 MB)... I don't think these are the ones I need to replace, are they? The ZLUDA cublas.dll is only 166 KB.

The latest version of the guide doesn't give the explicit steps because the SD.Next installer will do it for you with --use-zluda.

These are the steps:

  1. Ensure you are using torch 2.2.0+cu118:
.\venv\Scripts\activate
pip uninstall torch torchvision torch-directml -y
pip install torch==2.2.0 torchvision --index-url https://download.pytorch.org/whl/cu118
  2. Rename these 3 .dll files from the ZLUDA .zip archive:
  • cublas.dll -> cublas64_11.dll
  • cusparse.dll -> cusparse64_11.dll
  • nvrtc.dll -> nvrtc64_112_0.dll
  3. Replace all of these in the aforementioned sd_install_folder\venv\lib\site-packages\torch\lib folder, and you should be good to go.

Doing exactly these steps...

OSError: [WinError 126] The specified module could not be found. Error loading "D:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\lib\caffe2_nvrtc.dll" or one of its dependencies.

williammc2 avatar Mar 01 '24 21:03 williammc2

Doing exactly these steps...

OSError: [WinError 126] The specified module could not be found. Error loading "D:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\lib\caffe2_nvrtc.dll" or one of its dependencies.

Are you using this version of ZLUDA? https://github.com/lshqqytiger/ZLUDA

Also, if you are and it still isn't working, try running python -m torch.utils.collect_env after activating the venv as per my previous comment and paste the output here.

james-banks avatar Mar 01 '24 22:03 james-banks

The latest version of the guide doesn't give the explicit steps because the SD.Next installer will do it for you with --use-zluda

Apparently not. Unless it does a symlink through the /.zluda folder. I already do that, and tried the --reinstall flag as well.

Are you using this version of ZLUDA? https://github.com/lshqqytiger/ZLUDA

Yes

Also, if you are and it still isn't working, try running python -m torch.utils.collect_env after activating the venv as per my previous comment and paste the output here.

Collecting environment information...
PyTorch version: 2.2.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=3401
DeviceID=CPU0
Family=107
L2CacheSize=4096
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3401
Name=AMD Ryzen 7 5800X3D 8-Core Processor
ProcessorType=3
Revision=8450

Versions of relevant libraries:
[pip3] dctorch==0.1.2
[pip3] numpy==1.26.4
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.17.1
[pip3] onnxruntime-directml==1.17.1
[pip3] open-clip-torch==2.24.0
[pip3] pytorch-lightning==1.9.4
[pip3] torch==2.2.0+cu118
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==1.3.1
[pip3] torchsde==0.2.6
[pip3] torchvision==0.17.0+cu118
[conda] Could not collect

And HIP does work for me; I even have the HIP 5.7 SDK installed.

DimkaTsv avatar Mar 01 '24 22:03 DimkaTsv