
cuDNN/hipBLAS in Auto1111 AMD fork usable?

Open • CS1o opened this issue 9 months ago • 6 comments

Is ZLUDA with cuDNN and hipBLAS usable in stable-diffusion-webui-amdgpu? If so, does it have any advantage? For me the speed was the same and I ran into some issues. Do I need to enable a specific setting or launch arg? I also wanted to try out Flash Attention, but I don't know if Auto1111 supports it.

The preparation:

- AMD RX 7900 XTX
- AMD Adrenalin 25.3.1
- HIP SDK 6.2.4 for Windows 10
- ZLUDA 3.9.0 nightly build for ROCm 6

What I have done:

- Made a fresh install of stable-diffusion-webui-amdgpu.
- Downloaded the HIP SDK extension and dropped/replaced it into ROCm\6.2\.

- Downloaded hipblaslt-rocmlibs-for-gfx1100-gfx1101-gfx1102-gfx1103-gfx1150-for.hip6.2.7z. I had to create the hipblaslt folder in ROCm\6.2\bin\ and dropped the library folder in there, so the Tensile files are in ROCm\6.2\bin\hipblaslt\library (I don't know if that's correct or not; see the quick check below).
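A minimal sketch to sanity-check that layout, assuming the default HIP SDK install path and that hipblaslt.dll sits directly in 6.2\bin (as the reply below notes):

```python
# Quick layout check for the hipBLASLt files described above.
# Assumption: HIP SDK 6.2 installed to the default location and the
# Tensile data files extracted to bin\hipblaslt\library.
from pathlib import Path

rocm_bin = Path(r"C:\Program Files\AMD\ROCm\6.2\bin")

checks = {
    "hipblaslt.dll in bin": rocm_bin / "hipblaslt.dll",
    "Tensile library folder": rocm_bin / "hipblaslt" / "library",
}
for name, path in checks.items():
    status = "OK" if path.exists() else "MISSING"
    print(f"{name}: {status} ({path})")
```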

Launch args: --use-zluda --update-check --skip-ort --update-all-extensions --models-dir "D:\Programme\AI-Zeug\stable-diffusion-webui-directml\models"

Added these two lines to webui-user.bat:

set ZLUDA_NIGHTLY=1
set DISABLE_ADDMM_CUDA_LT=1

Replaced cublas64_11.dll, cusparse64_11.dll, cublasLt64_11.dll, cudnn64_9.dll, cudart64_110.dll, and nvrtc64_112_0.dll in venv\lib\site-packages\torch\lib with the renamed ZLUDA files.
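For reference, a sketch of that replacement step. The source file names inside the ZLUDA package (cublas.dll, cusparse.dll, and so on) are an assumption inferred from the renamed targets listed above; adjust the mapping to match your ZLUDA build:

```python
# Sketch of copying ZLUDA DLLs over the CUDA DLLs in torch\lib.
# Assumption: the ZLUDA build ships cublas.dll, cusparse.dll, etc.,
# which are renamed to the CUDA names PyTorch loads at import time.
import shutil
from pathlib import Path

zluda_dir = Path(r".zluda")  # ZLUDA extraction folder (assumed)
torch_lib = Path(r"venv\Lib\site-packages\torch\lib")

mapping = {
    "cublas.dll": "cublas64_11.dll",
    "cusparse.dll": "cusparse64_11.dll",
    "cublasLt.dll": "cublasLt64_11.dll",
    "cudnn.dll": "cudnn64_9.dll",
    "nvrtc.dll": "nvrtc64_112_0.dll",
    # cudart64_110.dll intentionally skipped: per the reply below,
    # ZLUDA's cudart is incomplete and should not be copied.
}
for src, dst in mapping.items():
    shutil.copyfile(zluda_dir / src, torch_lib / dst)
    print(f"copied {src} -> {dst}")
```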

The image gen: after a long compile, image generation works, but at the same speed as my normal no-cuDNN ZLUDA install. Also, I'm not able to upscale with hires fix: out-of-memory crash, black screen, driver timeout.

Problem with extensions: installed extensions are ADetailer, booru tag autocompletion, and Tiled Diffusion with Tiled VAE. ADetailer and Tiled VAE don't work at all, and the cmd shows this error. Full CMD log after launching, generating one image, then enabling only ADetailer and trying to generate again:

Already up to date.
venv "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\Scripts\Python.exe"
WARNING: ZLUDA works best with SD.Next. Please consider migrating to SD.Next.
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Version: v1.10.1-amd-25-g04bf93f1
Commit hash: 04bf93f1e8276526e695577df59fe37dd9bfaaee
ROCm: agents=['gfx1100', 'gfx1036']
ROCm: version=6.2, using agent gfx1100
ZLUDA support: experimental
ROCm hipBLASLt: arch=gfx1100 available=True
Using ZLUDA in D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\.zluda
No ROCm runtime is found, using ROCM_HOME='C:\Program Files\AMD\ROCm\6.2'
Skipping onnxruntime installation.
You are up to date with the most recent release.
Pulled changes for repository in 'D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\extensions\a1111-sd-webui-tagcomplete':
Already up to date.

Pulled changes for repository in 'D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\extensions\adetailer':
Already up to date.

Pulled changes for repository in 'D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\extensions\multidiffusion-upscaler-for-automatic1111':
Already up to date.

D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\timm\models\layers\__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\pytorch_lightning\utilities\distributed.py:258: LightningDeprecationWarning: `pytorch_lightning.utilities.distributed.rank_zero_only` has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from `pytorch_lightning.utilities` instead.
  rank_zero_deprecation(
Launching Web UI with arguments: --use-zluda --update-check --skip-ort --update-all-extensions --models-dir 'D:\Programme\AI-Zeug\stable-diffusion-webui-directml\models'
Tag Autocomplete: Could not locate model-keyword extension, Lora trigger word completion will be limited to those added through the extra networks menu.
[-] ADetailer initialized. version: 25.3.0, num models: 18
Loading weights [98a8837740] from D:\Programme\AI-Zeug\stable-diffusion-webui-directml\models\Stable-diffusion\SDXL\Illustrious\novaOrangeXL_v60.safetensors
Creating model from config: D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\repositories\generative-models\configs\inference\sd_xl_base.yaml
creating model quickly: OSError
Traceback (most recent call last):
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\utils\_http.py", line 409, in hf_raise_for_status
    response.raise_for_status()
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\requests\models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/None/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\transformers\utils\hub.py", line 342, in cached_file
    resolved_file = hf_hub_download(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 862, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 969, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 1486, in _raise_on_head_call_error
    raise head_call_error
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 1376, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 1296, in get_hf_file_metadata
    r = _request_wrapper(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 280, in _request_wrapper
    response = _request_wrapper(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\file_download.py", line 304, in _request_wrapper
    hf_raise_for_status(response)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\huggingface_hub\utils\_http.py", line 458, in hf_raise_for_status
    raise _format(RepositoryNotFoundError, message, response) from e
huggingface_hub.errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-67d604ea-42bd851f3343850e5ace5785;e3d223bc-1092-474b-bed4-2dac1f2191f6)

Repository Not Found for url: https://huggingface.co/None/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\webyo\AppData\Local\Programs\Python\Python310\lib\threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "C:\Users\webyo\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\webyo\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\modules\initialize.py", line 149, in load_model
    shared.sd_model  # noqa: B018
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\modules\shared_items.py", line 190, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\modules\sd_models.py", line 693, in get_sd_model
    load_model()
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\modules\sd_models.py", line 831, in load_model
    sd_model = instantiate_from_config(sd_config.model, state_dict)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\modules\sd_models.py", line 775, in instantiate_from_config
    return constructor(**params)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\repositories\generative-models\sgm\models\diffusion.py", line 61, in __init__
    self.conditioner = instantiate_from_config(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\repositories\generative-models\sgm\util.py", line 175, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\repositories\generative-models\sgm\modules\encoders\modules.py", line 88, in __init__
    embedder = instantiate_from_config(embconfig)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\repositories\generative-models\sgm\util.py", line 175, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\repositories\generative-models\sgm\modules\encoders\modules.py", line 361, in __init__
    self.transformer = CLIPTextModel.from_pretrained(version)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\modules\sd_disable_initialization.py", line 68, in CLIPTextModel_from_pretrained
    res = self.CLIPTextModel_from_pretrained(None, *model_args, config=pretrained_model_name_or_path, state_dict={}, **kwargs)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\transformers\modeling_utils.py", line 262, in _wrapper
    return func(*args, **kwargs)
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\transformers\modeling_utils.py", line 3540, in from_pretrained
    resolved_config_file = cached_file(
  File "D:\Programme\AI-Zeug\SD-Zluda-Webui\stable-diffusion-webui-amdgpu\venv\lib\site-packages\transformers\utils\hub.py", line 365, in cached_file
    raise EnvironmentError(
OSError: None is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

Failed to create model quickly; will retry using slow method.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 15.5s (prepare environment: 16.0s, initialize shared: 1.1s, other imports: 0.5s, load scripts: 1.2s, create ui: 0.8s, gradio launch: 1.4s).
Loading VAE weights specified in settings: D:\Programme\AI-Zeug\stable-diffusion-webui-directml\models\VAE\sdxl_vae_fp16.safetensors
Applying attention optimization: Doggettx... done.
Model loaded in 19.4s (load weights from disk: 0.4s, create model: 8.0s, apply weights to model: 8.9s, apply half(): 0.2s, load VAE: 0.6s, move model to device: 0.2s, load textual inversion embeddings: 0.2s, calculate empty prompt: 0.9s).
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:13<00:00,  2.21it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 30/30 [00:10<00:00,  2.85it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:10<00:00,  2.93it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 30/30 [00:09<00:00,  2.92it/s]
thread '<unnamed>' panicked at zluda_runtime\src\lib.rs:65:5:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at core\src\panicking.rs:223:5:
panic in a function that cannot unwind
stack backtrace:
   0:     0x7fff97d578d1 - _cudaRegisterTexture
   1:     0x7fff97d6486a - _cudaRegisterTexture
   2:     0x7fff97d560b7 - _cudaRegisterTexture
   3:     0x7fff97d57715 - _cudaRegisterTexture
   4:     0x7fff97d58c55 - _cudaRegisterTexture
   5:     0x7fff97d58a34 - _cudaRegisterTexture
   6:     0x7fff97d592e3 - _cudaRegisterTexture
   7:     0x7fff97d59132 - _cudaRegisterTexture
   8:     0x7fff97d5801f - _cudaRegisterTexture
   9:     0x7fff97d58d6e - _cudaRegisterTexture
  10:     0x7fff97d6cf65 - _cudaRegisterTexture
  11:     0x7fff97d6d013 - _cudaRegisterTexture
  12:     0x7fff97d6d091 - _cudaRegisterTexture
  13:     0x7fff97d52803 - _cudaPushCallConfiguration
  14:     0x7fffcbde1030 - <unknown>
  15:     0x7fffcbde4608 - is_exception_typeof
  16:     0x7fffedf11c26 - RtlCaptureContext2
  17:     0x7fff97d527ea - _cudaPushCallConfiguration
  18:     0x7ffee0263f97 - vision::cuda_version
  19:     0x7ffee0263bcb - vision::cuda_version
  20:     0x7ffee026339a - vision::cuda_version
  21:     0x7ffee0265ac8 - vision::cuda_version
  22:     0x7ffee0265e1a - vision::cuda_version
  23:     0x7ffee02653a3 - vision::cuda_version
  24:     0x7ffee02652d4 - vision::cuda_version
  25:     0x7ffee0265d4f - vision::cuda_version
  26:     0x7ffdd4dfc3ac - c10::Dispatcher::callBoxed
  27:     0x7fff70edd240 - torch::jit::invokeOperatorFromPython
  28:     0x7fff70eda2c7 - torch::jit::_get_operation_for_overload_or_packet
  29:     0x7fff70e42ca6 - registerPythonTensorClass
  30:     0x7fff70dea4e6 - registerPythonTensorClass
  31:     0x7fff7085140b - c10::ivalue::Future::devices
  32:     0x7fff99949eea - PyObject_IsTrue
  33:     0x7fff9998bdce - PyObject_Call
  34:     0x7fff9998becb - PyObject_Call
  35:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
  36:     0x7fff999a49d7 - PyFunction_Vectorcall
  37:     0x7fff9995a8af - PyObject_FastCallDictTstate
  38:     0x7fff99a681f4 - PyObject_Call_Prepend
  39:     0x7fff99a68150 - PyBytesWriter_Resize
  40:     0x7fff999a9892 - PyEval_EvalFrameDefault
  41:     0x7fff999a49d7 - PyFunction_Vectorcall
  42:     0x7fff999a7293 - PyEval_EvalFrameDefault
  43:     0x7fff999a49d7 - PyFunction_Vectorcall
  44:     0x7fff999abacd - PyEval_EvalFrameDefault
  45:     0x7fff999a6a94 - PyEval_EvalFrameDefault
  46:     0x7fff9994b58b - PyObject_GetDictPtr
  47:     0x7fff999be037 - PyGen_Finalize
  48:     0x7fff9996ff5b - PyMem_RawMalloc
  49:     0x7fff999a6385 - PyEval_EvalFrameDefault
  50:     0x7fff9994b58b - PyObject_GetDictPtr
  51:     0x7fff9994b47a - PyObject_GetDictPtr
  52:     0x7fff9994aa0f - PyObject_GetDictPtr
  53:     0x7fff9994a7c4 - PyObject_GetDictPtr
  54:     0x7fff999a6033 - PyEval_EvalFrameDefault
  55:     0x7fff999a49d7 - PyFunction_Vectorcall
  56:     0x7fff9995a917 - PyObject_FastCallDictTstate
  57:     0x7fff99a681f4 - PyObject_Call_Prepend
  58:     0x7fff99a68150 - PyBytesWriter_Resize
  59:     0x7fff9998ffbb - PyObject_MakeTpCall
  60:     0x7fff999ac39f - PyEval_EvalFrameDefault
  61:     0x7fff999a3615 - PyObject_GC_Malloc
  62:     0x7fff9998c00c - PyVectorcall_Call
  63:     0x7fff9998be87 - PyObject_Call
  64:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
  65:     0x7fff999a49d7 - PyFunction_Vectorcall
  66:     0x7fff9995a917 - PyObject_FastCallDictTstate
  67:     0x7fff99a681f4 - PyObject_Call_Prepend
  68:     0x7fff99a68150 - PyBytesWriter_Resize
  69:     0x7fff9998ffbb - PyObject_MakeTpCall
  70:     0x7fff999ac39f - PyEval_EvalFrameDefault
  71:     0x7fff999a49d7 - PyFunction_Vectorcall
  72:     0x7fff999abacd - PyEval_EvalFrameDefault
  73:     0x7fff999a8620 - PyEval_EvalFrameDefault
  74:     0x7fff999a49d7 - PyFunction_Vectorcall
  75:     0x7fff9998bfb0 - PyVectorcall_Call
  76:     0x7fff9998be87 - PyObject_Call
  77:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
  78:     0x7fff999a49d7 - PyFunction_Vectorcall
  79:     0x7fff999a36f3 - PyObject_GC_Malloc
  80:     0x7fff9998bfb0 - PyVectorcall_Call
  81:     0x7fff9998be87 - PyObject_Call
  82:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
  83:     0x7fff999a6a94 - PyEval_EvalFrameDefault
  84:     0x7fff999a49d7 - PyFunction_Vectorcall
  85:     0x7fff999a6033 - PyEval_EvalFrameDefault
  86:     0x7fff999a49d7 - PyFunction_Vectorcall
  87:     0x7fff999a7293 - PyEval_EvalFrameDefault
  88:     0x7fff999a49d7 - PyFunction_Vectorcall
  89:     0x7fff9998bfb0 - PyVectorcall_Call
  90:     0x7fff9998be87 - PyObject_Call
  91:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
  92:     0x7fff999a49d7 - PyFunction_Vectorcall
  93:     0x7fff9998bfb0 - PyVectorcall_Call
  94:     0x7fff9998be87 - PyObject_Call
  95:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
  96:     0x7fff999a49d7 - PyFunction_Vectorcall
  97:     0x7fff9998bfb0 - PyVectorcall_Call
  98:     0x7fff9998be87 - PyObject_Call
  99:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
 100:     0x7fff999a49d7 - PyFunction_Vectorcall
 101:     0x7fff9998bfb0 - PyVectorcall_Call
 102:     0x7fff9998be87 - PyObject_Call
 103:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
 104:     0x7fff999a49d7 - PyFunction_Vectorcall
 105:     0x7fff99b53c9d - PyContext_NewHamtForTests
 106:     0x7fff99b53f79 - PyContext_NewHamtForTests
 107:     0x7fff99964681 - PyArg_CheckPositional
 108:     0x7fff9998bfb0 - PyVectorcall_Call
 109:     0x7fff9998bd93 - PyObject_Call
 110:     0x7fff9998becb - PyObject_Call
 111:     0x7fff999ab5e7 - PyEval_EvalFrameDefault
 112:     0x7fff999a6a94 - PyEval_EvalFrameDefault
 113:     0x7fff999a6a94 - PyEval_EvalFrameDefault
 114:     0x7fff999a49d7 - PyFunction_Vectorcall
 115:     0x7fff999a3769 - PyObject_GC_Malloc
 116:     0x7fff9998bfb0 - PyVectorcall_Call
 117:     0x7fff9998bd93 - PyObject_Call
 118:     0x7fff99a18962 - PyRuntimeState_Fini
 119:     0x7fff99a188de - PyRuntimeState_Fini
 120:     0x7fffeb971bb2 - configthreadlocale
 121:     0x7fffebf87374 - BaseThreadInitThunk
 122:     0x7fffedebcc91 - RtlUserThreadStart
thread caused non-unwinding panic. aborting.
Press any key . . .

CS1o, Mar 15 '25

cudart.dll is not necessary and incomplete. Simply exclude it.

hipBLASLt will be disabled by DISABLE_ADDMM_CUDA_LT. Unset it or set =0 if you want to enable it. ROCm\6.2\bin\hipblaslt\library is fine if hipblaslt.dll is in 6.2\bin. (Note that hipBLASLt typically performs worse than rocBLAS.)

cuDNN requires a dev build on A1111. ~~dev.zip~~ (The dev branch is now merged, so just use the 3.9.1 nightly.)

There is no speed gain from enabling Flash Attention, but there is a big improvement in both speed and VRAM usage from enabling the MIOpen Conv2d solver. It is enabled by default if you are using a nightly build.
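To confirm what the torch inside the venv actually picks up after the DLL swap, the standard torch.backends introspection calls can be used. A minimal sketch (under ZLUDA the GPU reports as a CUDA device, typically with "[ZLUDA]" appended to its name):

```python
# Check what PyTorch sees after the ZLUDA DLL swap.
import torch

print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Under ZLUDA this is the AMD GPU, usually marked "[ZLUDA]".
    print("device:", torch.cuda.get_device_name(0))
print("cudnn available:", torch.backends.cudnn.is_available())
print("cudnn enabled:", torch.backends.cudnn.enabled)
print("cudnn version:", torch.backends.cudnn.version())
```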

lshqqytiger, Mar 16 '25

For 3.9.1, is the MIOpen Conv2d solver enabled, or do I have to download a nightly build too?

chain2k, Mar 17 '25

cuDNN is not included in the automated GitHub Actions builds because MIOpen is unavailable on the official HIP SDK releases. So you have to use a nightly build.

lshqqytiger, Mar 18 '25

Thanks for the reply. I upgraded my test webui to ZLUDA 3.9.1 nightly, reset the venv and .zluda folder, and replaced the .zluda files and torch/lib with the renamed ZLUDA files mentioned at the top. I also upgraded my normal webui to ZLUDA 3.9.1 for comparison.

The nightly is a bit faster, but not by much. VRAM usage is nearly the same.

Here are my results:

Auto1111 ZLUDA 3.9.1 (Illustrious model, sampler Euler a):
- 832x1216: 2.81 it/s (11.5 s)
- 832x1216 + hires fix upscale by 1.5, 10 hires steps, upscaler Resrgan4xAnime6b, Tiled VAE enabled: 26.5 seconds

Auto1111 ZLUDA 3.9.1 nightly (Illustrious model, sampler Euler a):
- 832x1216: 2.97 it/s (10.4 s)
- 832x1216 + hires fix upscale by 1.5, 10 hires steps, upscaler Resrgan4xAnime6b, Tiled VAE enabled: 23.6 seconds

A big difference happens when I latent-upscale 832x1216 by 2x. (This needs Tiled VAE enabled to avoid OOM.) My normal webui spills into shared VRAM for a short time at the VAE step but does produce an image. This is the VRAM usage: [image]

The ZLUDA nightly webui spills into shared VRAM too, but stays there during the Tiled VAE decode process and freezes the whole PC. [image]

Maybe I missed something while setting it up?

CS1o, Mar 18 '25

> cuDNN is not included in the automated GitHub Actions builds because MIOpen is unavailable on the official HIP SDK releases. So you have to use a nightly build.

Thanks. I tried cuDNN on 3.9.1 in ZLUDA-ComfyUI without success (gfx1150). I changed C:\Program Files\AMD\ROCm\6.2, replaced cublas64_11.dll, cusparse64_11.dll, cublasLt64_11.dll, nvrtc64_112_0.dll, and cudnn64_9.dll, and set ZLUDA_NIGHTLY=1 and DISABLE_ADDMM_CUDA_LT=1. I removed torch.backends.cudnn.enabled = False. Launched with --force-fp32 and without it; no changes. On the first workflow run after compilation, an error:

thread '<unnamed>' panicked at zluda_dnn\src\lib.rs:1365:14:
[ZLUDA] Unknown descriptor type: 12
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' panicked at library\core\src\panicking.rs:218:5:
panic in a function that cannot unwind
stack backtrace:
   0:     0x7ffdfc5624c1 - cudnnConvolutionBackwardData
   1:     0x7ffdfc56f5aa - cudnnConvolutionBackwardData
   2:     0x7ffdfc5608a7 - cudnnConvolutionBackwardData
   3:     0x7ffdfc562305 - cudnnConvolutionBackwardData
   4:     0x7ffdfc56391f - cudnnConvolutionBackwardData
   5:     0x7ffdfc563682 - cudnnConvolutionBackwardData
   6:     0x7ffdfc56406f - cudnnConvolutionBackwardData
   7:     0x7ffdfc563ec2 - cudnnConvolutionBackwardData
   8:     0x7ffdfc562bff - cudnnConvolutionBackwardData
   9:     0x7ffdfc563afe - cudnnConvolutionBackwardData
  10:     0x7ffdfc5783c5 - cudnnConvolutionBackwardData
  11:     0x7ffdfc578473 - cudnnConvolutionBackwardData
  12:     0x7ffdfc578555 - cudnnConvolutionBackwardData
  13:     0x7ffdfc55cb03 - cudnnBackendCreateDescriptor
  14:     0x7ffe191b1030 - <unknown>
  15:     0x7ffe191b4608 - is_exception_typeof
  16:     0x7ffe214438c6 - RtlCaptureContext2
  17:     0x7ffdfc55cae7 - cudnnBackendCreateDescriptor
  18:     0x7ffc3d02e4b8 - at::native::cudnn_convolution_transpose
  19:     0x7ffc3d01f4c3 - at::native::cudnn_convolution_transpose
  20:     0x7ffc3d022d32 - at::native::cudnn_convolution_transpose
  21:     0x7ffc3d02bfa7 - at::native::cudnn_convolution_transpose
  22:     0x7ffc3d032237 - at::native::cudnn_convolution_transpose
  23:     0x7ffc3d03031b - at::native::cudnn_convolution_transpose
  24:     0x7ffc3cffd78b - at::native::cudnn_convolution_add_relu
  25:     0x7ffc3cffef57 - at::native::cudnn_convolution_transpose
  26:     0x7ffc3cffe70e - at::native::cudnn_convolution_transpose
  27:     0x7ffc3ecaf1a4 - at::cuda::where_outf
  28:     0x7ffc3ebd4bf3 - at::cuda::bucketize_outf
  29:     0x7ffc8e08657c - at::TensorMaker::make_tensor
  30:     0x7ffc8e160543 - at::_ops::cudnn_convolution_transpose::call
  31:     0x7ffc8dab83a2 - at::native::_convolution
  32:     0x7ffc8e9cd0dd - at::compositeexplicitautograd::view_copy_symint_outf
  33:     0x7ffc8e99bcde - at::compositeexplicitautograd::bucketize_outf
  34:     0x7ffc8e086214 - at::TensorMaker::make_tensor
  35:     0x7ffc8e12df8e - at::_ops::_convolution::call
  36:     0x7ffc8dab73bb - at::native::sym_size
  37:     0x7ffc8dac376b - at::native::convolution
  38:     0x7ffc8e9cefe3 - at::compositeexplicitautograd::view_copy_symint_outf
  39:     0x7ffc8e99bdef - at::compositeexplicitautograd::bucketize_outf
  40:     0x7ffc8e0860b0 - at::TensorMaker::make_tensor
  41:     0x7ffc8e15c999 - at::_ops::convolution::call
  42:     0x7ffc8dac3156 - at::native::conv_transpose2d_symint
  43:     0x7ffc8eb6650b - at::compositeimplicitautograd::where
  44:     0x7ffc8eb442f3 - at::compositeimplicitautograd::broadcast_to_symint
  45:     0x7ffc8e085f99 - at::TensorMaker::make_tensor
  46:     0x7ffc8e446980 - at::_ops::conv_transpose2d_input::call
  47:     0x7ffc3a00b71f - THPPointer<_frame>::release
  48:     0x7ffc3a065863 - THPPointer<_frame>::release
  49:     0x7ffdebb49eea - PyObject_IsTrue
  50:     0x7ffdebba9892 - PyEval_EvalFrameDefault
  51:     0x7ffdebba49d7 - PyFunction_Vectorcall
  52:     0x7ffdebba36f3 - PyObject_GC_Malloc
  53:     0x7ffdebb8bfb0 - PyVectorcall_Call
  54:     0x7ffdebb8be87 - PyObject_Call
  55:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
  56:     0x7ffdebba49d7 - PyFunction_Vectorcall
  57:     0x7ffdebba36f3 - PyObject_GC_Malloc
  58:     0x7ffdebb8bfb0 - PyVectorcall_Call
  59:     0x7ffdebb8be87 - PyObject_Call
  60:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
  61:     0x7ffdebba49d7 - PyFunction_Vectorcall
  62:     0x7ffdebb5a8af - PyObject_FastCallDictTstate
  63:     0x7ffdebc681f4 - PyObject_Call_Prepend
  64:     0x7ffdebc68150 - PyBytesWriter_Resize
  65:     0x7ffdebbaa598 - PyEval_EvalFrameDefault
  66:     0x7ffdebba49d7 - PyFunction_Vectorcall
  67:     0x7ffdebba36f3 - PyObject_GC_Malloc
  68:     0x7ffdebb8bfb0 - PyVectorcall_Call
  69:     0x7ffdebb8be87 - PyObject_Call
  70:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
  71:     0x7ffdebba49d7 - PyFunction_Vectorcall
  72:     0x7ffdebba36f3 - PyObject_GC_Malloc
  73:     0x7ffdebb8bfb0 - PyVectorcall_Call
  74:     0x7ffdebb8be87 - PyObject_Call
  75:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
  76:     0x7ffdebba49d7 - PyFunction_Vectorcall
  77:     0x7ffdebb5a8af - PyObject_FastCallDictTstate
  78:     0x7ffdebc681f4 - PyObject_Call_Prepend
  79:     0x7ffdebc68150 - PyBytesWriter_Resize
  80:     0x7ffdebba9892 - PyEval_EvalFrameDefault
  81:     0x7ffdebba3615 - PyObject_GC_Malloc
  82:     0x7ffdebba7293 - PyEval_EvalFrameDefault
  83:     0x7ffdebba49d7 - PyFunction_Vectorcall
  84:     0x7ffdebb8c00c - PyVectorcall_Call
  85:     0x7ffdebb8be87 - PyObject_Call
  86:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
  87:     0x7ffdebba8620 - PyEval_EvalFrameDefault
  88:     0x7ffdebba49d7 - PyFunction_Vectorcall
  89:     0x7ffdebb5a917 - PyObject_FastCallDictTstate
  90:     0x7ffdebc681f4 - PyObject_Call_Prepend
  91:     0x7ffdebc68150 - PyBytesWriter_Resize
  92:     0x7ffdebb8bf03 - PyObject_Call
  93:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
  94:     0x7ffdebba49d7 - PyFunction_Vectorcall
  95:     0x7ffdebbabacd - PyEval_EvalFrameDefault
  96:     0x7ffdebba3615 - PyObject_GC_Malloc
  97:     0x7ffdebb8c00c - PyVectorcall_Call
  98:     0x7ffdebb8be87 - PyObject_Call
  99:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
 100:     0x7ffdebba49d7 - PyFunction_Vectorcall
 101:     0x7ffdebba6033 - PyEval_EvalFrameDefault
 102:     0x7ffdebba49d7 - PyFunction_Vectorcall
 103:     0x7ffdebbabacd - PyEval_EvalFrameDefault
 104:     0x7ffdebba49d7 - PyFunction_Vectorcall
 105:     0x7ffdebbabacd - PyEval_EvalFrameDefault
 106:     0x7ffdebba49d7 - PyFunction_Vectorcall
 107:     0x7ffdebba6033 - PyEval_EvalFrameDefault
 108:     0x7ffdebba6a94 - PyEval_EvalFrameDefault
 109:     0x7ffdebba49d7 - PyFunction_Vectorcall
 110:     0x7ffdebb8bfb0 - PyVectorcall_Call
 111:     0x7ffdebb8be87 - PyObject_Call
 112:     0x7ffdebbab5e7 - PyEval_EvalFrameDefault
 113:     0x7ffdebba6a94 - PyEval_EvalFrameDefault
 114:     0x7ffdebba6a94 - PyEval_EvalFrameDefault
 115:     0x7ffdebba49d7 - PyFunction_Vectorcall
 116:     0x7ffdebba3769 - PyObject_GC_Malloc
 117:     0x7ffdebb8bfb0 - PyVectorcall_Call
 118:     0x7ffdebb8bd93 - PyObject_Call
 119:     0x7ffdebc18962 - PyRuntimeState_Fini
 120:     0x7ffdebc188de - PyRuntimeState_Fini
 121:     0x7ffe1efc37b0 - wcsrchr
 122:     0x7ffe1fc5e8d7 - BaseThreadInitThunk
 123:     0x7ffe213dbf6c - RtlUserThreadStart
thread caused non-unwinding panic. aborting.

chain2k, Mar 19 '25

Without cuDNN it's really faster, much faster! Thanks! Maybe I need to not only remove torch.backends.cudnn.enabled = False but also enable torch.backends.cuda.enable_cudnn_sdp(True)?
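For context, these are two independent switches in PyTorch: torch.backends.cudnn.enabled gates cuDNN convolutions, while enable_cudnn_sdp selects the cuDNN scaled-dot-product-attention backend. A minimal sketch, assuming a recent PyTorch 2.x build where the SDP APIs exist:

```python
# cuDNN convolutions vs. the cuDNN attention backend are separate knobs.
import torch

# Gates cuDNN convolutions (routed to MIOpen under ZLUDA nightlies).
torch.backends.cudnn.enabled = True

# cuDNN attention backend; only present on recent PyTorch 2.x builds.
if hasattr(torch.backends.cuda, "enable_cudnn_sdp"):
    torch.backends.cuda.enable_cudnn_sdp(True)
    print("cudnn sdp:", torch.backends.cuda.cudnn_sdp_enabled())

# The other scaled-dot-product-attention backends, for comparison:
print("flash sdp:", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient sdp:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math sdp:", torch.backends.cuda.math_sdp_enabled())
```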

chain2k, Mar 19 '25