OneTrainer [Bug]: ZLUDA doesn't work

What happened?

The first problem is that there is no guide on how to launch install.bat with the USE_ZLUDA flag. Secondly, the installer doesn't ask whether you are an AMD user or want to use ZLUDA.

Now the big problem: when launching OneTrainer, it asks for a file in the HIP SDK directory that isn't there: C:\Program Files\AMD\ROCm\6.1\bin\hiprtc0507.dll

I don't know if that file is related to HIP SDK 5.7, but I have 6.1 installed and the file is not in that path or anywhere else.

What did you expect would happen?

Finding a description of how to start install.bat with ZLUDA.
Or being asked whether I'm an AMD user, so that it installs the ZLUDA components.
OneTrainer working with HIP SDK 6.1.

Relevant log output

activating venv D:\Programme\AI-Zeug\OneTrainer\venv
Using Python "D:\Programme\AI-Zeug\OneTrainer\venv\Scripts\python.exe"
Failed to load ZLUDA: Could not find module 'C:\Program Files\AMD\ROCm\6.1\bin\hiprtc0507.dll' (or one of its dependencies). Try using the full path with constructor syntax.

Output of `pip freeze`

absl-py==2.1.0
accelerate==0.30.1
aiohappyeyeballs==2.3.4
aiohttp==3.10.0
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.1.0
bitsandbytes==0.43.1
certifi==2024.7.4
charset-normalizer==3.3.2
cloudpickle==3.0.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.2.1
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
-e git+https://github.com/huggingface/diffusers.git@dd4b731e68f88f58dfabfb68f28e00ede2bb90ae#egg=diffusers
filelock==3.15.4
flatbuffers==24.3.25
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
ftfy==6.2.0
grpcio==1.65.4
huggingface-hub==0.23.3
humanfriendly==10.0
idna==3.7
importlib_metadata==8.2.0
intel-openmp==2021.4.0
invisible-watermark==0.2.0
Jinja2==3.1.4
kiwisolver==1.4.5
lightning-utilities==0.11.6
lion-pytorch==0.1.4
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@901d0767033aa33a73a647cb447df64f3df1c6bc#egg=mgds
mkl==2021.4.0
mpmath==1.3.0
multidict==6.0.5
networkx==3.3
numpy==1.26.4
omegaconf==2.3.0
onnxruntime-gpu==1.18.0
open-clip-torch==2.24.0
opencv-python==4.9.0.80
packaging==24.1
pillow==10.3.0
platformdirs==4.2.2
pooch==1.8.1
prodigyopt==1.0
protobuf==4.25.4
psutil==6.0.0
Pygments==2.18.0
pynvml==11.5.0
pyparsing==3.1.2
pyreadline3==3.4.1
python-dateutil==2.9.0.post0
pytorch-lightning==2.2.5
pytorch_optimizer==3.0.2
PyWavelets==1.6.0
PyYAML==6.0.1
regex==2024.7.24
requests==2.32.3
rich==13.7.1
safetensors==0.4.3
scalene==1.5.41
schedulefree==1.2.5
sentencepiece==0.2.0
six==1.16.0
sympy==1.13.1
tbb==2021.13.0
tensorboard==2.17.0
tensorboard-data-server==0.7.2
timm==1.0.8
tokenizers==0.19.1
torch==2.3.1+cu118
torchmetrics==1.4.1
torchvision==0.18.1+cu118
tqdm==4.66.4
transformers==4.42.3
typing_extensions==4.12.2
urllib3==2.2.2
wcwidth==0.2.13
Werkzeug==3.0.3
xformers==0.0.27+cu118
yarl==1.9.4
zipp==3.19.2

Aug 04 '24 18:08 CS1o

Unfortunately, neither Nerogar nor I have AMD cards, so we can't verify any changes needed to bring it up to date. I'll see if anyone in the Discord is having trouble and/or needs to change anything.

Aug 04 '24 19:08 mx

Related to #422

Aug 04 '24 20:08 LeagueRaINi

By default it tries to use HIP SDK 5.7. If you have 6.1 installed instead, you need to edit ZludaInstaller so that it installs a matching build of ZLUDA and uses the 6.1 DLL and path.
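To make the version mismatch concrete, here is a minimal sketch of how the DLL name could be derived from the installed HIP SDK version. It assumes the name follows the pattern visible in the error message (hiprtc0507.dll for 5.7, i.e. zero-padded major and minor) and that 6.1 follows the same pattern. The function names `hiprtc_dll_name` and `find_hiprtc` are hypothetical and are not taken from OneTrainer's installer.

```python
from pathlib import Path
from typing import Optional


def hiprtc_dll_name(hip_version: str) -> str:
    """Build the expected hiprtc DLL name for a version string like '5.7' or '6.1'.

    Assumption: the name is 'hiprtc' + zero-padded major + zero-padded minor,
    as suggested by 'hiprtc0507.dll' in the log above.
    """
    major, minor = hip_version.split(".")[:2]
    return f"hiprtc{major.zfill(2)}{minor.zfill(2)}.dll"


def find_hiprtc(rocm_root: Path, hip_version: str) -> Optional[Path]:
    """Return the DLL path under <rocm_root>/<version>/bin, or None if it is missing."""
    candidate = rocm_root / hip_version / "bin" / hiprtc_dll_name(hip_version)
    return candidate if candidate.is_file() else None
```

With a root of C:\Program Files\AMD\ROCm and version "6.1", this looks for hiprtc0601.dll instead of hiprtc0507.dll, which is the kind of lookup the 5.7 default does not perform.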

Aug 04 '24 20:08 AznamirWoW

2024-08-08 05:54:33.2622693 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-08-08 05:54:33.2661444 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-08-08 05:54:33.4331646 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_d4e0caa0782683b2ee97e3859f73dc9c>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:116 onnxruntime::CudaCall CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=DESKTOP_SILJA ; file=C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=182 ; expr=cudnnSetStream(cudnn_handle_, stream);

When trying to get captions.

Changing the installation line in requirements.txt from -r requirements-cuda.txt to -r requirements-default.txt "fixes" this issue somehow. ZLUDA also kind of works, but never actually starts.

Aug 08 '24 03:08 VeteranXT

https://www.phoronix.com/news/AMD-ZLUDA-CUDA-Taken-Down

Aug 08 '24 05:08 O-J1

https://www.phoronix.com/news/AMD-ZLUDA-CUDA-Taken-Down

This worked fine; it broke after an update. ZLUDA was taken down, so there are no further developments with it. What worked should still work.

Aug 08 '24 08:08 VeteranXT

CUDNN_STATUS_INTERNAL_ERROR

This happens because ZLUDA does not support cuDNN, so cuDNN has to be disabled in the application. Unfortunately, that is not possible for ONNX Runtime's CUDAExecutionProvider.

see https://github.com/microsoft/onnxruntime/discussions/18083
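Since cuDNN cannot be turned off inside the CUDAExecutionProvider, one possible workaround is to not select that provider at all when creating the ONNX Runtime session. The sketch below is an assumption about how that could look and is not what OneTrainer does; `providers_without_cuda` is a hypothetical helper name, while `get_available_providers()` and the `providers` argument of `InferenceSession` are existing ONNX Runtime APIs.

```python
def providers_without_cuda(available):
    """Return the given ONNX Runtime providers minus CUDAExecutionProvider.

    Falls back to the CPU provider if nothing else is left.
    """
    kept = [p for p in available if p != "CUDAExecutionProvider"]
    return kept or ["CPUExecutionProvider"]


# Usage (requires onnxruntime; "model.onnx" is a placeholder path):
#   import onnxruntime as ort
#   providers = providers_without_cuda(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

This trades GPU acceleration of the ONNX model for a session that initializes without calling into cuDNN.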

Aug 08 '24 09:08 AznamirWoW

Installed the latest version of it, without tinkering. [screenshot: Untitled]

This is what I get. Caching never goes off.

Aug 08 '24 10:08 VeteranXT

Installed the latest version of it, without tinkering.

The CUBLAS_STATUS error happens because of the unpatched torch/lib. OneTrainer uses the same DLL substitution as SD.Next, so you don't need to copy DLLs over manually. But as I said previously, if you're using HIP SDK 6.1, you need to manually edit ZludaInstaller.py after cloning but before running install.bat so that everything is installed properly.
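For readers unfamiliar with the DLL substitution mentioned above, the idea is to overwrite the CUDA libraries shipped in torch/lib with ZLUDA's replacements. The following is a minimal sketch of that idea only. The file names in `DLL_MAPPING` are assumptions for a cu118 torch build and may differ for other builds, and `substitute_dlls` is a hypothetical helper, not the code in ZludaInstaller.py.

```python
import shutil
from pathlib import Path

# Assumed mapping: ZLUDA file name -> file name that torch loads from torch/lib.
DLL_MAPPING = {
    "cublas.dll": "cublas64_11.dll",
    "cusparse.dll": "cusparse64_11.dll",
    "nvrtc.dll": "nvrtc64_112_0.dll",
}


def substitute_dlls(zluda_dir, torch_lib_dir, mapping=DLL_MAPPING):
    """Copy each ZLUDA DLL over its counterpart in torch/lib.

    Returns the list of target file names that were written.
    Source files that do not exist are skipped.
    """
    replaced = []
    for src_name, dst_name in mapping.items():
        src = Path(zluda_dir) / src_name
        if not src.is_file():
            continue
        shutil.copyfile(src, Path(torch_lib_dir) / dst_name)
        replaced.append(dst_name)
    return replaced
```

If this step is skipped or fails, torch keeps loading the original NVIDIA libraries, which matches the "unpatched torch/lib" situation described above.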

Aug 08 '24 16:08 AznamirWoW

I'm using HIP 5.7.0, so I don't need to edit anything? ROCm is 5.7. I didn't update anything.

Aug 08 '24 17:08 VeteranXT

I'm using HIP 5.7.0, so I don't need to edit anything? ROCm is 5.7. I didn't update anything.

HIP SDK 5.7 is what the OneTrainer installer uses by default.

Aug 09 '24 01:08 AznamirWoW

I tried @LeagueRaINi's OneTrainer fork, which includes his fix from here: https://github.com/Nerogar/OneTrainer/pull/422

It worked successfully with HIP SDK 6.1. The patch is also compatible with HIP SDK 5.7 and any future releases of the HIP SDK.

So please merge his commit from above into OneTrainer and everything should be fine.

Aug 09 '24 11:08 CS1o

I can confirm this works using an AMD RX 6600 XT. Thank you, CS1o!

Aug 11 '24 15:08 VeteranXT

Tentatively closing this, given that LeagueRaINi does not want to further develop the PR, no one on the OT dev team has an AMD GPU, and there are legal concerns. If someone finds official binaries that we can actually legally use, has a card, knows how to develop, and wants to continue this, then please talk with Nerogar in the Discord.

Oct 13 '24 17:10 O-J1

I have an AMD RX 6600 XT and I can help, but I need a link to the Discord.

Oct 15 '24 17:10 VeteranXT

I have an AMD RX 6600 XT and I can help, but I need a link to the Discord.

The permanent link is on the front page of the repo right at the very top. Just click on the “7XX online” blue button on said front page of the OT repo.

Please note you must satisfy everything I listed. People joining to harass Nero will be moderated accordingly.

Oct 15 '24 18:10 O-J1

Hey, it makes no sense that this got closed. LeagueRaINi provided a commit that only needs to be merged and that adds support for every present and upcoming HIP SDK version. We already tested his fix on several GPUs with ROCm HIP SDK 6.1 and it works without problems. ZLUDA also isn't abandoned and will continue to be developed. If legal concerns were an issue, why support ZLUDA with 5.7 at all?

Please merge the commit. The AMD community would really appreciate it! Or does LeagueRaINi need to add something to the PR?

Edit: In case anyone wants to use OneTrainer with ROCm HIP SDK 6.1 or 6.2, here is my setup guide, which uses lshggtiger's OneTrainer fork with ZLUDA. https://github.com/CS1o/Stable-Diffusion-Info/wiki/Lora-Trainer-Setup-Guides

Oct 22 '24 17:10 CS1o
