[Bug]: ZLUDA doesn't work
What happened?
First problem: there is no guide on how to launch install.bat with the USE_ZLUDA flag. Secondly, the installer doesn't ask whether you are an AMD user or want ZLUDA during install.
Now the big problem: when launching OneTrainer it asks for a file in HIP which isn't there: C:\Program Files\AMD\ROCm\6.1\bin\hiprtc0507.dll
I don't know if that file is related to HIP SDK 5.7, but I have 6.1 installed and it's not in that path or anywhere else.
What did you expect would happen?
- Finding a description of how to start install.bat with ZLUDA.
- Or being asked whether I'm an AMD user so it can install the ZLUDA components.
- OneTrainer working with HIP SDK 6.1.
Relevant log output
activating venv D:\Programme\AI-Zeug\OneTrainer\venv
Using Python "D:\Programme\AI-Zeug\OneTrainer\venv\Scripts\python.exe"
Failed to load ZLUDA: Could not find module 'C:\Program Files\AMD\ROCm\6.1\bin\hiprtc0507.dll' (or one of its dependencies). Try using the full path with constructor syntax.
Output of pip freeze
absl-py==2.1.0 accelerate==0.30.1 aiohappyeyeballs==2.3.4 aiohttp==3.10.0 aiosignal==1.3.1 antlr4-python3-runtime==4.9.3 async-timeout==4.0.3 attrs==24.1.0 bitsandbytes==0.43.1 certifi==2024.7.4 charset-normalizer==3.3.2 cloudpickle==3.0.0 colorama==0.4.6 coloredlogs==15.0.1 contourpy==1.2.1 customtkinter==5.2.2 cycler==0.12.1 dadaptation==3.2 darkdetect==0.8.0 -e git+https://github.com/huggingface/diffusers.git@dd4b731e68f88f58dfabfb68f28e00ede2bb90ae#egg=diffusers filelock==3.15.4 flatbuffers==24.3.25 fonttools==4.53.1 frozenlist==1.4.1 fsspec==2024.6.1 ftfy==6.2.0 grpcio==1.65.4 huggingface-hub==0.23.3 humanfriendly==10.0 idna==3.7 importlib_metadata==8.2.0 intel-openmp==2021.4.0 invisible-watermark==0.2.0 Jinja2==3.1.4 kiwisolver==1.4.5 lightning-utilities==0.11.6 lion-pytorch==0.1.4 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib==3.9.0 mdurl==0.1.2 -e git+https://github.com/Nerogar/mgds.git@901d0767033aa33a73a647cb447df64f3df1c6bc#egg=mgds mkl==2021.4.0 mpmath==1.3.0 multidict==6.0.5 networkx==3.3 numpy==1.26.4 omegaconf==2.3.0 onnxruntime-gpu==1.18.0 open-clip-torch==2.24.0 opencv-python==4.9.0.80 packaging==24.1 pillow==10.3.0 platformdirs==4.2.2 pooch==1.8.1 prodigyopt==1.0 protobuf==4.25.4 psutil==6.0.0 Pygments==2.18.0 pynvml==11.5.0 pyparsing==3.1.2 pyreadline3==3.4.1 python-dateutil==2.9.0.post0 pytorch-lightning==2.2.5 pytorch_optimizer==3.0.2 PyWavelets==1.6.0 PyYAML==6.0.1 regex==2024.7.24 requests==2.32.3 rich==13.7.1 safetensors==0.4.3 scalene==1.5.41 schedulefree==1.2.5 sentencepiece==0.2.0 six==1.16.0 sympy==1.13.1 tbb==2021.13.0 tensorboard==2.17.0 tensorboard-data-server==0.7.2 timm==1.0.8 tokenizers==0.19.1 torch==2.3.1+cu118 torchmetrics==1.4.1 torchvision==0.18.1+cu118 tqdm==4.66.4 transformers==4.42.3 typing_extensions==4.12.2 urllib3==2.2.2 wcwidth==0.2.13 Werkzeug==3.0.3 xformers==0.0.27+cu118 yarl==1.9.4 zipp==3.19.2
Unfortunately, neither I nor Nerogar have AMD cards, so we can't verify any changes to bring it up to date. I'll see if anyone in the Discord is having trouble and/or needs to change anything.
Related to #422
By default it tries to use HIP SDK 5.7. If you have 6.1 installed instead, you need to edit ZludaInstaller to install the correct build of ZLUDA and use the 6.1 DLL and path.
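For reference, here is a minimal sketch of the kind of edit meant here, assuming the installer hard-codes the 5.7 locations and that the hiprtc DLL follows the hiprtcMMmm.dll naming visible in the log (hiprtc0507.dll for 5.7). The function and names below are illustrative only, not the actual contents of ZludaInstaller.py:

```python
import os
import re
from pathlib import Path

# Illustrative sketch: derive the HIP SDK version from the installed ROCm
# path instead of hard-coding 5.7, so 6.1 (and later) also resolve.
def find_hiprtc_dll() -> Path:
    # HIP_PATH typically looks like C:\Program Files\AMD\ROCm\6.1\
    hip_path = Path(os.environ.get("HIP_PATH", r"C:\Program Files\AMD\ROCm\5.7"))
    match = re.search(r"(\d+)\.(\d+)", str(hip_path))
    if not match:
        raise RuntimeError(f"Could not parse HIP SDK version from {hip_path}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Assumes the DLL keeps the hiprtcMMmm.dll pattern seen in the log
    # (hiprtc0507.dll for 5.7, so presumably hiprtc0601.dll for 6.1).
    dll = hip_path / "bin" / f"hiprtc{major:02d}{minor:02d}.dll"
    if not dll.exists():
        raise FileNotFoundError(f"Expected hiprtc DLL not found: {dll}")
    return dll
```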
2024-08-08 05:54:33.2622693 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-08-08 05:54:33.2661444 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-08-08 05:54:33.4331646 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_d4e0caa0782683b2ee97e3859f73dc9c>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:116 onnxruntime::CudaCall CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=DESKTOP_SILJA ; file=C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=182 ; expr=cudnnSetStream(cudnn_handle_, stream);
This happens when trying to get captions.
Changing the requirements used at installation from -r requirements-cuda.txt to -r requirements-default.txt "fixes" this issue somehow. Also, ZLUDA kind of works but never actually starts.
https://www.phoronix.com/news/AMD-ZLUDA-CUDA-Taken-Down
This worked fine; it was broken after an update. ZLUDA's CUDA code was taken down and there is no additional development on it, but what worked before should still work.
CUDNN_STATUS_INTERNAL_ERROR
This happens because ZLUDA does not support cuDNN, so cuDNN has to be disabled in the application. Unfortunately that is not possible for ONNX Runtime's CUDAExecutionProvider.
see https://github.com/microsoft/onnxruntime/discussions/18083
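For anyone hitting this while captioning, here is a hedged sketch of the usual workaround: PyTorch exposes a global switch to turn cuDNN off, and for ONNX Runtime the practical option is to fall back to the CPU execution provider, since the CUDAExecutionProvider has no option to skip cuDNN. The model path below is a placeholder, and whether this suits your captioning model is an assumption:

```python
import torch
import onnxruntime as ort

# PyTorch side: ZLUDA has no cuDNN support, so disable cuDNN globally
# before any cuDNN-backed kernels are dispatched.
torch.backends.cudnn.enabled = False

# ONNX Runtime side: the CUDAExecutionProvider cannot run without cuDNN,
# so run the captioning model on the CPU execution provider instead.
# "caption_model.onnx" is a placeholder path.
session = ort.InferenceSession(
    "caption_model.onnx",
    providers=["CPUExecutionProvider"],
)
```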
Installed the latest version of it, without tinkering.
This is what I get; caching never completes.
The CUBLAS_STATUS error is happening because of the unpatched torch/lib. OneTrainer uses the same DLL substitution as SD.Next, so you don't need to copy DLLs over manually. But as I said previously, if you're using HIP SDK 6.1 you need to manually edit ZludaInstaller.py after cloning, but before running install.bat, to properly install everything.
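For context, the DLL substitution mentioned here works roughly like the sketch below: ZLUDA's replacement DLLs are copied over the CUDA runtime DLLs inside torch\lib. The exact file mapping is an assumption based on common ZLUDA-on-Windows guides for cu118 torch builds, not OneTrainer's actual installer code:

```python
import shutil
from pathlib import Path

# Assumed mapping from common ZLUDA guides (cu118 builds):
# ZLUDA DLL name -> the torch/lib DLL it replaces.
ZLUDA_DLL_MAP = {
    "cublas.dll": "cublas64_11.dll",
    "cusparse.dll": "cusparse64_11.dll",
    "nvrtc.dll": "nvrtc64_112_0.dll",
}

def patch_torch_with_zluda(zluda_dir: Path, torch_lib_dir: Path) -> None:
    """Copy ZLUDA DLLs over torch's CUDA DLLs (illustrative sketch only)."""
    for src_name, dst_name in ZLUDA_DLL_MAP.items():
        src = zluda_dir / src_name
        dst = torch_lib_dir / dst_name
        if not src.exists():
            raise FileNotFoundError(f"ZLUDA DLL missing: {src}")
        shutil.copyfile(src, dst)

# Example usage (paths are placeholders):
# patch_torch_with_zluda(Path("zluda"), Path("venv/Lib/site-packages/torch/lib"))
```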
I'm using HIP 5.7.0, so I don't need to edit anything? ROCm is 5.7. I didn't update anything.
HIP SDK 5.7 is what the OneTrainer installer uses by default.
I tried the OneTrainer fork of @LeagueRaINi, which includes his fix from here: https://github.com/Nerogar/OneTrainer/pull/422
It worked successfully with HIP SDK 6.1. The patch is also compatible with HIP SDK 5.7 and any future releases of the HIP SDK.
So please merge his commit from above into OneTrainer and everything should be fine.
I can confirm this works using an AMD RX 6600 XT. Thank you CS1o!
Tentatively closing this, given that LeagueRaINi does not want to develop the PR further, no one on the OT dev team has an AMD GPU, and there are legal concerns. If someone finds official binaries that we can actually legally use, has a card, knows how to develop, and wants to continue this, then please talk with Nerogar in the Discord.
I have an AMD RX 6600 XT and I can help, but a link to the Discord is needed.
The permanent link is on the front page of the repo, right at the very top. Just click the blue "7XX online" button on the front page of the OT repo.
Please note you must satisfy everything I listed. People joining to harass Nero will be moderated accordingly.
Hey, it makes no sense that this got closed. LeagueRaINi provided a commit that only has to be merged and that adds support for every present and upcoming HIP SDK version. We already tested his fix on several GPUs with ROCm HIP SDK 6.1 and it works without problems. ZLUDA also isn't abandoned and will continue to be developed. If legal concerns were an issue, why then support ZLUDA with 5.7?
Please merge the commit. The AMD community would really appreciate it! Or does LeagueRaINi need to add something to the PR?
Edit: In case anyone wants to use OneTrainer with ROCm HIP SDK 6.1 or 6.2, here is my setup guide, which uses lshqqytiger's OneTrainer fork with ZLUDA: https://github.com/CS1o/Stable-Diffusion-Info/wiki/Lora-Trainer-Setup-Guides