
[Bug]: ZLUDA doesn't work

Open CS1o opened this issue 1 year ago • 13 comments

What happened?

The first problem is that there is no guide on how to launch install.bat with the USE_ZLUDA flag. Secondly, the installer doesn't ask whether you are an AMD user or want to use ZLUDA.

Now the big problem: when launching OneTrainer, it asks for a file in the HIP SDK folder that isn't there: C:\Program Files\AMD\ROCm\6.1\bin\hiprtc0507.dll

I don't know if that file is related to HIP SDK 5.7, but I have 6.1 installed and the file is not in that path or anywhere else.

What did you expect would happen?

  • Finding a description of how to start install.bat with ZLUDA.
  • Or being asked whether I'm an AMD user so the installer can set up the ZLUDA components.
  • OneTrainer working with HIP SDK 6.1.

Relevant log output

activating venv D:\Programme\AI-Zeug\OneTrainer\venv
Using Python "D:\Programme\AI-Zeug\OneTrainer\venv\Scripts\python.exe"
Failed to load ZLUDA: Could not find module 'C:\Program Files\AMD\ROCm\6.1\bin\hiprtc0507.dll' (or one of its dependencies). Try using the full path with constructor syntax.

Output of pip freeze

absl-py==2.1.0
accelerate==0.30.1
aiohappyeyeballs==2.3.4
aiohttp==3.10.0
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.1.0
bitsandbytes==0.43.1
certifi==2024.7.4
charset-normalizer==3.3.2
cloudpickle==3.0.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.2.1
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
-e git+https://github.com/huggingface/diffusers.git@dd4b731e68f88f58dfabfb68f28e00ede2bb90ae#egg=diffusers
filelock==3.15.4
flatbuffers==24.3.25
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
ftfy==6.2.0
grpcio==1.65.4
huggingface-hub==0.23.3
humanfriendly==10.0
idna==3.7
importlib_metadata==8.2.0
intel-openmp==2021.4.0
invisible-watermark==0.2.0
Jinja2==3.1.4
kiwisolver==1.4.5
lightning-utilities==0.11.6
lion-pytorch==0.1.4
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@901d0767033aa33a73a647cb447df64f3df1c6bc#egg=mgds
mkl==2021.4.0
mpmath==1.3.0
multidict==6.0.5
networkx==3.3
numpy==1.26.4
omegaconf==2.3.0
onnxruntime-gpu==1.18.0
open-clip-torch==2.24.0
opencv-python==4.9.0.80
packaging==24.1
pillow==10.3.0
platformdirs==4.2.2
pooch==1.8.1
prodigyopt==1.0
protobuf==4.25.4
psutil==6.0.0
Pygments==2.18.0
pynvml==11.5.0
pyparsing==3.1.2
pyreadline3==3.4.1
python-dateutil==2.9.0.post0
pytorch-lightning==2.2.5
pytorch_optimizer==3.0.2
PyWavelets==1.6.0
PyYAML==6.0.1
regex==2024.7.24
requests==2.32.3
rich==13.7.1
safetensors==0.4.3
scalene==1.5.41
schedulefree==1.2.5
sentencepiece==0.2.0
six==1.16.0
sympy==1.13.1
tbb==2021.13.0
tensorboard==2.17.0
tensorboard-data-server==0.7.2
timm==1.0.8
tokenizers==0.19.1
torch==2.3.1+cu118
torchmetrics==1.4.1
torchvision==0.18.1+cu118
tqdm==4.66.4
transformers==4.42.3
typing_extensions==4.12.2
urllib3==2.2.2
wcwidth==0.2.13
Werkzeug==3.0.3
xformers==0.0.27+cu118
yarl==1.9.4
zipp==3.19.2

CS1o avatar Aug 04 '24 18:08 CS1o

Unfortunately, neither Nerogar nor I have AMD cards, so we can't verify any changes needed to bring it up to date. I'll see if anyone in the Discord is having trouble and/or needs to change anything.

mx avatar Aug 04 '24 19:08 mx

Related to #422

LeagueRaINi avatar Aug 04 '24 20:08 LeagueRaINi

By default, it tries to use HIP SDK 5.7. If you have 6.1 installed instead, you need to edit ZludaInstaller so that it installs a matching build of ZLUDA and uses the 6.1 DLL and path.
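To illustrate the version mismatch described above: the log asks for hiprtc0507.dll inside a 6.1 install, so a fixed file name tied to one SDK version cannot be found. Below is a minimal sketch of looking the DLL up by pattern instead of by a fixed name. It is not the actual ZludaInstaller code; the function name `find_hiprtc_dll` is hypothetical, and reading the SDK location from a `HIP_PATH` environment variable is an assumption about how the HIP SDK is set up on the machine.

```python
import os
import re
from pathlib import Path
from typing import Optional

# Matches versioned names such as "hiprtc0507.dll", but not
# "hiprtc-builtins0507.dll" (a digit must directly follow "hiprtc").
_HIPRTC_NAME = re.compile(r"hiprtc\d+\.dll", re.IGNORECASE)


def find_hiprtc_dll(bin_dir: Path) -> Optional[Path]:
    """Return the versioned hiprtc DLL found in bin_dir, or None.

    Hypothetical helper for illustration only. If several versions are
    present, the one that sorts last by file name is returned.
    """
    if not bin_dir.is_dir():
        return None
    candidates = sorted(
        p for p in bin_dir.iterdir() if _HIPRTC_NAME.fullmatch(p.name)
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    # Assumption: HIP_PATH points at the SDK root, e.g. ...\ROCm\6.1\
    hip_path = os.environ.get("HIP_PATH")
    if hip_path:
        print(find_hiprtc_dll(Path(hip_path) / "bin"))
    else:
        print("HIP_PATH is not set; pass the bin directory explicitly.")
```

With a lookup like this, the loader would pick up whatever hiprtc version the installed SDK ships rather than failing on a 5.7-specific file name. Whether the ZLUDA build itself is compatible with that SDK version is a separate question that this sketch does not address.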

AznamirWoW avatar Aug 04 '24 20:08 AznamirWoW

When trying to get captions, I get this:

2024-08-08 05:54:33.2622693 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-08-08 05:54:33.2661444 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-08-08 05:54:33.4331646 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_d4e0caa0782683b2ee97e3859f73dc9c>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:116 onnxruntime::CudaCall CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=DESKTOP_SILJA ; file=C:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=182 ; expr=cudnnSetStream(cudnn_handle_, stream);

Changing the line in requirements.txt from -r requirements-cuda.txt to -r requirements-default.txt "fixes" this issue somehow. Also, ZLUDA kind of works, but it never actually starts.

VeteranXT avatar Aug 08 '24 03:08 VeteranXT

https://www.phoronix.com/news/AMD-ZLUDA-CUDA-Taken-Down

O-J1 avatar Aug 08 '24 05:08 O-J1

https://www.phoronix.com/news/AMD-ZLUDA-CUDA-Taken-Down

This worked fine; it broke after an update. ZLUDA was taken down, so there is no further development on it. What worked before should still work.

VeteranXT avatar Aug 08 '24 08:08 VeteranXT

CUDNN_STATUS_INTERNAL_ERROR

This happens because ZLUDA does not support cuDNN, so cuDNN has to be disabled in the application. Unfortunately, that is not possible for ONNX Runtime's CUDAExecutionProvider.

see https://github.com/microsoft/onnxruntime/discussions/18083
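Since cuDNN cannot be switched off inside the CUDAExecutionProvider, one possible workaround is to not request that provider at all when running under ZLUDA and let ONNX Runtime fall back to the CPU. The sketch below shows only the selection logic. The function name `choose_onnx_providers` and the `using_zluda` flag are hypothetical and not part of OneTrainer; the provider names are the ones ONNX Runtime itself uses.

```python
from typing import List, Sequence


def choose_onnx_providers(available: Sequence[str], using_zluda: bool) -> List[str]:
    """Pick ONNX Runtime execution providers, skipping CUDA under ZLUDA.

    Hypothetical helper for illustration only. `available` is the list of
    providers the installed ONNX Runtime build reports.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    if using_zluda:
        # The CUDA provider initializes cuDNN, which ZLUDA does not support.
        preferred.remove("CUDAExecutionProvider")
    chosen = [name for name in preferred if name in available]
    # Always fall back to the CPU provider if nothing else matched.
    return chosen or ["CPUExecutionProvider"]
```

In practice, `available` would come from `onnxruntime.get_available_providers()`, and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`. On the PyTorch side, the usual switch for turning cuDNN off is `torch.backends.cudnn.enabled = False`. The trade-off is that captioning models then run on the CPU, which is slower but avoids the cuDNN call that fails in the log above.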

AznamirWoW avatar Aug 08 '24 09:08 AznamirWoW

I installed the latest version of it, without tinkering. [screenshot: Untitled]

This is what I get. Caching never goes off.

VeteranXT avatar Aug 08 '24 10:08 VeteranXT

Installed the latest version of it, without tinkering.

The CUBLAS_STATUS error happens because of the unpatched torch/lib. OneTrainer uses the same DLL substitution as SD.Next, so you don't need to copy DLLs over manually. But as I said previously, if you're using HIP SDK 6.1, you need to manually edit ZludaInstaller.py after cloning but before running install.bat so that everything installs properly.

AznamirWoW avatar Aug 08 '24 16:08 AznamirWoW

I'm using HIP 5.7.0, so I don't need to edit anything? ROCm is 5.7. I didn't update anything.

VeteranXT avatar Aug 08 '24 17:08 VeteranXT

I'm using HIP 5.7.0, so I don't need to edit anything? ROCm is 5.7. I didn't update anything.

HIP SDK 5.7 is what the OneTrainer installer uses by default.

AznamirWoW avatar Aug 09 '24 01:08 AznamirWoW

I tried @LeagueRaINi's OneTrainer fork, which includes his fix from here: https://github.com/Nerogar/OneTrainer/pull/422

It worked successfully with HIP SDK 6.1. The patch is also compatible with HIP SDK 5.7 and any future releases of the HIP SDK.

So please merge his commit from above into OneTrainer and everything should be fine.

CS1o avatar Aug 09 '24 11:08 CS1o

I can confirm this works with an AMD RX 6600 XT. Thank you, CS1o!

VeteranXT avatar Aug 11 '24 15:08 VeteranXT

Tentatively closing this, given that LeagueRaINi does not want to develop the PR further, no one on the OT dev team has an AMD GPU, and there are legal concerns. If someone finds official binaries that we can actually legally use, has a card, knows how to develop, and wants to continue this, then please talk with Nerogar in the Discord.

O-J1 avatar Oct 13 '24 17:10 O-J1

I have an AMD RX 6600 XT and I can help, but I need a link to the Discord.

VeteranXT avatar Oct 15 '24 17:10 VeteranXT

I have an AMD RX 6600 XT and I can help, but I need a link to the Discord.

The permanent link is at the very top of the front page of the OT repo. Just click the blue “7XX online” button there.

Please note you must satisfy everything I listed. People joining to harass Nero will be moderated accordingly.

O-J1 avatar Oct 15 '24 18:10 O-J1

Hey, it makes no sense that this got closed. LeagueRaINi provided a commit that only needs to be merged, and it adds support for every present and upcoming HIP SDK version. We have already tested his fix on several GPUs with ROCm HIP SDK 6.1 and it works without problems. ZLUDA also isn't abandoned and will continue to be developed. If legal concerns are an issue, then why support ZLUDA with 5.7?

Please merge the commit. The AMD community would really appreciate it! Or does LeagueRaINi need to add something to the PR?

Edit: In case anyone wants to use OneTrainer with ROCm HIP SDK 6.1 or 6.2, here is my setup guide, which uses lshqqytiger's OneTrainer fork with ZLUDA. https://github.com/CS1o/Stable-Diffusion-Info/wiki/Lora-Trainer-Setup-Guides

CS1o avatar Oct 22 '24 17:10 CS1o