
[Issue]: Could not load /opt/rocm-6.1.3/lib/rocblas/library/TensileLibrary.dat

Open unclemusclez opened this issue 1 year ago • 3 comments

Problem Description

rocblaslt error: Could not load /opt/rocm-6.1.3/lib/hipblaslt/library/TensileLibrary.dat
Segmentation fault

I ran cp /opt/rocm-6.1.3/lib/hipblaslt/library/TensileLibrary_gfx1100.dat /opt/rocm-6.1.3/lib/hipblaslt/library/TensileLibrary.dat because otherwise I don't know how to get this file.

I ran ./install.sh -idc --architecture 'gfx1100' --merge-files --static from the hipBLASLt repository.

Driver installation was via amdgpu-install -y --usecase=wsl,rocm --no-dkms
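When the loader cannot find TensileLibrary.dat, it can help to see which Tensile files the build or install actually produced. The following is only a sketch: `list_tensile_dat` is a hypothetical helper name (not part of ROCm), and the default directory is simply the one named in the error message above.

```shell
# Hypothetical helper (not part of ROCm): list the TensileLibrary*.dat
# files that sit directly inside a library directory.
# $1 = directory to inspect; defaults to the path from the error message.
list_tensile_dat() {
    tensile_dir="${1:-/opt/rocm-6.1.3/lib/hipblaslt/library}"
    if [ ! -d "$tensile_dir" ]; then
        echo "directory not found: $tensile_dir" >&2
        return 1
    fi
    # Print file names only; prints nothing when no file matches.
    find "$tensile_dir" -maxdepth 1 -name 'TensileLibrary*.dat' \
        -exec basename {} \; | sort
}

# Example call:
#   list_tensile_dat /opt/rocm-6.1.3/lib/hipblaslt/library
```

If only a per-architecture file such as TensileLibrary_gfx1100.dat shows up, that matches the situation described in this report.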

Operating System

WSL2 Ubuntu 22.04 Windows 11

CPU

7800x3d

GPU

AMD Radeon RX 7900 XT

Other

No response

ROCm Version

ROCm 6.1.3

ROCm Component

hipBLASLt

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

unclemusclez avatar Jun 21 '24 14:06 unclemusclez

Did you load it from PyTorch? You may also need to replace the libhipblaslt.so that ships with torch.

There was a similar issue raised by @lhl: https://github.com/ROCm/hipBLASLt/issues/831

minzhezhou avatar Jul 07 '24 18:07 minzhezhou

Regarding #831: it seems it is not installed with the 6.1.3 software, and there was no official release in the repository.

unclemusclez avatar Jul 08 '24 02:07 unclemusclez

I get this error on Linux as well: Ubuntu 24.04, 7900 XTX, ROCm 6.2.

Pointing HIPBLASLT_TENSILE_LIBPATH at hipBLASLt/build/release/Tensile/library causes the error below.

rocblaslt error: Cannot read /home/adminl/hipBLASLt/build/release/Tensile/library/TensileLibrary.dat: No such file or directory

rocblaslt error: Could not load /home/adminl/hipBLASLt/build/release/Tensile/library/TensileLibrary.dat
Segmentation fault (core dumped)
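Before launching the application, it may be worth confirming that the directory named by HIPBLASLT_TENSILE_LIBPATH really contains the file the error message asks for. This is a rough sketch only: `check_tensile_libpath` is a hypothetical helper, and it checks nothing beyond the presence of TensileLibrary.dat.

```shell
# Hypothetical pre-flight check (not part of hipBLASLt): verify that the
# directory in HIPBLASLT_TENSILE_LIBPATH holds TensileLibrary.dat.
# Returns 0 if found, 1 if the file is missing, 2 if the variable is unset.
check_tensile_libpath() {
    libpath="${HIPBLASLT_TENSILE_LIBPATH:-}"
    if [ -z "$libpath" ]; then
        echo "HIPBLASLT_TENSILE_LIBPATH is not set" >&2
        return 2
    fi
    if [ -f "$libpath/TensileLibrary.dat" ]; then
        echo "found: $libpath/TensileLibrary.dat"
        return 0
    fi
    echo "missing: $libpath/TensileLibrary.dat" >&2
    return 1
}

# Example (note: no spaces around '=' in a shell assignment):
#   export HIPBLASLT_TENSILE_LIBPATH="$HOME/hipBLASLt/build/release/Tensile/library"
#   check_tensile_libpath
```

A "missing" result here corresponds to the "Cannot read ... No such file or directory" line in the output above.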

sleppyrobot avatar Sep 08 '24 17:09 sleppyrobot

Hi @unclemusclez. Internal ticket has been created to investigate your issue. Thanks!

ppanchad-amd avatar Oct 23 '24 18:10 ppanchad-amd

> Hi @unclemusclez. Internal ticket has been created to investigate your issue. Thanks!

@ppanchad-amd this may be fixable with an ln -s from the corresponding lazy-load *.dat file to .../TensileLibrary.dat
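A minimal sketch of that symlink idea, for illustration only. `link_tensile_dat` is a hypothetical helper, the per-architecture file name follows the TensileLibrary_gfx1100.dat pattern mentioned earlier in this thread, and whether the loader accepts the per-architecture file under the generic name is an untested assumption.

```shell
# Hypothetical helper (not part of ROCm): link the per-architecture Tensile
# file to the generic name the loader asked for.
# $1 = library directory, $2 = architecture, e.g. gfx1100
link_tensile_dat() {
    lib_dir="$1"
    arch="$2"
    src="$lib_dir/TensileLibrary_${arch}.dat"
    dst="$lib_dir/TensileLibrary.dat"
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    # Never clobber an existing file or link.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "refusing to overwrite existing $dst" >&2
        return 1
    fi
    # Relative target, so the link survives if the directory is moved.
    ln -s "TensileLibrary_${arch}.dat" "$dst"
}

# Example call (directory is a placeholder; may need root under /opt):
#   link_tensile_dat /path/to/hipblaslt/library gfx1100
```

Unlike a cp, the link keeps pointing at the original file, so a rebuilt per-architecture library is picked up without copying it again.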

Thank you for looking into this. I am curious about any progress. At the moment I am exploring multiple GFX platforms, including gfx906 and gfx1100.

unclemusclez avatar Oct 23 '24 20:10 unclemusclez

Hi @unclemusclez, when is this issue occurring? I can see the place in the source code where this error is emitted, and it looks like it should be picking up TensileLibrary_gfx1100.dat; not sure yet why it isn't so I'll try to reproduce this.

schung-amd avatar Oct 25 '24 19:10 schung-amd

@schung-amd it's been some time since I tried to compile ROCm for my Windows WSL machine. I think this might be an issue with bitsandbytes, but I don't remember at this point; this issue is from 4 months ago. Perhaps you cannot replicate this because it relates to the kernel, which does not exist on WSL Linux.

From my experience, if it's not working, I just don't worry about it until there is a new WSL-Windows driver update for ROCm.

ROCm 6.1.2 is nice, but really we need 6.2 on Windows. That will bring everything up to date with the modern capabilities of PyTorch, and CUDA Cooperative Groups will be supported. The current Windows GPU drivers are not even working correctly: we have to downgrade, or shared memory is used by default. It's very difficult to troubleshoot the versions, sources, and environments of things when I'm actively trying to do work.

I'll follow up on this at some point when I come across it again.

unclemusclez avatar Oct 26 '24 00:10 unclemusclez

This should be addressed in ROCm 6.2 with lazy loading (https://github.com/ROCm/hipBLASLt/commit/28eb8258d967f3ccaab5aed891bf40d62cdd099d), so hopefully once WSL for 6.2 is released this is fixed.

> ROCm 6.1.2 is nice, but really we need 6.2 on Windows. That will bring everything up to date with the modern capabilities of PyTorch, and CUDA Cooperative Groups will be supported.

Unfortunately we have no plans at this time to add cooperative groups support on Windows.

@sleppyrobot Are you still encountering this error on Ubuntu? If so, can you provide some steps to reproduce it?

schung-amd avatar Oct 28 '24 18:10 schung-amd

Hey, no, the issue went away when I changed the PyTorch version.

As for the steps to reproduce: I was using ComfyUI and an SDXL model with ROCm 6.2 and PyTorch 2.5. Any version from August to early September would trigger the error. You also need to link or point to the hipBLASLt library. @schung-amd

sleppyrobot avatar Oct 28 '24 19:10 sleppyrobot

> Unfortunately we have no plans at this time to add cooperative groups support on Windows.

@schung-amd this is a necessity for a lot of video and 3D AI Python applications due to their dependency on https://github.com/graphdeco-inria/diff-gaussian-rasterization

Is there any way to have this reprioritized or looked at? For Unreal/Blender pipelines this would be incredible. It is a major reason why I am considering switching to an entirely Linux platform at the moment. I just don't have the resources or time to switch everything.

Of course, there was ZLUDA.

unclemusclez avatar Oct 29 '24 05:10 unclemusclez

I've seen other requests for cooperative groups support on Windows and am reaching out internally to push for support if feasible. That being said, I am unaware of the reason we are not supporting it at this time (i.e. there may be technical barriers) and wouldn't expect support in the near future.

schung-amd avatar Oct 29 '24 13:10 schung-amd