[Bug]: rocBLAS error: Cannot read TensileLibrary.dat: No such file or directory
Describe the bug
Basically I'm getting some form of this error: either rocBLAS error: Cannot read /opt/rocm-5.4.0/lib/rocblas/library/TensileLibrary.dat: Illegal seek or Cannot read TensileLibrary.dat: No such file or directory
To Reproduce
rocblas-dev (= 2.46.0.50400-72~22.04)
Steps to reproduce the behavior:
- Basically followed the guide in the AMD docs to install ROCm, along with almost all use cases, because I kept having errors from missing packages.
- Built a wheel for onnxruntime using Docker
- Tried to run the Python application (roop) and got this error
Expected behavior
No error?
Log-files
(roop) hobi@hobi:~/roop$ python run.py --execution-provider rocm --execution-threads
usage: run.py [-h] [-s SOURCE_PATHS] [-t TARGET_PATHS] [-o OUTPUT_PATH] [--frame-processor {face_swapper,face_enhancer} [{face_swapper,face_enhancer} ...]] [--keep-fps] [--keep-audio] [--keep-frames] [--keep-filenames] [--many-faces]
[--video-encoder {libx264,libx265,libvpx-vp9}] [--video-quality [0-51]] [--max-memory MAX_MEMORY] [--execution-provider {rocm,cpu} [{rocm,cpu} ...]] [--execution-threads EXECUTION_THREADS] [-v]
run.py: error: argument --execution-threads: expected one argument
(roop) hobi@hobi:~/roop$ python run.py --execution-provider rocm --execution-threads 2
[ROOP.CORE] Creating temp resources...
[ROOP.CORE] Extracting frames...
Applied providers: ['ROCMExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'ROCMExecutionProvider': {'tunable_op_tuning_enable': '0', 'do_copy_in_default_stream': '1', 'miopen_conv_exhaustive_search': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'miopen_conv_use_max_workspace': '1', 'gpu_mem_limit': '18446744073709551615', 'tunable_op_enable': '0', 'gpu_external_alloc': '0', 'device_id': '0'}}
find model: /home/hobi/.insightface/models/buffalo_l/1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['ROCMExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'ROCMExecutionProvider': {'tunable_op_tuning_enable': '0', 'do_copy_in_default_stream': '1', 'miopen_conv_exhaustive_search': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'miopen_conv_use_max_workspace': '1', 'gpu_mem_limit': '18446744073709551615', 'tunable_op_enable': '0', 'gpu_external_alloc': '0', 'device_id': '0'}}
find model: /home/hobi/.insightface/models/buffalo_l/2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['ROCMExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'ROCMExecutionProvider': {'tunable_op_tuning_enable': '0', 'do_copy_in_default_stream': '1', 'miopen_conv_exhaustive_search': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'miopen_conv_use_max_workspace': '1', 'gpu_mem_limit': '18446744073709551615', 'tunable_op_enable': '0', 'gpu_external_alloc': '0', 'device_id': '0'}}
find model: /home/hobi/.insightface/models/buffalo_l/det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['ROCMExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'ROCMExecutionProvider': {'tunable_op_tuning_enable': '0', 'do_copy_in_default_stream': '1', 'miopen_conv_exhaustive_search': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'miopen_conv_use_max_workspace': '1', 'gpu_mem_limit': '18446744073709551615', 'tunable_op_enable': '0', 'gpu_external_alloc': '0', 'device_id': '0'}}
find model: /home/hobi/.insightface/models/buffalo_l/genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['ROCMExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'ROCMExecutionProvider': {'tunable_op_tuning_enable': '0', 'do_copy_in_default_stream': '1', 'miopen_conv_exhaustive_search': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'miopen_conv_use_max_workspace': '1', 'gpu_mem_limit': '18446744073709551615', 'tunable_op_enable': '0', 'gpu_external_alloc': '0', 'device_id': '0'}}
find model: /home/hobi/.insightface/models/buffalo_l/w600k_r50.onnx recognition ['None', 3, 112, 112] 127.5 127.5
set det-size: (640, 640)
rocBLAS error: Cannot read /opt/rocm-5.4.0/lib/rocblas/library/TensileLibrary.dat: No such file or directory
Aborted (core dumped)
Environment
| Hardware | description |
|---|---|
| CPU | Ryzen 5 5600 |
| GPU | Radeon RX 6700 XT |
| Software | version |
|---|---|
| rocm-core | v5.4.0.50400-72~22.04 |
| rocblas | v2.46.0.50400-72~22.04 |
Make sure that ROCm is correctly installed. To capture detailed environment information, run the following command:
printf '=== environment\n' > environment.txt &&
printf '\n\n=== date\n' >> environment.txt && date >> environment.txt &&
printf '\n\n=== Linux Kernel\n' >> environment.txt && uname -a >> environment.txt &&
printf '\n\n=== rocm-smi\n' >> environment.txt && rocm-smi >> environment.txt &&
printf '\n\n=== hipconfig\n' >> environment.txt && hipconfig >> environment.txt &&
printf '\n\n=== rocminfo\n' >> environment.txt && rocminfo >> environment.txt &&
printf '\n\n=== lspci VGA\n' >> environment.txt && lspci | grep -i vga >> environment.txt
Getting this error: `No LSB modules are available.`
Additional context
I am super new to machine learning and I am having a nightmare of a time making things work with ROCm. Pretty much at the end of my rope here, guys. Any help would be appreciated. Thank you.
Hi @slipperyslipped. Your GPU uses the gfx1031 instruction set, but the binaries distributed by AMD are not built for that architecture as it is not officially supported. However, the gfx1030 instruction set is identical to the gfx1031 instruction set in all but name. For this reason, there are ways to get the existing binaries running on your GPU.
As a workaround, I would recommend setting the environment variable export HSA_OVERRIDE_GFX_VERSION=10.3.0. This will cause your GPU to report that it supports the gfx1030 instruction set, which is included in the AMD-provided binaries. I've confirmed that this works correctly with rocBLAS on the RX 6750 XT. I believe this workaround is generally applicable to any discrete RDNA 2 GPUs.
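A minimal shell sketch of applying that override (the commented-out run command is just an example; use whatever command you normally launch):

```shell
# Export the override for the current shell session, then verify it took effect.
# 10.3.0 corresponds to gfx1030; per the discussion here, only use it on RDNA 2 parts.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
echo "$HSA_OVERRIDE_GFX_VERSION"   # prints 10.3.0

# Or set it for a single invocation only, e.g.:
#   HSA_OVERRIDE_GFX_VERSION=10.3.0 python run.py --execution-provider rocm
```

To make it permanent, the export line can be added to ~/.profile or ~/.bashrc.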
Hi, I was blocked by the same problem "rocBLAS error: Cannot read /home/bc250/Desktop/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary.dat: No such file or directory" when launching stable-diffusion-webui.
I am using a gfx1013 device. Can I set PyTorch not to use rocBLAS for this?
I'm not an expert on PyTorch, but the gfx1013 ISA is a superset of the gfx1010 ISA. You can set export HSA_OVERRIDE_GFX_VERSION=10.1.0 and it will probably work. With that said, it is obviously not an officially supported configuration. You may want to build and run the rocBLAS test suite to check that the library functions correctly on your hardware with that workaround.
@cgmb gfx1010 produces the same issue:
$ drun --rm rocm/dev-ubuntu-22.04:5.6-complete
root@ftl:/# ls -1 /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx*
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat
As you can see, there is no TensileLibrary_lazy_gfx1010.dat in the official container. While rocBLAS does build with gfx1010 enabled, Tensile is not producing the library; see https://github.com/ROCmSoftwarePlatform/Tensile/issues/1757
Thanks @ulyssesrr. That's a great analysis of the problem.
It's perhaps worth noting that the OS-provided rocBLAS packages on Debian 13 (Testing/Trixie) and the upcoming Ubuntu 23.10 (Mantic Minotaur) build Tensile with --merge-architectures --no-lazy-library-loading. For users on RDNA 1 hardware, that may be a good option until the problem is fixed in the AMD releases.
The OS-provided package for rocBLAS on Debian/Ubuntu also automatically handles loading code objects for ISAs that are known to be compatible as I'd suggested earlier in this thread. For this reason, the OS-provided package has much wider hardware compatibility than the AMD-provided package on GFX9 and GFX10 hardware.
I have not tested the OS-provided packages on all hardware platforms, but the tests are also packaged in the OS package librocblas0-tests (which entered Debian Unstable today and should migrate to Trixie next week), so you can run the tests on your own system to determine if it will work on your hardware.
Just mentioning it, since that's probably a useful workaround for some people on hardware that is not officially supported. Even folks on other operating systems could potentially spin up a docker container with an Ubuntu or Debian image and apt install librocblas-dev.
@cgmb I forgot to mention that the rocBLAS build script in 5.6.0 seems to have an issue where --merge-architectures and --no-lazy-library-loading have no effect; I stumbled on that when trying the workaround.
The rmake.py script treats the cmake flags Tensile_LAZY_LIBRARY_LOADING and Tensile_SEPARATE_ARCHITECTURES as opt-in.
https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/a5ef7c59507e6601a18539f42a088fe63bffaa5a/rmake.py#L371-L374
However, I was getting them enabled by default, so I actually had to opt out. I'm guessing it is being done here: https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/a5ef7c59507e6601a18539f42a088fe63bffaa5a/cmake/build-options.cmake#L75-L76
I didn't debug much, just rolled a patch and went on my way (which I ended up not needing, as I patched the Tensile issue instead): https://github.com/ulyssesrr/docker-rocm-gfx803/blob/main/rocm-xtra-rocblas-builder/patches/deactivated/rocBLAS-fix_cmake_options.patch
As I didn't debug much, I didn't feel confident to open an Issue.
FYI seeing what seems to be the same TensileLibrary.dat: Illegal seek issue on Ubuntu 22.04 LTS and Radeon Software for Linux 23.20.
GPU is a 7800 XT.
Stack from running a basic PyTorch example under GDB is shown below. I did have to override the gfx version to either 11.0.0 or 11.0.1 for it to see the GPU at all, but I forget which.
rocBLAS error: Cannot read /home/redacted/venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary.dat: Illegal seek
Thread 1 "python3" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352507392) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352507392, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fff4e341ccf in rocblas_abort_once() () from /home/redacted/venv/lib/python3.10/site-packages/torch/lib/librocblas.so
#6  0x00007fff4e341c49 in rocblas_abort () from /home/redacted/venv/lib/python3.10/site-packages/torch/lib/librocblas.so
#7  0x00007fff4dc44633 in (anonymous namespace)::TensileHost::initialize(Tensile::hip::SolutionAdapter&, int) ()
    from /home/redacted/venv/lib/python3.10/site-packages/torch/lib/librocblas.so
#8  0x00007fff4dc33929 in (anonymous namespace)::get_library_and_adapter(std::shared_ptr<Tensile::MasterSolutionLibrary<Tensile::ContractionProblem, Tensile::ContractionSolution> >*, std::shared_ptr<hipDeviceProp_t>*, int) () from /home/redacted/venv/lib/python3.10/site-packages/torch/lib/librocblas.so
#9  0x00007fff4dc46b6c in rocblas_status_ runContractionProblem<float, float, float>(RocblasContractionProblem<float, float, float> const&, rocblas_gemm_algo_, int) ()
    from /home/redacted/venv/lib/python3.10/site-packages/torch/lib/librocblas.so
[SNIP]
FYI seeing what seems to be the same TensileLibrary.dat: Illegal seek issue on Ubuntu 22.04 LTS and Radeon Software for Linux 23.20. GPU is a 7800 XT. [...]
Did you only install the Radeon Software or did you also install ROCm?
@YellowRoseCx Yes, ROCm was installed. But there were some errors, and perhaps there is a version mismatch. I have since reinstalled the whole machine; here is the current state:
| Software | version |
|---|---|
| rocm-core | 5.7.0.50700-45~22.04 |
| rocblas | 3.1.0.50700-45~22.04 |
| uname -r | 6.2.0-32-generic |
| rocminfo | [...] gfx1101 |
Same segfault and stack looks similar.
Here is a basic log of what I tried this time:
python3 -m venv ptroc561-nightly
cd ptroc561-nightly/
source bin/activate
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.6/
python3 -c 'import torch; print(torch.cuda.is_available())'
True
git clone https://github.com/pytorch/examples.git
cd examples/mnist
python3 main.py
rocBLAS error: Cannot read /path/to/venv/ptroc561-nightly/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary.dat: Illegal seek
Aborted (core dumped)
This time I opted for the AMDGPU installer flow option in the ROCm install guide. Running the installer from amdgpu-install_5.6.50601-1_all.deb as specified did not result in a system where rocminfo saw a GPU. A newer amdgpu-install_5.7.50700-1_all.deb file I found on the server seemed to work, but the error is still the same as before. No env overrides needed this time, oddly enough.
Note the PyTorch repo is nightly/rocm5.6/. When I tried to substitute nightly/rocm5.7/, it just installed some CUDA flavors. I'm attempting to build PyTorch for ROCm from source on bare metal. We'll see how that goes.
GPU is a 7800 XT.
Stack from running a basic PyTorch example under GDB is shown below. I did have to override gfx version to either 11.0.0 or 11.0.1 for it to see GPU at all but I forget which.
The RX 7800 XT (Navi 32) is gfx1101. You likely were overriding the gfx version to 11.0.0. However, that is not safe. The gfx1100 ISA has more registers than the gfx1101 ISA and there are other important differences in the ABI too.
With Navi 21/22/23/24, the gfx version override approach more or less worked, despite not being officially supported. Users executed code built for Navi 21 on any of those chips, and I don't know of any problems encountered from doing so. The compiler handled each of those ISAs identically. Navi 31/32/33 are not like that. There are known differences between those chips that the compiler accounts for when it generates code for each architecture.
(This isn't the cause of the specific TensileLibrary.dat error you encountered, but it's a warning that you may encounter other problems even once the Tensile issue is resolved, if you're using that override.)
@cgmb Thanks for the ISA incompatibility heads up for Navi 31/32/33. Good to know.
I actually had just started going through the RDNA 3 ISA doc, but did not notice any chip-specific differences called out so far. Is there other documentation I should review, or will there eventually be updates highlighting the differences? Since this is off-topic for this issue, is there a better place to follow (or open) an issue with respect to documentation?
JFYI, I got stable-diffusion-webui working with ROCm 5.7 on a Phoenix APU (7840U) by setting it to 11.0.0:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Without this override I got:
rocBLAS error: Cannot read /home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary.dat:
No such file or directory for GPU arch : gfx1103
List of available TensileLibrary Files :
"/home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/home/shtirlic/stable-diffusion-webui/venv/lib/python3.11/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
For other archs such as gfx1103, I think the right way to use them is to generate a new TensileLibrary.dat file to get optimal performance. Do we have a way to trigger this process?
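One quick way to check which ISAs a given rocBLAS build actually ships Tensile libraries for is to scan the file names. A small helper sketch (the function name is invented for illustration; the sample file names mirror the listing above):

```shell
# list_gfx_targets: print the unique gfx ISA names found in TensileLibrary file names.
# Typical use: ls /path/to/rocblas/library | list_gfx_targets
list_gfx_targets() {
  grep -o 'gfx[0-9a-f]*' | sort -u
}

# Sample input mirroring the listing above:
printf '%s\n' TensileLibrary_lazy_gfx1030.dat TensileLibrary_lazy_gfx1100.dat \
  | list_gfx_targets
# prints:
#   gfx1030
#   gfx1100
```

If the arch reported by rocminfo is missing from that list, you will hit this error unless you use an override or rebuild the library.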
@TorreZuk can you take a look or merge it please 😢 https://github.com/ROCm/Tensile/pull/1862 My code won't run without it on my RX 6600 XT.
@hiepxanh sure I will push to see if it can get reviewed sooner rather than later.
Tried to get my 6650 XT to work with llama.cpp by installing rocm-hip-sdk, and got the same error after what I think was a failed build on first launch:
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1032 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/*****/.llamafile/ggml-rocm.so.ikigfn /home/*****/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/*****/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q4_K(
^
/home/*****/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q5_K(
^
/home/*****/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q6_K(
^
/home/*****/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
^
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
14 warnings generated when compiling for gfx1032.
/home/*****/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/*****/.llamafile/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with hipBLAS
link_cuda_dso: GPU support linked
rocBLAS error: Cannot read /opt/rocm-5.6.1/lib/rocblas/library/TensileLibrary.dat: No such file or directory
Aborted (core dumped)
Launching through the GPU again now just gives me that last error.
@NaturalHate, build for gfx1030 and run with export HSA_OVERRIDE_GFX_VERSION=10.3.0 set in your environment.
@NaturalHate, build for gfx1030 and run with export HSA_OVERRIDE_GFX_VERSION=10.3.0 set in your environment.
If i have to build it myself then I guess I'll pass.
No, I can send it to you. If you use an RX 6600, a lot of people have already built it. Just copy-paste and it runs.
I don't. I use a 6650 XT.
@NaturalHate, build for gfx1030 and run with export HSA_OVERRIDE_GFX_VERSION=10.3.0 set in your environment.
@hiepxanh Hey, taking a moment to thank you :) I use an RX 6600 XT and the environment variable saved me!
@NaturalHate I'm no expert on the hardware side, but from your error message the architecture is gfx1032. Even though you use a 6650 XT and I'm using a 6600 XT, they might share the same "series" from a software perspective. Maybe that works... doesn't hurt to try, right?
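For reference, the arch-to-override mapping suggested in this thread can be sketched as a tiny helper (the function name is invented for illustration; it deliberately covers only RDNA 1/2, since the RDNA 3 ISAs were noted earlier in the thread as not interchangeable):

```shell
# suggest_override: map an RDNA 1/2 variant ISA to the HSA_OVERRIDE_GFX_VERSION
# value used in this thread. Invented helper, not part of any ROCm tool.
suggest_override() {
  case "$1" in
    gfx101?) echo 10.1.0 ;;   # RDNA 1 variants, e.g. gfx1010, gfx1013
    gfx103?) echo 10.3.0 ;;   # RDNA 2 variants, e.g. gfx1031, gfx1032
    *) echo "no known-safe override for $1" >&2; return 1 ;;
  esac
}

suggest_override gfx1032   # 6600/6650 XT; prints 10.3.0
```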
@NaturalHate https://github.com/LostRuins/koboldcpp/issues/441 gfx1032_none_lazy.zip
He gave me this file on koboldcpp and it works; you can try it, since it's the same gfx1032 platform. AMD should be embedding it, since it's just 1.8 MB :(
@wayneyaoo you are welcome. I dug into this a lot and figured I should save others time; this issue is really frustrating.
Thanks again for bringing this issue to our attention. We noticed that there hasn't been any activity on this issue for a while. To keep our issue tracker clean and focused on active matters, we will be closing this issue if there is no further activity within the next week.
If you still require assistance or believe this issue needs to remain open, please provide any additional information or updates at your earliest convenience. I suggest opening related issues in ROCm/ROCm, as they should be directed toward general hardware compatibility and support.
Thank you for your understanding and cooperation.
@mahmoodw, thanks for keeping an eye on stale issues. I think this one is just waiting for ROCm 6.2, but that doesn't mean the issue no longer exists.
On NixOS 23.11 I was able to use the workaround of building with -DTensile_SEPARATE_ARCHITECTURES=OFF and -DTensile_LAZY_LIBRARY_LOADING=OFF to support gfx1010.
But with the upgrade to NixOS 24.05 and ROCm 6.0.2, this doesn't help anymore.
Right now I have no way to use my GPU with recent Ollama, as it requires ROCm v6.
No, I can send it to you. If you use an RX 6600, a lot of people have already built it. Just copy-paste and it runs.
Could you please help me? My GPU is a 6600!
Having an identical issue with my RX 5700 XT: rocBLAS error: Cannot read /home/extocine/sd-scripts/venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1010
@NaturalHate LostRuins/koboldcpp#441 gfx1032_none_lazy.zip [...]
My card is an RX 6600; using koboldcpp-rocm (https://github.com/YellowRoseCx/koboldcpp-rocm) worked. If you want the TensileLibrary, you can copy it from the zip file. @younijia
You can use the same method too, since the RX 6600 XT uses the same arch. @Extocine
To add on to @mahmoodw's statement, I'd like to request that those of you experiencing a similar issue to the original reporter please open a new issue rather than adding onto this one. There are too many individual commenters reporting similar issues on this ticket alone, with a number of varying workloads and unsupported/supported hardware.
We will be happy to help you resolve the issues you're encountering, but managing several threads of conversation to solve what might be multiple unrelated issues is untenable here. I will be closing this one.
Thank you for your understanding.
As a workaround, I would recommend setting the environment variable export HSA_OVERRIDE_GFX_VERSION=10.3.0. [...]
How can I do this? Where do you put this command? Thanks for the help...