Tensile won't produce backend libraries for archs without optimized logic files when using --separate-architectures
Issue
Tensile won't produce backend libraries for archs without optimized logic files when using --separate-architectures.
Description
According to https://github.com/ROCmSoftwarePlatform/Tensile/issues/1165#issuecomment-1094556880, "gfx1010 has been enabled by default in rocBLAS builds since ROCm 4.3.0." However, since rocBLAS does not have optimized logic files for Navi 10, no library is produced for gfx1010.
$ drun --rm rocm/dev-ubuntu-22.04:5.6-complete
root@ftl:/# ls -1 /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx*
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat
Expected
Tensile should produce libraries for all requested architectures, using the fallback logic files for archs missing optimized logic files.
Workaround
Building rocBLAS with --merge-architectures --no-lazy-library-loading seems to avoid the issue.
Patch
https://github.com/ulyssesrr/docker-rocm-xtra/blob/3be41a9d79ff4f4324f3f34383b2282529c0c4b7/rocm-xtra-builder-rocblas/patches/Tensile-fix-fallback-arch-build.patch
Although this is probably not the right place, I really needed to say thank you! I've struggled with this basically since my card was released, and I was finally able to fix it because of you.
Doing compute stuff is just a nightmare with AMD, really.
#1862 has an updated version of this patch for ROCm >=5.5.
This change triggered a failure in the rocBLAS tests. We cannot add this change to our release until we solve the issue.
I actually have a fix for a test failure that I just found today, but could I get a failure log to make sure it's the same failure I'm seeing?
command line: ./rocblas-test --gtest_output=xml --gtest_color=yes --gtest_filter=*quick*:*pre_checkin*-*known_bug*
Error:
[----------] 32 tests from _/gemm_ex_get_solutions
/var/jenkins_home/workspace/eckin_rocBLAS-internal_develop_2/z344iq86D/rocblas/clients/gtest/../include/blas_ex/testing_gemm_ex_get_solutions.hpp:152: Failure
Value of: status_match(rocblas_status_success, status_)
Actual: false (got rocblas_status_invalid_value instead of rocblas_status_success)
Expected: true
[ FAILED ] _/gemm_ex_get_solutions.blas3_tensile/pre_checkin_gemm_ex_get_solutions_f16_rf16_rf16_rf16_rf32_r_CN_250_250_250_1_250_250_1_250_250, where GetParam() = { function: "gemm_ex_get_solutions", name: "gemm_ex_get_solutions", category: "pre_checkin", known_bug_platforms: "", beta: 1.0, stride_a: 62500, stride_b: 62500, stride_c: 62500, stride_d: 62500, M: 250, N: 250, K: 250, lda: 250, ldb: 250, ldc: 250, ldd: 250, a_type: f16_r, b_type: f16_r, c_type: f16_r, d_type: f16_r, composite_compute_type: invalid, initialization: rand_int, gpu_arch: "", flush_batch_count: 1, transA: 'C', transB: 'N' }
(6 ms)
We are investigating the issue now, but have not found the cause yet.
As far as I have tried, this failure does not happen if I revert the fallback change.
I'm building rocBLAS locally to test. I've been working on ISA compatibility improvements in rocBLAS, so my local copy has some modifications. With my current modifications my gfx1032 GPU is passing the test you mentioned, so I'm stashing my changes and building the develop branch right now to check if I can reproduce this failure.
While it's building, could you change the if statement at https://github.com/ROCm/rocBLAS/blob/5211f0dca313c56c2163b8602581242c8cb608f1/library/src/tensile_host.cpp#L991C1-L992C43 from
if(library)
*library = host.get_library();
to
if(adapter)
*library = host.get_library();
and see if you get a segfault (SIGSEGV)?
Actually, I could not reproduce the failure with my local rocBLAS build, but it fails in our CI environment. I do not quite understand your change. Can we ignore the library == NULL case?
What GPU arch does the CI environment have?
Can we ignore the library == NULL case?
No, sometimes library is NULL when execution reaches this if statement. I don't know why library becomes NULL, but it has caused several segfaults for me, so in my local changes I had to check whether library is NULL and assign host.get_library() to it if so.
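For illustration, here is a minimal standalone sketch of that defensive check. The Library and Host types and the get_library() wrapper here are hypothetical stand-ins, not the real tensile_host.cpp code; only the shape of the guarded assignment comes from the discussion above.
#include <iostream>
#include <memory>

struct Library { int dummy = 0; };

struct Host
{
    std::shared_ptr<Library> cached = std::make_shared<Library>();
    std::shared_ptr<Library> get_library() const { return cached; }
};

// Mirrors the shape of get_library_and_adapter(): `library` is an
// optional out-parameter. The guarded assignment is the workaround:
// repopulate the library rather than assuming that a found adapter
// implies the library was already initialized.
void get_library(Host& host, std::shared_ptr<Library>* library)
{
    if(library && !*library)
        *library = host.get_library();
}

int main()
{
    Host host;
    std::shared_ptr<Library> lib; // starts out NULL, like the segfaulting case
    get_library(host, &lib);
    std::cout << (lib ? "library populated\n" : "library still NULL\n");
}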
The failure above is on gfx1101. I do not have a gfx1032 environment.
To clarify, above that line there's a comment:
// If an adapter is found, it is assumed that the library is initialized
If "the library" refers to the library variable, then this doesn't always hold. I have logged the values of adapter and library, and sometimes adapter is non-NULL while library is NULL, which then causes segfaults in runContractionProblem.
Also, does it only fail on Level-3, or are there failures with Level-2 and Level-1 operations as well?
I am not familiar with the rocBLAS side, but if get_library_and_adapter() is called from rocblas_initialize(), library seems to be NULL since no library is specified there.
The failure above is on gfx1101. I do not have a gfx1032 environment.
I'm confused as to why my change would affect gfx1101, as gfx1101 should have optimized logic files, so fallback libraries shouldn't even be compiled. Does the log contain lines like Using fallback for arch: <arch>?
If we do not specify the -a option when building rocBLAS, rocBLAS picks up the Tensile libraries for all architectures, including the gfx1010 and gfx1012 ones added by the fallback change. For some reason this affects solution selection (which should not happen).
That is my guess; we still do not understand why it fails.
rocBLAS is still compiling. I will report back when I get to run the tests and reproduce the failure.
A SIGSEGV was triggered; let me debug what went wrong.
Edit: the exact failure also reproduced.
rocblas_gemm_batched_ex_get_solutions is doing something weird. For a Contraction_l_Ailk_Bjlk_Cijk_Dijk problem, it returned a solution index corresponding to Cijk_Alik_Bljk_HB_MT128x64x16_SN_AMAS3_BL1_BS1_EPS0_GLVWA8_GLVWB8_GRVW8_GSU1_GSUASB_ISA1030_IU1_K1_KLA_LDL1_LRVW8_MMFGLC_NLCA1_NLCB1_PGR1_PLR1_SIA1_SU0_SUM0_SUS0_SVW4_TT8_8_USFGROn1_VAW2_VSn1_VW8_VWB8_WS32_WG16_8_1_WGM1. Other similar problems returned indices in the 3000~5000 range, while for this particular problem solutions with indices 1 and 2 are returned. I will do more investigation over the weekend.
Putting some investigation notes here. I will spend more time digging through this later this week, but if anyone wants to investigate, feel free to build on top of these.
Through my tracing I found that solution selection doesn't seem to be affected. If you print out every single solution found in getAllSolutions, you will see that they all correspond to the correct solution. The problem is that, despite the library object being the same (verified by printing the address of library), library->getSolutionByIndex(index) with the same index returns different solutions in runContractionProblem and getAllSolutions. This is what baffled me: if you call library->getSolutionByIndex(2) in runContractionProblem and in getAllSolutions, they return different solutions despite being called on the same library object. I will investigate further later this week to see why this happens.
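For anyone trying to reproduce this, here is a hedged sketch of the kind of trace described above. The dumpSolution helper is hypothetical; getSolutionByIndex() and the name() accessor are assumed from this thread, so adjust to the real signatures in your tree.
#include <cstdio>

// Hypothetical tracing helper, not actual rocBLAS code: log what a given
// solution index resolves to at each call site. The template sidesteps the
// exact library pointer type; a shared_ptr-like `library` is assumed.
template <typename LibraryPtr>
void dumpSolution(const char* callsite, const LibraryPtr& library, int index)
{
    auto solution = library->getSolutionByIndex(index);
    std::printf("[%s] library=%p index=%d -> %s\n",
                callsite,
                static_cast<const void*>(library.get()),
                index,
                solution ? solution->name().c_str() : "<null>");
}

// Call dumpSolution("runContractionProblem", library, 2) and
// dumpSolution("getAllSolutions", library, 2); identical library addresses
// should yield identical names if the index lookup is consistent.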
Hello guys, first of all thanks for all the hard work you're doing to make the RX 5700 work. I'm just a hobbyist and not even close to your league of expertise. However, I would like to ask: while the problem is being solved, can I do something similar to
export ROCM_PATH=/opt/rocm
export HCC_AMDGPU_TARGET=gfx803
export HSA_OVERRIDE_GFX_VERSION=8.3.0
to make my RX 5700 work with llama.cpp? I want to ask first before doing the test and breaking my Ubuntu, lol.
Hello guys, first of all thanks for all the hard work you're doing to make the RX 5700 work. I'm just a hobbyist and not even close to your league of expertise. However, I would like to ask: while the problem is being solved, can I do something similar to
export ROCM_PATH=/opt/rocm export HCC_AMDGPU_TARGET=gfx803 export HSA_OVERRIDE_GFX_VERSION=8.3.0
to make my RX 5700 work with llama.cpp? I want to ask first before doing the test and breaking my Ubuntu, lol.
I'm not sure anymore if that is all it takes, because I fiddled a LOT to make my now-replaced RX 5700 XT work, but I've used these settings:
export PATH="/opt/rocm/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/rocm/lib/"
export HSA_OVERRIDE_GFX_VERSION=10.1.0
export HCC_AMDGPU_TARGET=gfx1010
In some places I've also used
export PATH="/opt/rocm/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/rocm/lib/"
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export HCC_AMDGPU_TARGET=gfx1030
ROCM_VERSION=5.6
but in any case I did not use the overrides that you used, with those low version numbers.
That shouldn't break anything, as it only affects things that make use of ROCm; worst case, it won't work. Sorry that I can't give you more hints, but I have forgotten most of the things I tried to make it work, and a lot of it was blindly poking at things anyway. The Tensile libraries this issue is about definitely did help at some point, though.
Edit: For LLMs I ended up mainly using https://github.com/YellowRoseCx/koboldcpp-rocm for loading the LLMs, and then e.g. SillyTavern as the frontend. I had quite a few issues with oobabooga back then, but those might have been resolved in the meantime.
The way that separate architectures and lazy loading were implemented was really not ideal. The complexity of building all the necessary data structures during initialization should be pushed to build time, and there should be no meaningful logic executing during initialization. The initialization could be nothing more than a read or mmap, and then there would be no need to split the .dat files at all, because loading would be so fast that you could read everything in a fraction of the time the current implementation takes. Or you could read the parts you needed on demand. There has been so much complexity introduced into Tensile just to avoid fixing the underlying data representation on disk.
The use of an unindexed key-value pair format like msgpack is the underlying cause of these bugs, because the slow conversion of that data into Tensile's in-memory format drives the introduction of complicated logic that tries to be clever about loading. If a more appropriate data format were used, there would be no need to be clever.
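A minimal sketch of that idea, assuming a hypothetical flat on-disk layout with a fixed-size index up front. This is not Tensile's actual .dat format; all names here are made up for illustration of how cheap initialization becomes when the file carries its own index.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct IndexEntry    // hypothetical fixed-size index record
{
    uint64_t offset; // byte offset of the entry's payload in the file
    uint64_t length; // payload length in bytes
};

struct FileHeader    // hypothetical header at offset 0
{
    uint64_t   entryCount;
    IndexEntry entries[1]; // index records follow the count
};

int main(int argc, char** argv)
{
    if(argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if(fd < 0)
        return 1;
    struct stat st;
    if(fstat(fd, &st) != 0)
        return 1;

    // The entire "load" step: map the file and reinterpret the header.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if(base == MAP_FAILED)
        return 1;
    auto* hdr = static_cast<const FileHeader*>(base);

    // Payloads are touched lazily; pages fault in only when accessed,
    // so there would be no need to split the file per architecture.
    std::printf("mapped %lld bytes, %llu index entries\n",
                static_cast<long long>(st.st_size),
                static_cast<unsigned long long>(hdr->entryCount));

    munmap(base, st.st_size);
    close(fd);
    return 0;
}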
This is not the most helpful comment of mine, because I presume folks here want this bug fixed in less time than it would take to rearchitect the Tensile on-disk data format. The separate-architectures and lazy-loading features just frustrate me. I was there when those features were designed and implemented (by a very close friend of mine who is no longer at AMD), and I told the author this back then too.
Redesign the on-disk data format and you will:
- Reduce the number of bugs in Tensile
- Improve Tensile initialization performance
- Reduce the size and complexity of the Tensile codebase
to make my RX 5700 work with llama.cpp? I want to ask first before doing the test and breaking my Ubuntu, lol.
Use librocblas-dev and libhipblas-dev from Ubuntu 23.10 or later. Here's an example of how to build and run llama-cpp for any discrete Vega, RDNA 1, RDNA 2, CDNA 1 or CDNA 2 GPU in a docker container: https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea33
It might also work on Polaris, but it might not (since the software for that architecture has a lot of bugs).
@smirgol One of the rocBLAS contributors says that we can compile llama.cpp with hipBLAS and mix old and new GPUs:
https://github.com/ROCm/rocBLAS/pull/1251#issuecomment-1936685074
lol, I just noticed it's cgmb. Sup bro, I just DM'd you in the other repo chat, lol.
This change triggered a failure in the rocBLAS tests. We cannot add this change to our release until we solve the issue.
The fix for the gemm_ex_get_solutions issue has been merged into the Tensile and rocBLAS develop branches. We might be able to try the previous gfx1010 fix on top of the latest Tensile develop. I do not have a gfx1010 environment to confirm the fix.
@nakajee I think at the current stage we don't have to test on gfx1010 yet. The first step is to confirm that, when compiling any already supported arch together with gfx1010 (such as AMDGPU_TARGETS="gfx1010;gfx1030"), all tests pass, as per the directions specified in #1897. Currently I cannot build rocBLAS at head (5937a87d) with ROCm 6.0, because I get the following error message:
# Tensile Create Library
Tensile::WARNING: Did not detect SupportedISA: [(8, 0, 3), (9, 0, 0), (9, 0, 6), (9, 0, 8), (9, 0, 10), (9, 4, 0), (9, 4, 1), (9, 4, 2), (10, 1, 0), (10, 1, 1), (10, 1, 2), (10, 3, 0), (10, 3, 1), (11, 0, 0), (11, 0, 1), (11, 0, 2)]; cannot benchmark assembly kernels.
# Found hipcc version 6.0.0-0
Tensile::FATAL: Cached asm caps differ from derived asm caps for (9, 0, 10)
CMake Error at build/virtualenv/cmake/TensileConfig.cmake:277 (message):
Error creating Tensile library: 255
Call Stack (most recent call first):
library/src/CMakeLists.txt:74 (TensileCreateLibraryFiles)
If you can help test on your environment that'd be great.