Tensile won't produce backend libraries for archs without optimized logic files when using --separate-architectures
Issue
Tensile won't produce backend libraries for archs without optimized logic files when using --separate-architectures.
Description
According to https://github.com/ROCmSoftwarePlatform/Tensile/issues/1165#issuecomment-1094556880, "gfx1010 has been enabled by default in rocBLAS builds since ROCm 4.3.0." However, since rocBLAS does not have optimized logic files for Navi 10, no library is produced for gfx1010.
$ drun --rm rocm/dev-ubuntu-22.04:5.6-complete
root@ftl:/# ls -1 /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx*
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat
Expected
Tensile should produce libraries for all requested architectures, using the fallback logic files for archs missing optimized logic files.
Workaround
Building rocBLAS with --merge-architectures --no-lazy-library-loading seems to avoid the issue.
Patch
https://github.com/ulyssesrr/docker-rocm-xtra/blob/3be41a9d79ff4f4324f3f34383b2282529c0c4b7/rocm-xtra-builder-rocblas/patches/Tensile-fix-fallback-arch-build.patch
Although this is probably not the right place, I really needed to say thank you! I've struggled with this basically since my card was released, and I was finally able to fix it because of you.
Doing compute stuff is just a nightmare with AMD, really.
#1862 has an updated version of this patch for ROCm >=5.5.
This change triggered a failure in the rocBLAS tests. We cannot add this change to our release until we solve the issue.
I actually have a fix for a test failure that I just found today, but could I get a failure log to make sure it's the same failure I'm seeing?
command line: ./rocblas-test --gtest_output=xml --gtest_color=yes --gtest_filter=*quick*:*pre_checkin*-*known_bug*
Error:
[----------] 32 tests from _/gemm_ex_get_solutions
/var/jenkins_home/workspace/eckin_rocBLAS-internal_develop_2/z344iq86D/rocblas/clients/gtest/../include/blas_ex/testing_gemm_ex_get_solutions.hpp:152: Failure
Value of: status_match(rocblas_status_success, status_)
Actual: false (got rocblas_status_invalid_value instead of rocblas_status_success)
Expected: true
[ FAILED ] _/gemm_ex_get_solutions.blas3_tensile/pre_checkin_gemm_ex_get_solutions_f16_rf16_rf16_rf16_rf32_r_CN_250_250_250_1_250_250_1_250_250, where GetParam() = { function: "gemm_ex_get_solutions", name: "gemm_ex_get_solutions", category: "pre_checkin", known_bug_platforms: "", beta: 1.0, stride_a: 62500, stride_b: 62500, stride_c: 62500, stride_d: 62500, M: 250, N: 250, K: 250, lda: 250, ldb: 250, ldc: 250, ldd: 250, a_type: f16_r, b_type: f16_r, c_type: f16_r, d_type: f16_r, composite_compute_type: invalid, initialization: rand_int, gpu_arch: "", flush_batch_count: 1, transA: 'C', transB: 'N' }
(6 ms)
We are investigating the issue now, but have not found the cause yet.
As far as I have tried, this failure does not happen if I revert the fallback change.
I'm building rocBLAS locally to test. I've been working on ISA compatibility improvements in rocBLAS, so my local copy has some modifications. With my current modifications my gfx1032 GPU is passing the test you mentioned, so I'm stashing my changes and building the develop branch right now to check if I can reproduce this failure.
While it's building, could you change the if statement at https://github.com/ROCm/rocBLAS/blob/5211f0dca313c56c2163b8602581242c8cb608f1/library/src/tensile_host.cpp#L991C1-L992C43 from
if(library)
*library = host.get_library();
to
if(adapter)
*library = host.get_library();
and see if you get a segfault (SIGSEGV)?
Actually, I could not reproduce the failure with my local rocBLAS build, but it fails in our CI environment. I do not quite understand your change. Can we ignore the library == NULL case?
What GPU arch does the CI environment have?
Can we ignore the library == NULL case?
No, sometimes library is NULL when execution reaches this if statement. I don't know why library becomes NULL, but it has caused several segfaults for me, so in my local changes I had to check whether library is NULL and assign host.get_library() to it if so.
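For illustration, here is a minimal standalone sketch of that defensive check. The Library and Host types and the get_library() wrapper here are hypothetical stand-ins, not the real tensile_host.cpp code; only the shape of the guarded assignment comes from the discussion above.
#include <iostream>
#include <memory>

struct Library { int dummy = 0; };

struct Host
{
    std::shared_ptr<Library> cached = std::make_shared<Library>();
    std::shared_ptr<Library> get_library() const { return cached; }
};

// Mirrors the shape of get_library_and_adapter(): `library` is an
// optional out-parameter. The guarded assignment is the workaround:
// repopulate the library rather than assuming that a found adapter
// implies the library was already initialized.
void get_library(Host& host, std::shared_ptr<Library>* library)
{
    if(library && !*library)
        *library = host.get_library();
}

int main()
{
    Host host;
    std::shared_ptr<Library> lib; // starts out NULL, like the segfaulting case
    get_library(host, &lib);
    std::cout << (lib ? "library populated\n" : "library still NULL\n");
}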
The failure above is on gfx1101. I do not have a gfx1032 environment.
To clarify, above that line there's a comment:
// If an adapter is found, it is assumed that the library is initialized
If "the library" refers to the library variable, then this doesn't always hold. I have logged the values of adapter and library, and sometimes adapter is non-NULL while library is NULL, which then causes segfaults in runContractionProblem.
Also, does it only fail on Level-3, or are there failures with Level-2 and Level-1 operations as well?
I am not familiar with the rocBLAS side, but if get_library_and_adapter() is called from rocblas_initialize(), library seems to be NULL since no library is specified there.
The failure above is on gfx1101. I do not have a gfx1032 environment.
I'm confused as to why my change would affect gfx1101, as gfx1101 should have optimized logic files, so fallback libraries shouldn't even be compiled. Does the log contain lines like Using fallback for arch: <arch>?
If we do not specify the -a option when building rocBLAS, rocBLAS picks up the Tensile libraries for all architectures, including the gfx1010 and gfx1012 ones added by the fallback change. For some reason this affects solution selection (which should not happen).
That is my guess; we still do not understand why it fails.
rocBLAS is still compiling. I will report back when I get to run the tests and reproduce the failure.
A SIGSEGV was triggered; let me debug what went wrong.
Edit: the exact failure also reproduced.
rocblas_gemm_batched_ex_get_solutions is doing something weird. For a Contraction_l_Ailk_Bjlk_Cijk_Dijk problem, it returned a solution index corresponding to Cijk_Alik_Bljk_HB_MT128x64x16_SN_AMAS3_BL1_BS1_EPS0_GLVWA8_GLVWB8_GRVW8_GSU1_GSUASB_ISA1030_IU1_K1_KLA_LDL1_LRVW8_MMFGLC_NLCA1_NLCB1_PGR1_PLR1_SIA1_SU0_SUM0_SUS0_SVW4_TT8_8_USFGROn1_VAW2_VSn1_VW8_VWB8_WS32_WG16_8_1_WGM1. Other similar problems returned indices in the 3000~5000 range, while for this particular problem solutions with indices 1 and 2 are returned. I will do more investigation over the weekend.
Putting some investigation notes here. I will spend more time digging through this later this week, but if anyone wants to investigate, feel free to build on top of these.
Through my tracing I found that solution selection doesn't seem to be affected. If you print out every single solution found in getAllSolutions, you will see that they all correspond to the correct solution. The problem is that, despite the library object being the same (verified by printing the address of library), library->getSolutionByIndex(index) with the same index returns different solutions in runContractionProblem and getAllSolutions. This is what baffled me: if you call library->getSolutionByIndex(2) in runContractionProblem and in getAllSolutions, they return different solutions despite being called on the same library object. I will investigate further later this week to see why this happens.
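For anyone trying to reproduce this, here is a hedged sketch of the kind of trace described above. The dumpSolution helper is hypothetical; getSolutionByIndex() and the name() accessor are assumed from this thread, so adjust to the real signatures in your tree.
#include <cstdio>

// Hypothetical tracing helper, not actual rocBLAS code: log what a given
// solution index resolves to at each call site. The template sidesteps the
// exact library pointer type; a shared_ptr-like `library` is assumed.
template <typename LibraryPtr>
void dumpSolution(const char* callsite, const LibraryPtr& library, int index)
{
    auto solution = library->getSolutionByIndex(index);
    std::printf("[%s] library=%p index=%d -> %s\n",
                callsite,
                static_cast<const void*>(library.get()),
                index,
                solution ? solution->name().c_str() : "<null>");
}

// Call dumpSolution("runContractionProblem", library, 2) and
// dumpSolution("getAllSolutions", library, 2); identical library addresses
// should yield identical names if the index lookup is consistent.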
Hello guys, first of all thanks for all the hard work you're doing to make the RX 5700 work. I'm just a hobbyist and not even close to your league of expertise. However, I would like to ask: while the problem is being solved, can I do something similar to
export ROCM_PATH=/opt/rocm
export HCC_AMDGPU_TARGET=gfx803
export HSA_OVERRIDE_GFX_VERSION=8.3.0
to make my RX 5700 work with llama.cpp? I want to ask first before doing the test and breaking my Ubuntu, lol.
Hello guys, first of all thanks for all the hard work you're doing to make the RX 5700 work. I'm just a hobbyist and not even close to your league of expertise. However, I would like to ask: while the problem is being solved, can I do something similar to
export ROCM_PATH=/opt/rocm export HCC_AMDGPU_TARGET=gfx803 export HSA_OVERRIDE_GFX_VERSION=8.3.0
to make my RX 5700 work with llama.cpp? I want to ask first before doing the test and breaking my Ubuntu, lol.
I'm not sure anymore if that is all it takes, because I fiddled a LOT to make my now-replaced RX 5700 XT work, but I've used these settings:
export PATH="/opt/rocm/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/rocm/lib/"
export HSA_OVERRIDE_GFX_VERSION=10.1.0
export HCC_AMDGPU_TARGET=gfx1010
In some places I've also used
export PATH="/opt/rocm/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/rocm/lib/"
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export HCC_AMDGPU_TARGET=gfx1030
ROCM_VERSION=5.6
but in any case I did not use the overrides that you used, with those low version numbers.
That shouldn't break anything, as it only affects things that make use of ROCm; worst case, it won't work. Sorry that I can't give you more hints, but I have forgotten most of the things I tried to make it work, and a lot of it was blindly poking at things anyway. The Tensile libraries this issue is about definitely did help at some point, though.
Edit: For LLMs I ended up mainly using https://github.com/YellowRoseCx/koboldcpp-rocm for loading the LLMs, and then e.g. SillyTavern as the frontend. I had quite a few issues with oobabooga back then, but those might have been resolved in the meantime.
The way that separate architectures and lazy loading were implemented was really not ideal. The complexity of building all the necessary data structures during initialization should be pushed to build time, and there should be no meaningful logic executing during initialization. The initialization could be nothing more than a read or mmap, and then there would be no need to split the .dat files at all, because loading would be so fast that you could read everything in a fraction of the time the current implementation takes. Or you could read the parts you needed on demand. There has been so much complexity introduced into Tensile just to avoid fixing the underlying data representation on disk.
The use of an unindexed key-value pair format like msgpack is the underlying cause of these bugs, because the slow conversion of that data into Tensile's in-memory format drives the introduction of complicated logic that tries to be clever about loading. If a more appropriate data format were used, there would be no need to be clever.
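A minimal sketch of that idea, assuming a hypothetical flat on-disk layout with a fixed-size index up front. This is not Tensile's actual .dat format; all names here are made up for illustration of how cheap initialization becomes when the file carries its own index.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct IndexEntry    // hypothetical fixed-size index record
{
    uint64_t offset; // byte offset of the entry's payload in the file
    uint64_t length; // payload length in bytes
};

struct FileHeader    // hypothetical header at offset 0
{
    uint64_t   entryCount;
    IndexEntry entries[1]; // index records follow the count
};

int main(int argc, char** argv)
{
    if(argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    if(fd < 0)
        return 1;
    struct stat st;
    if(fstat(fd, &st) != 0)
        return 1;

    // The entire "load" step: map the file and reinterpret the header.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if(base == MAP_FAILED)
        return 1;
    auto* hdr = static_cast<const FileHeader*>(base);

    // Payloads are touched lazily; pages fault in only when accessed,
    // so there would be no need to split the file per architecture.
    std::printf("mapped %lld bytes, %llu index entries\n",
                static_cast<long long>(st.st_size),
                static_cast<unsigned long long>(hdr->entryCount));

    munmap(base, st.st_size);
    close(fd);
    return 0;
}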
This is not the most helpful comment of mine, because I presume folks here want this bug fixed in less time than it would take to rearchitect the Tensile on-disk data format. The separate-architectures and lazy-loading features just frustrate me. I was there when those features were designed and implemented (by a very close friend of mine who is no longer at AMD), and I told the author this back then too.
Redesign the on-disk data format and you will:
- Reduce the number of bugs in Tensile
- Improve Tensile initialization performance
- Reduce the size and complexity of the Tensile codebase
to make my RX 5700 work with llama.cpp? I want to ask first before doing the test and breaking my Ubuntu, lol.
Use librocblas-dev and libhipblas-dev from Ubuntu 23.10 or later. Here's an example of how to build and run llama-cpp for any discrete Vega, RDNA 1, RDNA 2, CDNA 1 or CDNA 2 GPU in a docker container: https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea33
It might also work on Polaris, but it might not (since the software for that architecture has a lot of bugs).
@smirgol One of the rocBLAS contributors says that we can compile llama.cpp with hipBLAS and mix old and new GPUs:
https://github.com/ROCm/rocBLAS/pull/1251#issuecomment-1936685074
lol, I just noticed it's cgmb. Sup bro, I just DM'd you in the other repo chat, lol.
This change triggered a failure in the rocBLAS tests. We cannot add this change to our release until we solve the issue.
The fix for the gemm_ex_get_solutions issue has been merged into the Tensile and rocBLAS develop branches. We might be able to try the previous gfx1010 fix on top of the latest Tensile develop. I do not have a gfx1010 environment to confirm the fix.
@nakajee I think at the current stage we don't have to test on gfx1010 yet. The first step is to confirm that, when compiling any already supported arch together with gfx1010 (such as AMDGPU_TARGETS="gfx1010;gfx1030"), all tests pass, as per the directions specified in #1897. Currently I cannot build rocBLAS at head (5937a87d) with ROCm 6.0, because I get the following error message:
# Tensile Create Library
Tensile::WARNING: Did not detect SupportedISA: [(8, 0, 3), (9, 0, 0), (9, 0, 6), (9, 0, 8), (9, 0, 10), (9, 4, 0), (9, 4, 1), (9, 4, 2), (10, 1, 0), (10, 1, 1), (10, 1, 2), (10, 3, 0), (10, 3, 1), (11, 0, 0), (11, 0, 1), (11, 0, 2)]; cannot benchmark assembly kernels.
# Found hipcc version 6.0.0-0
Tensile::FATAL: Cached asm caps differ from derived asm caps for (9, 0, 10)
CMake Error at build/virtualenv/cmake/TensileConfig.cmake:277 (message):
Error creating Tensile library: 255
Call Stack (most recent call first):
library/src/CMakeLists.txt:74 (TensileCreateLibraryFiles)
If you can help test on your environment that'd be great.