
Trying to import AMDGPU fails with (an LLVM?) error

Open pulkin opened this issue 1 year ago • 11 comments

: CommandLine Error: Option 'disassemble' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Not sure where to start with this. Platform is Fedora 39.

> dnf list installed "rocm-*"
Installed Packages
rocm-clinfo.x86_64          5.7.1-1.fc39   @updates
rocm-comgr.x86_64           17.0-3.fc39    @updates
rocm-comgr-devel.x86_64     17.0-3.fc39    @updates
rocm-device-libs.x86_64     17.1-1.fc39    @updates
rocm-hip.x86_64             5.7.1-1.fc39   @updates
rocm-hip-devel.x86_64       5.7.1-1.fc39   @updates
rocm-opencl.x86_64          5.7.1-1.fc39   @updates
rocm-runtime.x86_64         5.7.1-1.fc39   @updates
rocm-runtime-devel.x86_64   5.7.1-1.fc39   @updates

I traced it down to this:

julia> import Libdl

julia> Libdl.dlpath("libamdhip64")
"/usr/bin/../lib64/julia/../libamdhip64.so"

julia> Libdl.dlpath("libamdhip64")
: CommandLine Error: Option 'disassemble' registered more than once!

pulkin avatar Jan 06 '24 14:01 pulkin

My guess is that HIP is built with statically linked LLVM on Fedora, when it should probably be linked dynamically.

pxl-th avatar Jan 07 '24 11:01 pxl-th

You can check if this is true by dev'ing the AMDGPU package with ]dev AMDGPU. Then, in the ~/.julia/dev/AMDGPU directory, create a LocalPreferences.toml file with the following content:

[AMDGPU]
use_artifacts = true

Then try importing it again. The artifacts don't include all of the libraries and ship an older ROCm version, but at least you'll be able to confirm whether dynamically linked LLVM is what you need.
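For reference, the two steps above (dev the package, then create the preferences file next to its Project.toml) can be sketched in a shell. This is a sketch only: it assumes ]dev AMDGPU placed the checkout in the default ~/.julia/dev/AMDGPU location.

```shell
# Assumes the default dev location used by `]dev AMDGPU`.
DEVDIR="${HOME}/.julia/dev/AMDGPU"
mkdir -p "${DEVDIR}"

# Write the preference that forces artifact-provided ROCm libraries.
cat > "${DEVDIR}/LocalPreferences.toml" <<'EOF'
[AMDGPU]
use_artifacts = true
EOF

# Show what was written.
cat "${DEVDIR}/LocalPreferences.toml"
```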

pxl-th avatar Jan 07 '24 11:01 pxl-th

Thanks. I tried creating ~/.julia/dev/AMDGPU/LocalPreferences.toml with that content, but the error is the same. The stack trace points to ~/.julia/dev/AMDGPU, so it seems like the right place, but there is no effect. If I corrupt the LocalPreferences.toml layout, that is ignored as well.

pulkin avatar Jan 07 '24 16:01 pulkin

Managed to force artifacts through JULIA_AMDGPU_DISABLE_ARTIFACTS=false julia. It imports now: thank you.

pulkin avatar Jan 07 '24 16:01 pulkin

> Thanks. Tried creating ~/.julia/dev/AMDGPU/LocalPreferences.toml with the content but the error is the same. The stack trace points to ~/.julia/dev/AMDGPU so it seems like the right place to do it but there is no effect. If I corrupt LocalPreferences.toml layout it will also ignore that.

Forgot to mention that you then need to launch Julia with the project set to the dev'ed AMDGPU folder:

julia --project=~/.julia/dev/AMDGPU

Otherwise, you should put that file where your current project is (and adjust the project path accordingly). But for global projects it is better to use the environment variable, yes.
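For completeness, the environment-variable route mentioned above looks like this in a POSIX shell. The variable name comes from the thread; note the slightly confusing polarity: setting DISABLE_ARTIFACTS to false means "do use the artifacts".

```shell
# Force artifact-provided ROCm libraries for subsequent Julia sessions.
export JULIA_AMDGPU_DISABLE_ARTIFACTS=false
printenv JULIA_AMDGPU_DISABLE_ARTIFACTS

# Or as a one-off invocation (as used in the thread):
#   JULIA_AMDGPU_DISABLE_ARTIFACTS=false julia
```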

The downside of artifacts is that only Julia kernels are available, so functionality backed by ROCm libraries, such as matmul via rocBLAS, won't work (among other things).

You're welcome :)

pxl-th avatar Jan 07 '24 17:01 pxl-th

Most tests pass: 263 errored, 19 broken. Back to the original issue: my system library does have libLLVM-17.so among its dependencies, though

> ldd /usr/bin/../lib64/julia/../libamdhip64.so
	linux-vdso.so.1 (0x00007ffd6f2f4000)
	libamd_comgr.so.2 => /lib64/libamd_comgr.so.2 (0x00007f7b42800000)
	libhsa-runtime64.so.1 => /lib64/libhsa-runtime64.so.1 (0x00007f7b42400000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f7b448f8000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f7b42000000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f7b4271f000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7b448d4000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f7b41e1e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7b44924000)
	liblldELF.so.17 => /lib64/liblldELF.so.17 (0x00007f7b41a00000)
	liblldCommon.so.17 => /lib64/liblldCommon.so.17 (0x00007f7b431d4000)
	libclang-cpp.so.17 => /lib64/libclang-cpp.so.17 (0x00007f7b3d800000)
	libLLVM-17.so => /lib64/libLLVM-17.so (0x00007f7b36200000)
	libhsakmt.so.1 => /lib64/libhsakmt.so.1 (0x00007f7b431a6000)
	libelf.so.1 => /lib64/libelf.so.1 (0x00007f7b43189000)
	libdrm.so.2 => /lib64/libdrm.so.2 (0x00007f7b43172000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f7b43158000)
	libffi.so.8 => /lib64/libffi.so.8 (0x00007f7b43148000)
	libedit.so.0 => /lib64/libedit.so.0 (0x00007f7b426e2000)
	libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f7b426ad000)
	libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00007f7b4313b000)
	libzstd.so.1 => /lib64/libzstd.so.1 (0x00007f7b42344000)
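The check above can be scripted: a second copy of LLVM loaded into a process that already carries Julia's bundled LLVM is the likely source of the duplicate option registration, so it's useful to test whether a given shared object drags one in. A small sketch, assuming a Linux system with ldd; the example call uses /bin/sh only to show the output format, so substitute your actual libamdhip64.so path:

```shell
# Report whether a binary/shared object has a dynamic libLLVM dependency.
check_llvm_link() {
  if ldd "$1" 2>/dev/null | grep -q 'libLLVM'; then
    echo "$1: links libLLVM dynamically"
  else
    echo "$1: no dynamic libLLVM dependency found"
  fi
}

# Substitute your library path, e.g. /lib64/libamdhip64.so:
check_llvm_link /bin/sh
```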

To give you some context: I am looking into porting this small CUDA PoC

https://github.com/jinwen-yang/cuPDLP.jl/tree/master

to run on my 6600. It does not look like there is a lot to port (it uses zeros, norm, dot, and sparse arrays). But maybe sparse arrays, or something else, are out of the question until I resolve the original question?

pulkin avatar Jan 07 '24 20:01 pulkin

Mine looks like this:

$ ldd /opt/rocm/lib/libamdhip64.so
	linux-vdso.so.1 (0x00007ffec6387000)
	libamd_comgr.so.2 => /opt/rocm/lib/libamd_comgr.so.2 (0x00007f6a7d600000)
	libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1 (0x00007f6a7d200000)
	libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f6a87beb000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6a7ce00000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6a87b04000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6a87ae2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6a7ca00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6a87c0f000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f6a87ac6000)
	libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f6a85fce000)
	libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007f6a85fb0000)
	libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007f6a85f96000)
	libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007f6a87ab6000)

Not really sure what to suggest besides recompiling HIP without statically linking LLVM, but then you'd need to change this line to point to your .so. And I'm not sure what else links it: MIOpen, for example, uses LLVM to JIT-compile some of its kernels at runtime.

As an alternative, I prefer to get ROCm from the official install script, which links LLVM dynamically, but it does not support Fedora.

pxl-th avatar Jan 08 '24 08:01 pxl-th