AMDGPU.jl

Update ROCSparse for Julia v1.10

Open amontoison opened this issue 1 year ago • 13 comments

@pxl-th @dkarrasch

amontoison avatar Apr 02 '24 04:04 amontoison

FYI, I have temporarily disabled the rocSPARSE tests since they were crashing my Navi 3 in CI and I didn't have time to investigate the root cause. Also, for some reason the rocBLAS tests segfault on ROCm 5.6 (@luraess).

@amontoison have you run the tests locally? We can of course re-enable the rocSPARSE tests, but I'm not sure they will run successfully.

pxl-th avatar Apr 02 '24 16:04 pxl-th

@pxl-th The tests passed on our cluster.

amontoison avatar Apr 06 '24 23:04 amontoison

@amontoison, I've sent you an invite so that you can merge PRs. I currently don't have access to AMD GPUs and am therefore not working on AMDGPU.jl. So feel free to merge PRs once they are in a good state (although I'd recommend merging them only if CI is green).

pxl-th avatar Apr 07 '24 18:04 pxl-th

I can try running the tests on my system @pxl-th (now with ROCm 6.0.2 on Navi 3). On which system did they pass, @amontoison?

luraess avatar Apr 07 '24 18:04 luraess

> I can try running the tests on my system @pxl-th (now with ROCm 6.0.2 on Navi 3). On which system did they pass @amontoison ?

It was on Frontier. I need to check the ROCm version with @michel2323.

amontoison avatar Apr 07 '24 20:04 amontoison

It was ROCm 6.0 on an MI250.

michel2323 avatar Apr 08 '24 03:04 michel2323

Running the ROCSparse tests on Navi 3 (gfx1101 - Radeon RX 7800 XT) with ROCm 6.0.2, I am getting test errors (along with an error in ROCBlas); see test_log_out.txt.

luraess avatar Apr 08 '24 07:04 luraess

@luraess Can you check whether the rocSPARSE tests are failing on the master branch?

Can you also give more details about the errors? I suspect that something is not correctly dispatched, because all the unit tests for mv! and mm! passed.

amontoison avatar Apr 09 '24 07:04 amontoison

Running only the rocSparse tests on master, I am getting some warnings but no errors. The BLAS test is still failing. See rocSaprse_out.txt.

luraess avatar Apr 09 '24 10:04 luraess

@luraess Can you just run include("test/rocarray/blas.jl")?

amontoison avatar Apr 14 '24 01:04 amontoison

> @luraess Can you just run include("test/rocarray/blas.jl")?

Yes, here is the output of running the test: test_out.txt

luraess avatar Apr 14 '24 12:04 luraess

Thanks @luraess! But you need to import additional packages to isolate the issue:

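using AMDGPU
using LinearAlgebra

# Load the shared GPUArrays test suite, which defines the TestSuite module.
import GPUArrays
include(joinpath(pkgdir(GPUArrays), "test", "testsuite.jl"))

# Helper used by the tests: compare f on CPU arrays against f on ROCArrays.
testf(f, xs...; kwargs...) =
    TestSuite.compare(f, AMDGPU.ROCArray, xs...; kwargs...)

# Run only the BLAS tests (relative path, so start from the AMDGPU.jl repository root).
include("test/rocarray/blas.jl")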

amontoison avatar Apr 14 '24 16:04 amontoison

Thanks for the hints. Following them, I am getting a segfault on Navi 3 with ROCm 6.0.2 (blas_navi3.txt) and a bunch of errors on MI250x with ROCm 5.3.3 on LUMI (blas_lumi.txt).

luraess avatar Apr 14 '24 22:04 luraess