Broken tests on RX 6950 XT
Hi,
I have been having some issues with downstream packages that use AMDGPU.jl, so I was trying to track the problem down. I am running Manjaro with an RX 6950 XT and ROCm 6.1 from the Arch repositories, on Julia 1.10.4 from juliaup. When I run `Pkg.test("AMDGPU")`, I get the following output:
```
Test Summary: | Pass Fail Error Broken Total Time
AMDGPU | 13173 2 2 151 13328 10m11.5s
  test | 13173 2 2 151 13328
    test/core_tests.jl | 615 1 616
      core | 615 1 616 1m14.8s
        Functional | 2 2 0.1s
        HIPDevice | 8 8 0.0s
        ISA parsing | 10 10 0.0s
        Exception holder | None 2.1s
        Comparison | 3 3 0.0s
        Synchronization | 1 1 5.3s
        Trapping | 2 2 0.0s
        Base | 557 1 558 55.8s
          Specifying buffer type | 4 4 0.0s
          ones/zeros | 2 2 1.2s
          view | 10 10 1.7s
          resize! | 3 3 0.3s
          unsafe_wrap | 17 17 3.9s
          unsafe_free | None 0.0s
          accumulate | 25 25 6.4s
          Atomics | 1 1 0.3s
          Sorting | 384 384 31.2s
          Reverse kernel | 88 88 2.9s
          Selection | 3 3 1.6s
          Multi-GPU | 20 1 21 3.2s
            Device switching | 7 7 0.2s
            Arrays | 5 5 1.0s
            Copying | 1 1 0.8s
            Kernel | 1 1 2 1.0s
            Correctly switching HIP context | 6 6 0.3s
        broadcast | 18 18 6.4s
          Ref Broadcast | 1 1 0.5s
          Broadcast Fix | 2 2 0.7s
          Broadcast Ref{<:Type} | 1 1 0.3s
        Device | 3 3 0.0s
        Stream | 7 7 0.3s
    test/device_tests.jl | 473 9 482
    test/external_tests.jl | 18 18
    test/gpuarrays_tests.jl | 7213 7213
    test/hip_core_tests.jl | 4 1 5
      hip - core | 4 1 5 2.3s
        AMDGPU.@elapsed | 4 4 0.6s
        HIP Peer Access | 1 1 0.4s
    test/hip_miopen_tests.jl | 1 1
      hip - MIOpen | 1 1 0.0s
    test/hip_rocblas_tests.jl | 672 1 673
      hip - rocBLAS | 672 1 673 1m08.2s
        BLAS | 672 1 673 1m05.4s
          Build Information | 1 1 0.2s
          Highlevel | 2 2 3.8s
          Level 1 | 51 1 52 10.6s
            T = Float32 | 13 13 1.0s
            T = Float64 | 13 13 0.7s
            T = ComplexF32 | 12 1 13 7.5s
            T = ComplexF64 | 13 13 1.4s
          Level 2 | 172 172 12.7s
          Level 3 | 446 446 38.1s
    test/hip_rocfft_tests.jl | 199 199
    test/hip_rocrand_tests.jl | 141 141
    test/hip_rocsolver_tests.jl | 538 538
    test/hip_rocsparse_tests.jl | 1099 136 1235
    test/ka_tests.jl | 2201 6 2207
ERROR: LoadError: Some tests did not pass: 13173 passed, 2 failed, 2 errored, 151 broken.
in expression starting at /home/fra/.julia/packages/AMDGPU/a1v0k/test/runtests.jl:107
ERROR: Package AMDGPU errored during testing
```
Is this expected behaviour? I do have an integrated APU (which I don't use at the moment), so that might be why some of the multi-GPU tests are failing.
ROCm does not support integrated APUs, I think, but since the APU is visible, AMDGPU.jl tries to run the multi-GPU tests on it.
If you hide it with `HIP_VISIBLE_DEVICES` and some tests still fail, please share the error messages for those.
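For reference, hiding the iGPU amounts to setting `HIP_VISIBLE_DEVICES` before launching Julia; a minimal sketch, assuming the discrete GPU is device index 0 (check `rocminfo` or `AMDGPU.devices()` for the actual index on your machine):

```shell
# Restrict HIP to a single device. The index 0 below is an assumption:
# list devices with `rocminfo` and pick the index of the RX 6950 XT.
export HIP_VISIBLE_DEVICES=0
echo "HIP will enumerate device(s): $HIP_VISIBLE_DEVICES"

# Then re-run the test suite in the same shell, e.g.:
# julia -e 'using Pkg; Pkg.test("AMDGPU")'
```

The variable must be set in the environment of the Julia process itself, so export it before starting Julia rather than from within a running session.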
Sorry for the long delay! Excluding the multi-GPU tests, I get one failure, one error, and 153 broken tests. The error is in MIOpen:
```
MIOpen Error: /usr/src/debug/miopen-hip/MIOpen-rocm-6.0.2/src/ocl/convolutionocl.cpp:129: Invalid filter channel number
MIOpen(HIP): Warning [ValidateGroupCount] NCHWw {10, 4, 10, 10}, x {4, 2, 2, 2}, groups = 2
MIOpen Error: /usr/src/debug/miopen-hip/MIOpen-rocm-6.0.2/src/ocl/convolutionocl.cpp:129: Invalid filter channel number
MIOpen Error: /usr/src/debug/miopen-hip/MIOpen-rocm-6.0.2/src/convolution.cpp:271: Channels do not match for the filter
/usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/stl_vector.h:1144: const_reference std::vector<unsigned long>::operator[](size_type) const [_Tp = unsigned long, _Alloc = std::allocator<unsigned long>]: Assertion '__n < this->size()' failed.
```
The failure, on the other hand, is in rocBLAS, with `T = ComplexF32`.
These MIOpen errors most likely occur during algorithm search and are not fatal: they mean there is no suitable algorithm for the current backend (OpenCL), so MIOpen moves on to the other backends (ASM, HIP).
For the other errors, posting the full test summary (including stacktraces) printed at the end would be helpful, since almost all rocBLAS functions are tested with the ComplexF32 type.
> MIOpen Error: /usr/src/debug/miopen-hip/MIOpen-rocm-6.0.2/src/ocl/convolutionocl.cpp:129: Invalid filter channel number
Actually, this one was a bug in the forward convolution workspace calculation. Fixed by #678.