
Test kernel_regress:skx_avx fails on RISC-V platform

Open leavelet opened this issue 1 year ago • 7 comments

Environment:

OpenBLAS version: release 0.3.26
OS: revyos
CPU: Sophgo sg2042, RISC-V rv64imafdc with rvv 0.71
Compiler: g++ 10.4, THead version. https://github.com/revyos/gcc/tree/revyos-gcc10.4-thead-dev
Compile command: make HOSTCC=gcc-10 TARGET=C910V CC=riscv64-linux-gnu-gcc-10 FC=riscv64-linux-gnu-gfortran-10 -j 64

Error log:

TEST 38/40 kernel_regress:skx_avx [FAIL]
  ERR: test_kernel_regress.c:50  expected 0.000e+00, got 2.719e+04 (diff -2.719e+04, tol 1.000e-10)

By the way, the risc-v branch gets stuck on the line below:

OPENBLAS_NUM_THREADS=2 ./cblat3 < ./cblat3.dat

leavelet avatar Jan 20 '24 16:01 leavelet

@RevySR

leavelet avatar Jan 20 '24 16:01 leavelet

kernel_regress:skx_avx is a DGEMM test. Maybe we should rename it, as its history as an AVX512 bug in the SkylakeX kernel is irrelevant today...

In CI it works with a different vendor toolchain based on GCC 10.2 (see .github/workflows/c910v.yml for the URL), but of course the tests there run only under qemu instead of on the actual hardware.

martin-frbg avatar Jan 20 '24 18:01 martin-frbg

Since the CI with GCC 10.2 works fine, maybe it is a vendor problem. I shall work with Revy to resolve it.

leavelet avatar Jan 20 '24 18:01 leavelet

Any updates on this? I've since merged the risc-v branch as I could not reproduce the problems in CI or local qemu, but I lack real C910V hardware at the moment.

martin-frbg avatar Feb 07 '24 17:02 martin-frbg

The GEMM issue is fixed in #4454. We have found another issue in kernel/riscv64/nrm2_vector.c, which hasn't been fixed yet. Keeping this issue open until we fix the nrm2 issue or closing it to open a new one both work fine; I'm not sure which one is better.

leavelet avatar Feb 07 '24 18:02 leavelet

Thanks. We can keep this one open for simplicity (unless you expect this to take long, in which case opening a new issue with an appropriate title might help others find it faster). Annoying that it seems to depend so much on compiler version, or qemu vs. actual hardware.

martin-frbg avatar Feb 07 '24 19:02 martin-frbg

I cannot reproduce either of these issues on MilkV Pioneer with current develop and a thead gcc built from the current state of their source repository.

martin-frbg avatar Jun 18 '24 14:06 martin-frbg
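
For readers trying to narrow down a failure like the one in the error log (a scalar expected to be 0 within a tolerance of 1e-10), one possible approach is to compare the library's DGEMM result against a naive reference on the same inputs. The following is only a sketch under stated assumptions: `ref_dgemm`, `max_abs_diff` and `results_agree` are hypothetical helper names, not part of OpenBLAS, and this is not the code in test_kernel_regress.c. It assumes column-major storage and no transposition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference: C = alpha * A * B + beta * C for column-major
 * matrices without transposition. A is m-by-k, B is k-by-n, C is m-by-n. */
void ref_dgemm(int m, int n, int k, double alpha,
               const double *a, int lda,
               const double *b, int ldb,
               double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];

            double *cij = &c[i + (size_t)j * ldc];
            /* Follow the BLAS convention: with beta == 0, C is not read. */
            *cij = (beta == 0.0) ? alpha * acc : alpha * acc + beta * (*cij);
        }
    }
}

/* Largest absolute element-wise difference between two m-by-n matrices. */
double max_abs_diff(int m, int n,
                    const double *x, int ldx,
                    const double *y, int ldy)
{
    double worst = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double d = x[i + (size_t)j * ldx] - y[i + (size_t)j * ldy];
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}

/* Returns 1 if the two results agree within tol, 0 otherwise. */
int results_agree(int m, int n,
                  const double *x, int ldx,
                  const double *y, int ldy, double tol)
{
    return max_abs_diff(m, n, x, ldx, y, ldy) <= tol;
}
```

To use it, fill one output matrix with `cblas_dgemm` (using `CblasColMajor` and `CblasNoTrans`) and a second one with `ref_dgemm` on the same inputs, then pass both to `results_agree`. The tolerance has to be chosen by the caller: a naive triple loop and an optimized kernel sum in different orders, so small rounding differences are expected and an exact-zero comparison would be too strict for general inputs.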