Possible data races detected with ThreadSanitizer in ger_thread, ...
I built a TSAN-instrumented OpenBLAS with clang-18; here are the build commands:
cd /project/OpenBLAS
export CC=clang-18
export CXX=clang++-18
export CFLAGS="-fsanitize=thread -g"
export CXXFLAGS="-fsanitize=thread -g"
export LDFLAGS="-fsanitize=thread"
apt-get update && apt-get install -y perl --no-install-recommends
make -j 32 shared
make -j 32 install
and then ran the utests with a few threads:
OMP_NUM_THREADS=2 make -C utest &> make_utest.nt-2.log
ThreadSanitizer reports various data races: https://gist.github.com/vfdev-5/7e57d305f100003aa4b9ada11efbcf0e#file-make_utest-nt-2-log
Environment info:
- Linux, Ubuntu 24.04.1 LTS
# Makefile.conf
OSNAME=Linux
ARCH=x86_64
C_COMPILER=CLANG
BINARY32=
BINARY64=1
CEXTRALIB= -L/usr/bin/../lib/gcc/x86_64-linux-gnu/14 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -lpthread -lrt -lm -ldl -lresolv -lc /usr/lib/llvm-18/lib/clang/18/lib/linux/libclang_rt.tsan-x86_64.a
F_COMPILER=GFORTRAN
FC=gfortran
BU=_
NOFORTRAN=1
CORE=ZEN
LIBCORE=zen
NUM_CORES=128
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_SSE4A=1
HAVE_AVX=1
HAVE_AVX2=1
HAVE_FMA3=1
MAKEFLAGS += -j 128
SBGEMM_UNROLL_M=8
SBGEMM_UNROLL_N=4
SGEMM_UNROLL_M=8
SGEMM_UNROLL_N=4
DGEMM_UNROLL_M=4
DGEMM_UNROLL_N=8
QGEMM_UNROLL_M=2
QGEMM_UNROLL_N=2
CGEMM_UNROLL_M=8
CGEMM_UNROLL_N=2
ZGEMM_UNROLL_M=4
ZGEMM_UNROLL_N=2
XGEMM_UNROLL_M=1
XGEMM_UNROLL_N=1
CGEMM3M_UNROLL_M=8
CGEMM3M_UNROLL_N=4
ZGEMM3M_UNROLL_M=4
ZGEMM3M_UNROLL_N=4
XGEMM3M_UNROLL_M=2
XGEMM3M_UNROLL_N=2
Thanks @vfdev-5, this is useful. I noticed that a number of the issues are in ger_thread, and ger* is the only routine where I found issues when doing parallel testing of scipy.linalg: https://github.com/scipy/scipy/issues/21936. It may be a coincidence, but I'll try to dig into it a bit this week.
@martin-frbg for context: in Python there's a significant effort happening to remove the GIL (Global Interpreter Lock), so a free-threaded CPython interpreter now allows calling Python functions from parallel threads at the same time. This means numpy.linalg and scipy.linalg (and PyTorch, and other packages with direct BLAS/LAPACK bindings) are going to be calling OpenBLAS from parallel threads in the same process. TSAN is used to detect potential issues with that.
Thanks - I'm looking into level 2 BLAS thread safety in conjunction with #5104; most attention had been on level 3 and the thread server code in the past.
The tsan log does not make much sense to me. Can't say I'm much happier with helgrind though, as it flags a couple of atomic operations in blas_server.c and "possible races" between Fortran and C accessing the work array.
I'm not familiar with your code, so my diagnosis may be wrong, but the tsan race looks right to me.
The read here https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/others/blas_server.c#L1056
and the write here: https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level2/ger_thread.c#L180
are racy because I don't think we've done anything to order the reader and the writer.
The only synchronization is via the queue pointer, I think, but you use relaxed reads and writes, which, critically, do not imply any ordering of other reads or writes, even from the same thread, with respect to the atomic read/write. I think you need acquire/release ordering?
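Roughly the pattern as I read it, as a minimal C11 sketch - not the actual OpenBLAS code; work_t, publish, serve and do_work are made-up names. With memory_order_relaxed in place of the release/acquire pair below, nothing orders the plain accesses to the work descriptor, which is exactly the race tsan flags:

#include <stdatomic.h>
#include <stddef.h>

typedef struct { double *a; long m, n; } work_t;  /* stand-in for the queue entry */

extern void do_work(work_t *w);                   /* stand-in for ger_kernel */

static work_t args;
static _Atomic(work_t *) queue = NULL;

/* dispatcher side (dger_thread in the report) */
void publish(double *a, long m, long n) {
  args.a = a; args.m = m; args.n = n;             /* plain, non-atomic writes */
  /* a release store publishes the writes above; a relaxed store would not */
  atomic_store_explicit(&queue, &args, memory_order_release);
}

/* worker side (blas_thread_server / ger_kernel in the report) */
void serve(void) {
  work_t *w;
  /* the acquire load orders the descriptor reads after the writes that the
     matching release store published; a relaxed load would not */
  while ((w = atomic_load_explicit(&queue, memory_order_acquire)) == NULL)
    ;                                             /* spin until work is posted */
  do_work(w);
}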
My impression is that the race reported as
WARNING: ThreadSanitizer: data race (pid=1458024)
Read of size 8 at 0x7fffffff7408 by thread T1:
#0 ger_kernel /project/OpenBLAS/driver/level2/ger_thread.c:61:24 (openblas_utest+0x26e79d) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#1 exec_threads /project/OpenBLAS/driver/others/blas_server.c:1148:3 (openblas_utest+0x14a05a) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#2 blas_thread_server /project/OpenBLAS/driver/others/blas_server.c:463:5 (openblas_utest+0x1498f7) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
Previous write of size 8 at 0x7fffffff7408 by main thread:
#0 dger_thread /project/OpenBLAS/driver/level2/ger_thread.c:148:10 (openblas_utest+0x26e56f) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#1 dger_ /project/OpenBLAS/interface/ger.c:196:5 (openblas_utest+0x26e499) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#2 dlarf_ /project/OpenBLAS/lapack-netlib/SRC/dlarf.c (openblas_utest+0x26e14d) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#3 dgebd2_ /project/OpenBLAS/lapack-netlib/SRC/dgebd2.c:779:3 (openblas_utest+0x26d59f) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
...
between:
https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level2/ger_thread.c#L61
and
https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level2/ger_thread.c#L148
looks real. In the level 3 code I saw a static pthread_mutex_t level3_lock used for a similar situation:
https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level3/level3_gemm3m_thread.c#L886
I'm also not familiar with the OpenBLAS code, so I may be wrong as well.
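If that approach were applied to the level 2 path, it would look roughly like this (a sketch only; dispatch_ger and ger_thread_locked are hypothetical names, not actual OpenBLAS symbols):

#include <pthread.h>

extern int dispatch_ger(void *args);   /* hypothetical: fill queue, run workers, wait */

static pthread_mutex_t level2_lock = PTHREAD_MUTEX_INITIALIZER;

int ger_thread_locked(void *args) {
  int ret;
  pthread_mutex_lock(&level2_lock);    /* concurrent callers take turns */
  ret = dispatch_ger(args);
  pthread_mutex_unlock(&level2_lock);
  return ret;
}

That sidesteps the ordering question entirely, at the cost of serializing the dispatch across callers.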
@vfdev-5 thanks. Unfortunately I'm struggling to get anything at all OpenBLAS-wise done right now, so please allow for some delays and premature comments. Also most of that is legacy GotoBLAS code that maybe got more trust than it deserves.
@hawkinsp indeed I appear to have overlooked the ATOMIC_RELAXED qualifier - or trusted the submitter to know what he was doing - when I merged the relevant PR, which would seem to have (re)introduced a race wherever C11 semantics are available (kind of tracks with newer compiler versions exposing this, I guess). Changing to ATOMIC_ACQ_REL seems to have plugged the ARMV6 hole of #5104, not sure if there is a safe middle ground performance-wise.
And indeed it would make perfect sense that such a problem only shows up on ARM, which has a weaker memory model than, say, x86.
I suspect it's enough to use ATOMIC_ACQUIRE on the reads and ATOMIC_RELEASE on the writes, but this is hard to be sure of from a drive-by read of the code!
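Assuming the ATOMIC_* qualifiers map onto the GCC __atomic builtins' memory orders (I haven't checked the macro definitions), the pairing I mean is roughly the following; queue_t, post_work, take_work and run are illustrative names, not the real symbols:

typedef struct { void *sa, *sb; int assigned; } queue_t;

extern void run(void *sa, void *sb);   /* hypothetical kernel invocation */

/* writer: plain stores to the work item, then a release store of the flag */
void post_work(queue_t *q, void *sa, void *sb) {
  q->sa = sa;
  q->sb = sb;
  __atomic_store_n(&q->assigned, 1, __ATOMIC_RELEASE);
}

/* reader: an acquire load of the flag orders the plain reads after it */
void take_work(queue_t *q) {
  while (!__atomic_load_n(&q->assigned, __ATOMIC_ACQUIRE))
    ;                                  /* spin until posted */
  run(q->sa, q->sb);
}

Note that __ATOMIC_ACQ_REL is only valid for the read-modify-write builtins; for plain loads and stores, ACQUIRE and RELEASE are the strongest one-way orders available, which might also be the performance middle ground you're after.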