Possible data races detected with ThreadSanitizer in ger_thread, ...
I built a TSAN-instrumented OpenBLAS with clang-18; here are the build commands:
cd /project/OpenBLAS
export CC=clang-18
export CXX=clang++-18
export CFLAGS="-fsanitize=thread -g"
export CXXFLAGS="-fsanitize=thread -g"
export LDFLAGS="-fsanitize=thread"
apt-get update && apt-get install -y perl --no-install-recommends
make -j 32 shared
make -j 32 install
and then ran the utests with a few threads:
OMP_NUM_THREADS=2 make -C utest &> make_utest.nt-2.log
ThreadSanitizer reports various data races: https://gist.github.com/vfdev-5/7e57d305f100003aa4b9ada11efbcf0e#file-make_utest-nt-2-log
Environment info:
- Linux, Ubuntu 24.04.1 LTS
# Makefile.conf
OSNAME=Linux
ARCH=x86_64
C_COMPILER=CLANG
BINARY32=
BINARY64=1
CEXTRALIB= -L/usr/bin/../lib/gcc/x86_64-linux-gnu/14 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -lpthread -lrt -lm -ldl -lresolv -lc /usr/lib/llvm-18/lib/clang/18/lib/linux/libclang_rt.tsan-x86_64.a
F_COMPILER=GFORTRAN
FC=gfortran
BU=_
NOFORTRAN=1
CORE=ZEN
LIBCORE=zen
NUM_CORES=128
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_SSE4A=1
HAVE_AVX=1
HAVE_AVX2=1
HAVE_FMA3=1
MAKEFLAGS += -j 128
SBGEMM_UNROLL_M=8
SBGEMM_UNROLL_N=4
SGEMM_UNROLL_M=8
SGEMM_UNROLL_N=4
DGEMM_UNROLL_M=4
DGEMM_UNROLL_N=8
QGEMM_UNROLL_M=2
QGEMM_UNROLL_N=2
CGEMM_UNROLL_M=8
CGEMM_UNROLL_N=2
ZGEMM_UNROLL_M=4
ZGEMM_UNROLL_N=2
XGEMM_UNROLL_M=1
XGEMM_UNROLL_N=1
CGEMM3M_UNROLL_M=8
CGEMM3M_UNROLL_N=4
ZGEMM3M_UNROLL_M=4
ZGEMM3M_UNROLL_N=4
XGEMM3M_UNROLL_M=2
XGEMM3M_UNROLL_N=2
Thanks @vfdev-5, this is useful. I noticed that a number of the issues are in ger_thread, and ger* is the only routine where I found issues when doing parallel testing of scipy.linalg: https://github.com/scipy/scipy/issues/21936. It may be a coincidence, but I'll try to dig into it a bit this week.
@martin-frbg for context: in Python there's a significant effort happening to remove the GIL (Global Interpreter Lock), so a free-threaded CPython interpreter now allows calling Python functions from parallel threads at the same time. This means numpy.linalg and scipy.linalg (and PyTorch, and other packages with direct BLAS/LAPACK bindings) are going to be calling OpenBLAS from parallel threads in the same process. TSAN is used to detect potential issues with that.
Thanks - I'm looking into level 2 BLAS thread safety in conjunction with #5104; most attention had been on level 3 and the thread server code in the past.
The tsan log does not make much sense to me. Can't say I'm much happier with helgrind though, as it flags a couple of atomic operations in blas_server.c and "possible races" between Fortran and C accessing the work array.
I'm not familiar with your code, so my diagnosis may be wrong, but the tsan race looks right to me.
The read here https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/others/blas_server.c#L1056
and the write here: https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level2/ger_thread.c#L180
are racy because I don't think we've done anything to order the reader and the writer.
The only synchronization is via the queue pointer, I think, but you use relaxed reads and writes, which, critically, do not imply any ordering of other reads or writes, even from the same thread, with respect to the atomic read/write. I think you need acquire/release ordering?
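Roughly the pattern as I read it, as a minimal C11 sketch - not the actual OpenBLAS code; work_t, publish, serve and do_work are made-up names. With memory_order_relaxed in place of the release/acquire pair below, nothing orders the plain accesses to the work descriptor, which is exactly the race tsan flags:

#include <stdatomic.h>
#include <stddef.h>

typedef struct { double *a; long m, n; } work_t;  /* stand-in for the queue entry */

extern void do_work(work_t *w);                   /* stand-in for ger_kernel */

static work_t args;
static _Atomic(work_t *) queue = NULL;

/* dispatcher side (dger_thread in the report) */
void publish(double *a, long m, long n) {
  args.a = a; args.m = m; args.n = n;             /* plain, non-atomic writes */
  /* a release store publishes the writes above; a relaxed store would not */
  atomic_store_explicit(&queue, &args, memory_order_release);
}

/* worker side (blas_thread_server / ger_kernel in the report) */
void serve(void) {
  work_t *w;
  /* the acquire load orders the descriptor reads after the writes that the
     matching release store published; a relaxed load would not */
  while ((w = atomic_load_explicit(&queue, memory_order_acquire)) == NULL)
    ;                                             /* spin until work is posted */
  do_work(w);
}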
My impression is that the race reported as
WARNING: ThreadSanitizer: data race (pid=1458024)
Read of size 8 at 0x7fffffff7408 by thread T1:
#0 ger_kernel /project/OpenBLAS/driver/level2/ger_thread.c:61:24 (openblas_utest+0x26e79d) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#1 exec_threads /project/OpenBLAS/driver/others/blas_server.c:1148:3 (openblas_utest+0x14a05a) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#2 blas_thread_server /project/OpenBLAS/driver/others/blas_server.c:463:5 (openblas_utest+0x1498f7) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
Previous write of size 8 at 0x7fffffff7408 by main thread:
#0 dger_thread /project/OpenBLAS/driver/level2/ger_thread.c:148:10 (openblas_utest+0x26e56f) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#1 dger_ /project/OpenBLAS/interface/ger.c:196:5 (openblas_utest+0x26e499) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#2 dlarf_ /project/OpenBLAS/lapack-netlib/SRC/dlarf.c (openblas_utest+0x26e14d) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
#3 dgebd2_ /project/OpenBLAS/lapack-netlib/SRC/dgebd2.c:779:3 (openblas_utest+0x26d59f) (BuildId: c8da20250a572463b1233c83a38721c41a6dc52f)
...
between:
https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level2/ger_thread.c#L61
and
https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level2/ger_thread.c#L148
looks real. In the level 3 code I saw a static pthread_mutex_t level3_lock used for a similar situation:
https://github.com/OpenMathLib/OpenBLAS/blob/ef9e3f71595edfc69aadeb34d99a96d5f72a29a2/driver/level3/level3_gemm3m_thread.c#L886
I'm also not familiar with the OpenBLAS code, so I may be wrong as well.
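If that approach were applied to the level 2 path, it would look roughly like this (a sketch only; dispatch_ger and ger_thread_locked are hypothetical names, not actual OpenBLAS symbols):

#include <pthread.h>

extern int dispatch_ger(void *args);   /* hypothetical: fill queue, run workers, wait */

static pthread_mutex_t level2_lock = PTHREAD_MUTEX_INITIALIZER;

int ger_thread_locked(void *args) {
  int ret;
  pthread_mutex_lock(&level2_lock);    /* concurrent callers take turns */
  ret = dispatch_ger(args);
  pthread_mutex_unlock(&level2_lock);
  return ret;
}

That sidesteps the ordering question entirely, at the cost of serializing the dispatch across callers.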
@vfdev-5 thanks. Unfortunately I'm struggling to get anything at all OpenBLAS-wise done right now, so please allow for some delays and premature comments. Also most of that is legacy GotoBLAS code that maybe got more trust than it deserves.
@hawkinsp indeed I appear to have overlooked the ATOMIC_RELAXED qualifier - or trusted the submitter to know what he was doing - when I merged the relevant PR, which would seem to have (re)introduced a race wherever C11 semantics are available (kind of tracks with newer compiler versions exposing this, I guess). Changing to ATOMIC_ACQ_REL seems to have plugged the ARMV6 hole of #5104, not sure if there is a safe middle ground performance-wise.
And indeed it would make perfect sense that such a problem only shows up on ARM, which has a weaker memory model than, say, x86.
I suspect it's enough to use ATOMIC_ACQUIRE on the reads and ATOMIC_RELEASE on the writes, but this is hard to be sure of from a drive-by read of the code!
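Assuming the ATOMIC_* qualifiers map onto the GCC __atomic builtins' memory orders (I haven't checked the macro definitions), the pairing I mean is roughly the following; queue_t, post_work, take_work and run are illustrative names, not the real symbols:

typedef struct { void *sa, *sb; int assigned; } queue_t;

extern void run(void *sa, void *sb);   /* hypothetical kernel invocation */

/* writer: plain stores to the work item, then a release store of the flag */
void post_work(queue_t *q, void *sa, void *sb) {
  q->sa = sa;
  q->sb = sb;
  __atomic_store_n(&q->assigned, 1, __ATOMIC_RELEASE);
}

/* reader: an acquire load of the flag orders the plain reads after it */
void take_work(queue_t *q) {
  while (!__atomic_load_n(&q->assigned, __ATOMIC_ACQUIRE))
    ;                                  /* spin until posted */
  run(q->sa, q->sb);
}

Note that __ATOMIC_ACQ_REL is only valid for the read-modify-write builtins; for plain loads and stores, ACQUIRE and RELEASE are the strongest one-way orders available, which might also be the performance middle ground you're after.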