Inaccurate result for `neoversev1` `cblas_sdot` kernel only when using single thread
This issue was reported on an Nvidia Grace CPU. The reproducer is in Julia, so it is not entirely minimal, but it should give you an idea of how to reproduce it with a direct call to the library:
$ OPENBLAS_VERBOSE=2 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
Core: neoversev1
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991972f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6
julia> two_threads ≈ one_thread
false
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
false
julia> expected_result ≈ two_threads
true
The call to BLAS.dot reduces to a call to cblas_sdot, so that should be the offending OpenBLAS kernel. This is with OpenBLAS v0.3.29; I haven't tried v0.3.30. To reproduce the issue, the vectors need to have 10 million elements (it doesn't reproduce with 1 million or fewer), the two vectors need to have the same elements (but they need not be the same vector in memory, which is why I used B = copy(A)), and subtracting their mean is important (without that it doesn't reproduce), which means the array needs to contain many small numbers.
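If it helps, here is a rough standalone C reproducer along the same lines (a sketch: it generates its own approximately standard-normal data with Box-Muller instead of Julia's randn stream, so the values won't match bit-for-bit, and it assumes a plain LP64 cblas.h; an INTERFACE64 build would go through the 64_-suffixed symbols instead):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10000000;
    const double two_pi = 6.283185307179586;
    float *a = malloc((size_t)n * sizeof(float));
    /* roughly standard-normal samples via Box-Muller; rand() is crude
       but good enough to reproduce the statistical setup */
    srand(42);
    for (int i = 0; i < n; i += 2) {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double r = sqrt(-2.0 * log(u1));
        a[i] = (float)(r * cos(two_pi * u2));
        if (i + 1 < n) a[i + 1] = (float)(r * sin(two_pi * u2));
    }
    /* subtract the mean so the array is full of small numbers,
       mirroring A .- mean(A) */
    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += a[i];
    mean /= n;
    for (int i = 0; i < n; i++) a[i] -= (float)mean;
    /* both inputs have the same elements, as with B = copy(A) */
    printf("sdot = %.7g\n", cblas_sdot(n, a, 1, a, 1));
    free(a);
    return 0;
}

Build with something like cc repro.c -lopenblas -lm and compare runs with OPENBLAS_NUM_THREADS=1 versus OPENBLAS_NUM_THREADS=2.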
This is specific to the neoversev1 kernel; armv8 and neoversen1 don't seem to have this accuracy issue (and no one else has reported this before; this code comes from Julia's test suite, so many people have been running it for a long time):
$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=armv8 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
Core: armv8
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991984f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.991984f6
julia> two_threads ≈ one_thread
true
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
true
julia> expected_result ≈ two_threads
true
$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=neoversen1 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
Core: neoversen1
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.994247f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.993646f6
julia> two_threads ≈ one_thread
true
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
true
julia> expected_result ≈ two_threads
true
Originally reported at https://github.com/JuliaLang/julia/issues/59664.
> haven't tried v0.3.30
I saw #5293, which looked relevant and was reportedly fixed in v0.3.30 by #5294, so I did a local build of OpenBLAS at e58f6dc50da3752e938ea6bbf411db8adbb9fe42 (make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ NUM_THREADS=72 -j36), but I can still reproduce the issue:
$ OPENBLAS_VERBOSE=2 julia +nightly -q
julia> using LinearAlgebra
julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversev1
Core: neoversev2
5063
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991972f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6
julia> two_threads ≈ one_thread
false
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
false
julia> expected_result ≈ two_threads
true
#5293/#5294 specifically concerned the non-SVE dot kernel, so they would be unrelated. The SVE kernel (kernel/arm64/dot.c) appears to have had no relevant changes since its original introduction in 2022 (#3842, released with 0.3.22), and on first glance I do not see anything that would give its single-threaded invocation reduced accuracy compared to the multithreaded one. (If anything, the multithreaded path might use excess precision in the intermediate results of its threads, but if I read the code correctly this would be cast to single precision before the summation.) I wonder if this is simply the algorithmic difference in the order in which the calculations on the small numbers are performed (with the inevitable rounding).
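That order dependence is easy to see in isolation: the dot products here are around 1e7 built from order-1 increments, which puts a single-precision accumulator right at the edge where increments start rounding away (2^24). A plain C illustration, nothing OpenBLAS-specific:

#include <stdio.h>

int main(void) {
    /* above 2^24 adjacent floats are 2 apart, so adding 1.0f rounds away */
    float big = 16777216.0f;                 /* 2^24 */
    printf("%.1f\n", big + 1.0f);            /* 16777216.0: the add is lost */
    printf("%.1f\n", big + (1.0f + 1.0f));   /* 16777218.0: same adds, regrouped */
    return 0;
}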
It's definitely not awfully off, but the difference is observable.
If you have a chance to test it yourself, this is the input data (it is already A - mean(A), so you can pass it directly as input to cblas_sdot with no further preprocessing):
I chunked the file into two pieces because GitHub wouldn't allow me to upload a single 40 MB file (and compression doesn't help much); you can cat the two files in this order to reconstruct the full array. It is simply a binary file of floats, so you should be able to read it into an array with fread, as sketched below.
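For what it's worth, reading the reconstructed file and feeding it to cblas_sdot could look like this sketch ("data.bin" is a placeholder for whatever the concatenated file is named; the file stores native-endian 32-bit floats):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10000000;
    float *a = malloc((size_t)n * sizeof(float));
    FILE *f = fopen("data.bin", "rb"); /* placeholder name */
    if (!f || fread(a, sizeof(float), (size_t)n, f) != (size_t)n) {
        fprintf(stderr, "failed to read input data\n");
        return 1;
    }
    fclose(f);
    /* the data is already mean-subtracted, so pass it straight through */
    printf("sdot = %.7g\n", cblas_sdot(n, a, 1, a, 1));
    free(a);
    return 0;
}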
In the clobber list in the SVE code I don't see d0 listed, but it's used in the code. Should it be included?
The sample code uses Float32, so this would actually be the single-precision kernel. I don't see any of the s registers in the clobber list for the kernel at all, so neither s0 nor s1 is marked.
Also, it looks like none of the single-precision registers get zeroed when the function starts, but d1 always gets set to zero (even in the single-precision kernel). And d0 is never zeroed, even though it is the target of the final SUM_VECTOR.
Perhaps this set of changes would fix it?
diff --git a/kernel/arm64/dot_kernel_sve.c b/kernel/arm64/dot_kernel_sve.c
index bc9975214..174116675 100644
--- a/kernel/arm64/dot_kernel_sve.c
+++ b/kernel/arm64/dot_kernel_sve.c
@@ -63,7 +63,8 @@ THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
" mov z1.d, #0 \n" \
" mov z0.d, #0 \n" \
" mov x8, #0 \n" \
-" movi d1, #0x0 \n" \
+" movi "DTYPE"0, #0x0 \n" \
+" movi "DTYPE"1, #0x0 \n" \
SETUP_TRUE \
" neg x10, x9, lsl #1 \n" \
" ands x11, x10, x0 \n" \
@@ -111,7 +112,8 @@ dot_kernel_sve(BLASLONG n, FLOAT* x, FLOAT* y)
: "cc",
"memory",
"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
- "x8", "x9", "x10", "x11", "x12", "x13", "d1",
+ "x8", "x9", "x10", "x11", "x12", "x13",
+ DTYPE"0", DTYPE"1",
"z0", "z1"
);
Sadly no, I applied that patch locally, but when using a single thread I still get the slightly inaccurate result (this is the same as above):
julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversev2
5063
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6
I guess one could try adding z2 and z3 to the clobber list, but fundamentally I do not see how a trashed register could lead to only "slightly" off results, and in the single-threaded case in particular.
After talking offline with @imciner2 and playing around with this some more, I'm getting convinced that the scalar product of such a long vector with itself is just hard to do in single precision in one go, and splitting the sum helps:
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = let tmp = randn(Float32, 10_000_000); tmp .-= mean(tmp); end;
julia> expected_result = sum(abs2, A)
9.99456f6
julia> BLAS.set_num_threads(1)
julia> dot(A, A)
9.987286f6
julia> dot(A[begin:end÷2], A[begin:end÷2]) + dot(A[end÷2+1:end], A[end÷2+1:end])
9.991972f6
julia> BLAS.set_num_threads(2)
julia> dot(A, A)
9.991972f6
julia> dot(A[begin:end÷2], A[begin:end÷2]) + dot(A[end÷2+1:end], A[end÷2+1:end])
9.99365f6
julia> BLAS.set_num_threads(10)
julia> dot(A, A)
9.994369f6
julia> dot(A[begin:end÷2], A[begin:end÷2]) + dot(A[end÷2+1:end], A[end÷2+1:end])
9.994488f6
Splitting the operation, whether by using multiple threads, by doing the operation on two halves of the vector, or both, gives more and more accurate results.
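This is essentially what pairwise (recursive) summation formalizes. A sketch of a pairwise sdot in plain C (my own illustration, not OpenBLAS's kernel): the recursion keeps every partial sum small, so the rounding error grows roughly logarithmically in n instead of linearly:

#include <stdio.h>
#include <stdlib.h>

/* recurse until blocks are short, then accumulate directly */
static float pairwise_sdot(long n, const float *x, const float *y) {
    if (n <= 1024) {
        float s = 0.0f;
        for (long i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }
    long half = n / 2;
    return pairwise_sdot(half, x, y) +
           pairwise_sdot(n - half, x + half, y + half);
}

int main(void) {
    const long n = 20000000;
    float *x = malloc((size_t)n * sizeof(float));
    for (long i = 0; i < n; i++) x[i] = 1.0f;
    float naive = 0.0f;
    for (long i = 0; i < n; i++) naive += x[i] * x[i];
    printf("naive:    %.1f\n", naive);                  /* stuck at 2^24 = 16777216 */
    printf("pairwise: %.1f\n", pairwise_sdot(n, x, x)); /* 20000000 exactly */
    free(x);
    return 0;
}

With strict IEEE float semantics (no -ffast-math) the naive loop saturates at 2^24 on this all-ones input, while the pairwise version is exact; each extra thread in OpenBLAS effectively buys one more level of this splitting.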
Also, to confirm that there's nothing intrinsically bad about one thread: with a 100-million-element vector, even 2 to 10 threads aren't sufficient to get an accurate result, and one needs to go even higher to achieve better accuracy:
julia> Random.seed!(42);
julia> A = let tmp = randn(Float32, 100_000_000); tmp .-= mean(tmp); end;
julia> expected_result = sum(abs2, A)
1.00027256f8
julia> BLAS.set_num_threads(1)
julia> dot(A, A)
9.784091f7
julia> BLAS.set_num_threads(2)
julia> dot(A, A)
9.923342f7
julia> BLAS.set_num_threads(10)
julia> dot(A, A)
9.995446f7
julia> BLAS.set_num_threads(20)
julia> dot(A, A)
1.0000153f8
Is there a fundamental reason why you are using sdot instead of dsdot (which would at least accumulate in double precision), or switching to Float64 throughout, in particular if you know that you have a "crazily" long vector with many/mostly tiny elements?
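For reference, this is what the dsdot route looks like through the standard CBLAS interface (a sketch; cblas_dsdot takes the same float inputs but returns a double accumulated in double precision):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10000000;
    float *a = malloc((size_t)n * sizeof(float));
    for (int i = 0; i < n; i++) a[i] = 1.0f;
    /* single-precision accumulation; the exact result depends on the
       kernel's accumulation scheme, but a float only carries ~24 bits */
    printf("sdot:  %.8g\n", cblas_sdot(n, a, 1, a, 1));
    /* same float inputs, double-precision accumulator */
    printf("dsdot: %.8g\n", cblas_dsdot(n, a, 1, a, 1));
    free(a);
    return 0;
}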
That was reduced code from a test failure reported in https://github.com/JuliaLang/julia/issues/59664 for Julia v1.11, a failure I couldn't reproduce in the development version of Julia because I believe the offending code has since been changed and no longer goes through sdot, but I wanted to investigate why the dot call was getting inaccurate results.
Ah, ok. I wonder if it would make sense for "someone" to compile a page of examples of what to avoid. I guess there should be ample material hidden across the Reference-LAPACK, OpenBLAS, Julia, and *py issue trackers alone - but maybe such a page already exists somewhere?