Inaccurate result for `neoversev1` `cblas_sdot` kernel only when using single thread
This issue was reported on an Nvidia Grace CPU. The reproducer is in Julia, so it is not entirely minimal, but it should give you an idea of how to reproduce it with a direct call to the library:
$ OPENBLAS_VERBOSE=2 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
Core: neoversev1
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991972f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6
julia> two_threads ≈ one_thread
false
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
false
julia> expected_result ≈ two_threads
true
The call to BLAS.dot reduces to a call to cblas_sdot, so that should be the offending OpenBLAS kernel. This is with OpenBLAS v0.3.29; I haven't tried v0.3.30. To reproduce the issue, the vectors need to have 10 million elements (it doesn't reproduce with 1 million or fewer), the two vectors need to have the same elements (but they need not be the same vector in memory, which is why I used B = copy(A)), and subtracting their mean is important (without that it doesn't reproduce), which means the array needs to contain many small numbers.
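If it helps, here is a rough standalone C reproducer along the same lines (a sketch: it generates its own approximately standard-normal data with Box-Muller instead of Julia's randn stream, so the values won't match bit-for-bit, and it assumes a plain LP64 cblas.h; an INTERFACE64 build would go through the 64_-suffixed symbols instead):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10000000;
    const double two_pi = 6.283185307179586;
    float *a = malloc((size_t)n * sizeof(float));
    /* roughly standard-normal samples via Box-Muller; rand() is crude
       but good enough to reproduce the statistical setup */
    srand(42);
    for (int i = 0; i < n; i += 2) {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double r = sqrt(-2.0 * log(u1));
        a[i] = (float)(r * cos(two_pi * u2));
        if (i + 1 < n) a[i + 1] = (float)(r * sin(two_pi * u2));
    }
    /* subtract the mean so the array is full of small numbers,
       mirroring A .- mean(A) */
    double mean = 0.0;
    for (int i = 0; i < n; i++) mean += a[i];
    mean /= n;
    for (int i = 0; i < n; i++) a[i] -= (float)mean;
    /* both inputs have the same elements, as with B = copy(A) */
    printf("sdot = %.7g\n", cblas_sdot(n, a, 1, a, 1));
    free(a);
    return 0;
}

Build with something like cc repro.c -lopenblas -lm and compare runs with OPENBLAS_NUM_THREADS=1 versus OPENBLAS_NUM_THREADS=2.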
This is specific to the neoversev1 kernel; armv8 and neoversen1 don't seem to have this accuracy issue (and no one else has reported this before; this code comes from Julia's test suite, so many people have been running it for a long time):
$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=armv8 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
Core: armv8
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991984f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.991984f6
julia> two_threads ≈ one_thread
true
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
true
julia> expected_result ≈ two_threads
true
$ OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=neoversen1 julia +nightly -q
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
Core: neoversen1
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.994247f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.993646f6
julia> two_threads ≈ one_thread
true
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
true
julia> expected_result ≈ two_threads
true
Originally reported at https://github.com/JuliaLang/julia/issues/59664.
> haven't tried v0.3.30
I saw #5293, which looked relevant and was reportedly fixed in v0.3.30 by #5294, so I did a local build of OpenBLAS at e58f6dc50da3752e938ea6bbf411db8adbb9fe42 (make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ NUM_THREADS=72 -j36), but I can still reproduce the issue:
$ OPENBLAS_VERBOSE=2 julia +nightly -q
julia> using LinearAlgebra
julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversev1
Core: neoversev2
5063
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = randn(Float32, 10_000_000);
julia> B = copy(A);
julia> BLAS.set_num_threads(2)
julia> two_threads = BLAS.dot(A .- mean(A), B .- mean(B))
9.991972f6
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6
julia> two_threads ≈ one_thread
false
julia> expected_result = sum((A .- mean(A)) .* (B .- mean(B)))
9.99456f6
julia> expected_result ≈ one_thread
false
julia> expected_result ≈ two_threads
true
#5293/#5294 specifically concerned the non-SVE dot kernel, so they would be unrelated. The SVE kernel (kernel/arm64/dot.c) appears to have had no relevant changes since its original introduction in 2022 (#3842, released with 0.3.22), and on first glance I do not see anything that would give its single-threaded invocation reduced accuracy compared to the multithreaded one. (If anything, the multithreaded path might use excess precision in the intermediate results of its threads, but if I read the code correctly this would be cast to single precision before the summation.) I wonder if this is simply the algorithmic difference in the order in which the calculations on the small numbers are performed (with the inevitable rounding).
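That order dependence is easy to see in isolation: the dot products here are around 1e7 built from order-1 increments, which puts a single-precision accumulator right at the edge where increments start rounding away (2^24). A plain C illustration, nothing OpenBLAS-specific:

#include <stdio.h>

int main(void) {
    /* above 2^24 adjacent floats are 2 apart, so adding 1.0f rounds away */
    float big = 16777216.0f;                 /* 2^24 */
    printf("%.1f\n", big + 1.0f);            /* 16777216.0: the add is lost */
    printf("%.1f\n", big + (1.0f + 1.0f));   /* 16777218.0: same adds, regrouped */
    return 0;
}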
It's definitely not awfully off, but the difference is observable.
If you have a chance to test it yourself, this is the input data (it is already A - mean(A), so you can pass it directly as input to cblas_sdot with no further preprocessing):
I chunked the file into two pieces because GitHub wouldn't allow me to upload a single 40 MB file (and compression doesn't help much); you can cat the two files in this order to reconstruct the full array. It is simply a binary file of floats, so you should be able to read it into an array with fread, as sketched below.
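For what it's worth, reading the reconstructed file and feeding it to cblas_sdot could look like this sketch ("data.bin" is a placeholder for whatever the concatenated file is named; the file stores native-endian 32-bit floats):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10000000;
    float *a = malloc((size_t)n * sizeof(float));
    FILE *f = fopen("data.bin", "rb"); /* placeholder name */
    if (!f || fread(a, sizeof(float), (size_t)n, f) != (size_t)n) {
        fprintf(stderr, "failed to read input data\n");
        return 1;
    }
    fclose(f);
    /* the data is already mean-subtracted, so pass it straight through */
    printf("sdot = %.7g\n", cblas_sdot(n, a, 1, a, 1));
    free(a);
    return 0;
}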
In the clobber list in the SVE code I don't see d0 listed, but it's used in the code. Should it be included?
The sample code uses Float32, so this would actually be the single-precision kernel. I don't see any of the s registers in the clobber list for the kernel at all, so neither s0 nor s1 is marked.
Also, it looks like none of the single-precision registers get zeroed when the function starts, but d1 always gets set to zero (even in the single-precision kernel). And d0 is never zeroed, even though it is the target of the final SUM_VECTOR.
Perhaps this set of changes would fix it?
diff --git a/kernel/arm64/dot_kernel_sve.c b/kernel/arm64/dot_kernel_sve.c
index bc9975214..174116675 100644
--- a/kernel/arm64/dot_kernel_sve.c
+++ b/kernel/arm64/dot_kernel_sve.c
@@ -63,7 +63,8 @@ THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
" mov z1.d, #0 \n" \
" mov z0.d, #0 \n" \
" mov x8, #0 \n" \
-" movi d1, #0x0 \n" \
+" movi "DTYPE"0, #0x0 \n" \
+" movi "DTYPE"1, #0x0 \n" \
SETUP_TRUE \
" neg x10, x9, lsl #1 \n" \
" ands x11, x10, x0 \n" \
@@ -111,7 +112,8 @@ dot_kernel_sve(BLASLONG n, FLOAT* x, FLOAT* y)
: "cc",
"memory",
"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
- "x8", "x9", "x10", "x11", "x12", "x13", "d1",
+ "x8", "x9", "x10", "x11", "x12", "x13",
+ DTYPE"0", DTYPE"1",
"z0", "z1"
);
Sadly no, I applied that patch locally, but when using a single thread I still get the slightly inaccurate result (this is the same as above):
julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversev2
5063
julia> BLAS.set_num_threads(1)
julia> one_thread = BLAS.dot(A .- mean(A), B .- mean(B))
9.987286f6
I guess one could try adding z2 and z3 to the clobber list, but fundamentally I do not see how a trashed register could lead to only "slightly" off results, and in the single-threaded case in particular.
After talking offline with @imciner2 and playing around with this some more, I'm getting convinced that the scalar product of such a long vector with itself is just hard to do in single precision in one go, and splitting the sum helps:
julia> using Statistics, LinearAlgebra, Random
julia> Random.seed!(42);
julia> A = let tmp = randn(Float32, 10_000_000); tmp .-= mean(tmp); end;
julia> expected_result = sum(abs2, A)
9.99456f6
julia> BLAS.set_num_threads(1)
julia> dot(A, A)
9.987286f6
julia> dot(A[begin:end÷2], A[begin:end÷2]) + dot(A[end÷2+1:end], A[end÷2+1:end])
9.991972f6
julia> BLAS.set_num_threads(2)
julia> dot(A, A)
9.991972f6
julia> dot(A[begin:end÷2], A[begin:end÷2]) + dot(A[end÷2+1:end], A[end÷2+1:end])
9.99365f6
julia> BLAS.set_num_threads(10)
julia> dot(A, A)
9.994369f6
julia> dot(A[begin:end÷2], A[begin:end÷2]) + dot(A[end÷2+1:end], A[end÷2+1:end])
9.994488f6
Splitting the operation, whether by using multiple threads, by doing the operation on two halves of the vector, or both, gives more and more accurate results.
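This is essentially what pairwise (recursive) summation formalizes. A sketch of a pairwise sdot in plain C (my own illustration, not OpenBLAS's kernel): the recursion keeps every partial sum small, so the rounding error grows roughly logarithmically in n instead of linearly:

#include <stdio.h>
#include <stdlib.h>

/* recurse until blocks are short, then accumulate directly */
static float pairwise_sdot(long n, const float *x, const float *y) {
    if (n <= 1024) {
        float s = 0.0f;
        for (long i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }
    long half = n / 2;
    return pairwise_sdot(half, x, y) +
           pairwise_sdot(n - half, x + half, y + half);
}

int main(void) {
    const long n = 20000000;
    float *x = malloc((size_t)n * sizeof(float));
    for (long i = 0; i < n; i++) x[i] = 1.0f;
    float naive = 0.0f;
    for (long i = 0; i < n; i++) naive += x[i] * x[i];
    printf("naive:    %.1f\n", naive);                  /* stuck at 2^24 = 16777216 */
    printf("pairwise: %.1f\n", pairwise_sdot(n, x, x)); /* 20000000 exactly */
    free(x);
    return 0;
}

With strict IEEE float semantics (no -ffast-math) the naive loop saturates at 2^24 on this all-ones input, while the pairwise version is exact; each extra thread in OpenBLAS effectively buys one more level of this splitting.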
Also, to confirm that there's nothing intrinsically bad about one thread: with a 100-million-element vector, even 2 to 10 threads aren't sufficient to get an accurate result, and one needs to go even higher to achieve better accuracy:
julia> Random.seed!(42);
julia> A = let tmp = randn(Float32, 100_000_000); tmp .-= mean(tmp); end;
julia> expected_result = sum(abs2, A)
1.00027256f8
julia> BLAS.set_num_threads(1)
julia> dot(A, A)
9.784091f7
julia> BLAS.set_num_threads(2)
julia> dot(A, A)
9.923342f7
julia> BLAS.set_num_threads(10)
julia> dot(A, A)
9.995446f7
julia> BLAS.set_num_threads(20)
julia> dot(A, A)
1.0000153f8
Is there a fundamental reason why you are using sdot instead of dsdot (which would at least accumulate in double precision), or switching to Float64 throughout, in particular if you know that you have a "crazily" long vector with many/mostly tiny elements?
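For reference, this is what the dsdot route looks like through the standard CBLAS interface (a sketch; cblas_dsdot takes the same float inputs but returns a double accumulated in double precision):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10000000;
    float *a = malloc((size_t)n * sizeof(float));
    for (int i = 0; i < n; i++) a[i] = 1.0f;
    /* single-precision accumulation; the exact result depends on the
       kernel's accumulation scheme, but a float only carries ~24 bits */
    printf("sdot:  %.8g\n", cblas_sdot(n, a, 1, a, 1));
    /* same float inputs, double-precision accumulator */
    printf("dsdot: %.8g\n", cblas_dsdot(n, a, 1, a, 1));
    free(a);
    return 0;
}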
That was reduced code from a test failure reported in https://github.com/JuliaLang/julia/issues/59664 for Julia v1.11, a failure I couldn't reproduce in the development version of Julia because I believe the offending code has since been changed and no longer goes through sdot, but I wanted to investigate why the dot call was getting inaccurate results.
Ah, ok. I wonder if it would make sense for "someone" to compile a page of examples of what to avoid. I guess there should be ample material hidden across the Reference-LAPACK, OpenBLAS, Julia, and *py issue trackers alone - but maybe such a page already exists somewhere?