OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

SNRM2 has low precision on Apple ARM64

Open araujoms opened this issue 2 months ago • 12 comments

Forwarded bug report from Julia: https://github.com/JuliaLang/LinearAlgebra.jl/issues/1463

The error we get in SNRM2 is much higher than it should when computed on Apple ARM64. We tried the same thing on AMD and Intel, no problems, and also Apple Accelerate, no problems either.

Reproducing it is simple:

using LinearAlgebra
xs = randn(Float32, 10^6)
a = norm(xs)
b = norm(Float64.(xs))
abs(a-b)

If you're not familiar with the syntax, this code is generating a normally distributed vector with 10^6 32-bit floats, computing its norm with SNRM2, converting it to 64 bits, computing its norm with DNRM2, and computing the difference. It should be around 10^-5, but it is around 10^-1.

araujoms avatar Oct 10 '25 17:10 araujoms

Hmm - all I can say offhand is that it should be using the same 10 years old assembly kernel as pretty much all non-SVE arm64 targets

martin-frbg avatar Oct 10 '25 18:10 martin-frbg

It is a subtle bug, so it doesn't surprise me that it went unnoticed for 10 years.

araujoms avatar Oct 10 '25 19:10 araujoms

FWIW I see the same problem on Raspi5/ubuntu:

$ OPENBLAS_VERBOSE=2 ./julia
Falling back to generic ARMV8 core
Core: armv8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.12.0 (2025-10-07)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org release
|__/                   |

...
julia> a = norm(xs)
999.8739f0

julia> b = norm(Float64.(xs))
1000.1383812451862

julia> abs(a-b)
0.2644798779987241

(using pre-compiled binary from https://julialang.org/downloads/)

dasergatskov avatar Oct 16 '25 15:10 dasergatskov

Thanks for this bug report we found out that octave (that uses hand-written code with naive summation) has the same large error. Replacing this code with Kahan summation or with calls to blas NRM2 results in a-b == 0 (yes also on ARM64/Openblas). So there must be something else is going on. Could it be BLAS vs cblas difference (Octave uses Fortran's BLAS interface).

octave:1> xs=randn(1,1e6, "single");
octave:2> a = norm(xs)
a = 1003.4
octave:3> b = norm(double(a))
b = 1003.4
octave:4> format long
octave:5> a-b
ans = 0
octave:6> version -blas
ans = FlexiBLAS Version 3.4.5
OpenBLAS (config: OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY USE_OPENMP neoversen1 MAX_THREADS=56)

dasergatskov avatar Oct 16 '25 15:10 dasergatskov

That's a typo, you're computing the norm of a in 64 bits, not the norm of xs. It's impossible for the difference to be exactly zero.

araujoms avatar Oct 16 '25 16:10 araujoms

You right! The same difference as in Julia:

octave:9> a = norm(xs)
a = 1003.4290
octave:10> b = norm(double(xs))
b = 1003.702912607971
octave:11> a - b
ans = -0.2739868

dasergatskov avatar Oct 16 '25 16:10 dasergatskov

seeing that you're using flexiblas, can you swap OpenBLAS for the Reference ("netlib") BLAS to see if that changes anything ? (They moved to a different, supposedly better algorithm for NRM2 some time ago)

martin-frbg avatar Oct 16 '25 16:10 martin-frbg

Note, for Julia users you can get the netlib reference BLAS and test this by using this code to switch the BLAS backend:

pkg> add ReferenceBLAS_jll

julia> using Linear Algebra, ReferenceBLAS_jll

julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)

imciner2 avatar Oct 16 '25 16:10 imciner2

I see the same large difference on MacOS with NETLIB (in Octave).

With julia on Raspi5 I see:

julia> using LinearAlgebra, ReferenceBLAS_jll

julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)
155

julia> xs = randn(Float32, 10^6)
1000000-element Vector{Float32}:
 -0.63251835
...
julia> a = norm(xs)
Error: no BLAS/LAPACK library loaded for snrm2_64_()
1.4324054f-16

julia> b = norm(Float64.(xs))
Error: no BLAS/LAPACK library loaded for dnrm2_64_()
9.532824124368238e-130

P.S.: Perhaps some of those commands were redundand?

julia> using LinearAlgebra, ReferenceBLAS_jll

julia> BLAS.vendor
vendor (generic function with 1 method)

julia> xs = randn(Float32, 10^6)
1000000-element Vector{Float32}:
  0.31361395
...
 -1.1724278

julia> a = norm(xs)
998.86597f0

julia> b = norm(Float64.(xs))
999.1293504580417

julia> a - b
-0.2633836611666993

dasergatskov avatar Oct 16 '25 16:10 dasergatskov

Oh, actually the Julia code should be slightly modified - I didn't realize we build with both ILP64 and LP64 in there

pkg> add ReferenceBLAS_jll

julia> using LinearAlgebra, ReferenceBLAS_jll

julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)

julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, suffix_hint="64_")

imciner2 avatar Oct 16 '25 16:10 imciner2

I am pretty sure there should be no space in "LinearAlgebra". Anyway it is the same:

julia> using LinearAlgebra, ReferenceBLAS_jll

julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)
155

julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, suffix_hint="64_")
155

julia> xs = randn(Float32, 10^6)
1000000-element Vector{Float32}:
  0.503
...
 -1.5424652

julia> a = norm(xs)
999.34644f0

julia> b = norm(Float64.(xs))
999.5703515232086

julia> a - b
-0.2239159763336147

julia> BLAS.vendor()
:lbt

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ LP64] libblas.so
└ [ILP64] libblas.so

dasergatskov avatar Oct 16 '25 17:10 dasergatskov

That's a typo, you're computing the norm of a in 64 bits, not the norm of xs. It's impossible for the difference to be exactly zero.

Matlab ( and hence Octave) demotes "single"/"double" OPs. I.e. if a is "single" and b is "double", c = a -b is "single". So it can be exactly "0". E.g. this is on Octave with hand-written norm code using Kahan summation:

octave:1> xs=randn(1,1e6, "single");
octave:2> a = norm(xs)
a = 1003.5
octave:3> b = norm(double(xs))
b = 1003.5
octave:4> format bit
octave:5> a
a = 01000100011110101110000000101101
octave:6> b
b = 0100000010001111010111000000010110100001010010110001001111010110
octave:7> c = a -b
c = 00000000000000000000000000000000
octave:8> cc = double(a) -b
cc = 1011111011000100101100010011110101100000000000000000000000000000
octave:9> format long
octave:10> cc
cc = -2.466719479343737e-06
octave:11> c
c = 0

dasergatskov avatar Oct 16 '25 17:10 dasergatskov