SNRM2 has low precision on Apple ARM64
Forwarded bug report from Julia: https://github.com/JuliaLang/LinearAlgebra.jl/issues/1463
The error we get from SNRM2 is much higher than it should be when computed on Apple ARM64. We tried the same thing on AMD and Intel, no problems, and also with Apple Accelerate, no problems either.
Reproducing it is simple:
using LinearAlgebra
xs = randn(Float32, 10^6)
a = norm(xs)
b = norm(Float64.(xs))
abs(a-b)
If you're not familiar with the syntax: this code generates a normally distributed vector of 10^6 32-bit floats, computes its norm with SNRM2, converts the vector to 64 bits, computes its norm with DNRM2, and takes the difference. It should be around 10^-5, but it is around 10^-1.
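For scale (my own back-of-the-envelope estimate, not from the original report): the norm here is around 1000, and half the spacing between adjacent Float32 values at that magnitude is the best absolute accuracy any correctly rounded single-precision result could achieve:
julia> eps(1000f0) / 2   # half the Float32 ulp near 1000
3.0517578f-5
So an error around 10^-1 is roughly four orders of magnitude worse than the representation limit.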
Hmm - all I can say offhand is that it should be using the same 10-year-old assembly kernel as pretty much all non-SVE arm64 targets.
It is a subtle bug, so it doesn't surprise me that it went unnoticed for 10 years.
FWIW I see the same problem on Raspi5/ubuntu:
$ OPENBLAS_VERBOSE=2 ./julia
Falling back to generic ARMV8 core
Core: armv8
[Julia startup banner: Version 1.12.0 (2025-10-07), Official https://julialang.org release]
...
julia> a = norm(xs)
999.8739f0
julia> b = norm(Float64.(xs))
1000.1383812451862
julia> abs(a-b)
0.2644798779987241
(using pre-compiled binary from https://julialang.org/downloads/)
Thanks for this bug report; we found out that Octave (which uses hand-written code with naive summation) has the same large error. Replacing that code with Kahan summation, or with calls to BLAS NRM2, results in a - b == 0 (yes, also on ARM64/OpenBLAS; see the sketch after the transcript below). So there must be something else going on. Could it be a BLAS vs. CBLAS difference? (Octave uses Fortran's BLAS interface.)
octave:1> xs=randn(1,1e6, "single");
octave:2> a = norm(xs)
a = 1003.4
octave:3> b = norm(double(a))
b = 1003.4
octave:4> format long
octave:5> a-b
ans = 0
octave:6> version -blas
ans = FlexiBLAS Version 3.4.5
OpenBLAS (config: OpenBLAS 0.3.30 DYNAMIC_ARCH NO_AFFINITY USE_OPENMP neoversen1 MAX_THREADS=56)
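For reference, here is a minimal Julia sketch of the Kahan variant I mean (my own translation, not Octave's actual code), assuming a plain sum of squares followed by a square root:
function kahan_norm(xs::AbstractVector{T}) where {T<:AbstractFloat}
    s = zero(T)   # running sum of squares
    c = zero(T)   # compensation: low-order bits lost from s so far
    for x in xs
        y = x * x - c
        t = s + y
        c = (t - s) - y   # recover what was lost when adding y to s
        s = t
    end
    return sqrt(s)
end
On a randn(Float32, 10^6) input this should stay within a few ulps of the Float64 result. (It does no scaling, so unlike a real NRM2 implementation it can overflow or underflow for extreme inputs.)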
That's a typo, you're computing the norm of a in 64 bits, not the norm of xs. It's impossible for the difference to be exactly zero.
You're right! The same difference as in Julia:
octave:9> a = norm(xs)
a = 1003.4290
octave:10> b = norm(double(xs))
b = 1003.702912607971
octave:11> a - b
ans = -0.2739868
Seeing that you're using FlexiBLAS, can you swap OpenBLAS for the reference ("netlib") BLAS to see if that changes anything? (They moved to a different, supposedly better algorithm for NRM2 some time ago.)
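(With FlexiBLAS that should not need a rebuild; if I remember the tooling correctly, something like
$ flexiblas list            # show the backends FlexiBLAS knows about
$ FLEXIBLAS=NETLIB octave   # run Octave against the reference BLAS
should do it.)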
Note for Julia users: you can get the netlib reference BLAS and test this by using the following to switch the BLAS backend:
pkg> add ReferenceBLAS_jll
julia> using Linear Algebra, ReferenceBLAS_jll
julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)
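(If I read the libblastrampoline docs correctly, the integer lbt_forward returns is the number of symbols it forwarded to the new backend, so a positive number there just means the forwarding took effect.)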
I see the same large difference on MacOS with NETLIB (in Octave).
With julia on Raspi5 I see:
julia> using LinearAlgebra, ReferenceBLAS_jll
julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)
155
julia> xs = randn(Float32, 10^6)
1000000-element Vector{Float32}:
-0.63251835
...
julia> a = norm(xs)
Error: no BLAS/LAPACK library loaded for snrm2_64_()
1.4324054f-16
julia> b = norm(Float64.(xs))
Error: no BLAS/LAPACK library loaded for dnrm2_64_()
9.532824124368238e-130
P.S.: Perhaps some of those commands were redundant?
julia> using LinearAlgebra, ReferenceBLAS_jll
julia> BLAS.vendor
vendor (generic function with 1 method)
julia> xs = randn(Float32, 10^6)
1000000-element Vector{Float32}:
0.31361395
...
-1.1724278
julia> a = norm(xs)
998.86597f0
julia> b = norm(Float64.(xs))
999.1293504580417
julia> a - b
-0.2633836611666993
Oh, actually the Julia code should be slightly modified - I didn't realize we build with both ILP64 and LP64 in there:
pkg> add ReferenceBLAS_jll
julia> using LinearAlgebra, ReferenceBLAS_jll
julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)
julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, suffix_hint="64_")
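(For context, my understanding of libblastrampoline rather than anything stated in this thread: Julia's own code calls the ILP64 interface, whose symbols carry a "64_" suffix (that is what the "no BLAS/LAPACK library loaded for snrm2_64_()" errors above were about), so the reference library has to be forwarded for that interface as well; the suffix_hint tells LBT which symbol names to look for.)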
I am pretty sure there should be no space in "LinearAlgebra". Anyway, it is the same:
julia> using LinearAlgebra, ReferenceBLAS_jll
julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, clear=true)
155
julia> BLAS.lbt_forward(ReferenceBLAS_jll.libblas_path, suffix_hint="64_")
155
julia> xs = randn(Float32, 10^6)
1000000-element Vector{Float32}:
0.503
...
-1.5424652
julia> a = norm(xs)
999.34644f0
julia> b = norm(Float64.(xs))
999.5703515232086
julia> a - b
-0.2239159763336147
julia> BLAS.vendor()
:lbt
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ LP64] libblas.so
└ [ILP64] libblas.so
> That's a typo, you're computing the norm of a in 64 bits, not the norm of xs. It's impossible for the difference to be exactly zero.
Matlab (and hence Octave) demotes mixed "single"/"double" operations, i.e. if a is "single" and b is "double", then c = a - b is "single".
So it can be exactly 0. E.g. this is Octave with hand-written norm code using Kahan summation:
octave:1> xs=randn(1,1e6, "single");
octave:2> a = norm(xs)
a = 1003.5
octave:3> b = norm(double(xs))
b = 1003.5
octave:4> format bit
octave:5> a
a = 01000100011110101110000000101101
octave:6> b
b = 0100000010001111010111000000010110100001010010110001001111010110
octave:7> c = a -b
c = 00000000000000000000000000000000
octave:8> cc = double(a) -b
cc = 1011111011000100101100010011110101100000000000000000000000000000
octave:9> format long
octave:10> cc
cc = -2.466719479343737e-06
octave:11> c
c = 0
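For contrast, a quick check of my own (not from the thread): Julia promotes rather than demotes mixed-precision arithmetic, so the Float32 - Float64 subtraction in the reproducer is carried out in Float64 and keeps the small difference that Octave's demotion to "single" rounds away:
julia> Float32(1) - nextfloat(1.0)   # promotes to Float64, difference survives
-2.220446049250313e-16
In Octave the same subtraction would be demoted to single precision and come out as exactly 0.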