julia
julia copied to clipboard
Imprecision of sum(::Generator)
It seems that sum
uses the naive sequential sum algorithm for generators. With large vectors, it eventually saturates, and yields an incorrect answer:
julia> N = 100000000; aa = rand(Float32, N);
julia> mean((x for x in aa))
0.16777216f0
julia> mean(aa)
0.500059f0
I have a real-world case where it causes an alarmingly large difference:
julia> mean(skipmissing(Umat))
1.0638367f0 V
julia> mean(collect(skipmissing(Umat)))
3.1320891f0 V
As @simonbyrne pointed out on discourse, sum(::Array)
already uses a smarter algorithm. It could presumably be used for generators, too.
Although pairwise summation could in principle be used for arbitrary iterators, it requires a completely different implementation than the one we have for sum(::Array)
(which requires random access).
(Removing the "missing data" label since this is about summing iterators and is not specific to missing
in any way.)
Note that you can also use the more accurate sum_kbn
from KahanSummation, which used to be in Base. It's uniformly slower than sum
(by a pretty significant amount) but has the same performance for arrays and general iterators.
I know this is an old bug, but can I just point out that this is particularly unexpected with the dims=
keyword argument. For example, only the second result is correct:
julia> using Statistics
julia> xyz = rand(Float32, 3, 10_000_000) .+ [0,0,1f5]
julia> mean(xyz, dims=2)
3×1 Matrix{Float32}:
0.4999468
0.49997416
112835.69
julia> mean(xyz[3,:])
100000.51f0
julia> sum(xyz, dims=2) / 10_000_000
3×1 Matrix{Float32}:
0.4999468
0.49997416
112835.69
I'm on julia-1.7.3.