julia icon indicating copy to clipboard operation
julia copied to clipboard

Imprecision of sum(::Generator)

Open cstjean opened this issue 6 years ago • 3 comments

It seems that sum uses the naive sequential sum algorithm for generators. With large vectors, it eventually saturates, and yields an incorrect answer:

julia> N = 100000000; aa = rand(Float32, N);

julia> mean((x for x in aa))
0.16777216f0

julia> mean(aa)
0.500059f0

I have a real-world case where it causes an alarmingly large difference:

julia> mean(skipmissing(Umat))
1.0638367f0 V

julia> mean(collect(skipmissing(Umat)))
3.1320891f0 V

As @simonbyrne pointed out on discourse, sum(::Array) already uses a smarter algorithm. It could presumably be used for generators, too.

cstjean avatar Dec 17 '18 19:12 cstjean

Although pairwise summation could in principle be used for arbitrary iterators, it requires a completely different implementation than the one we have for sum(::Array) (which requires random access).

(Removing the "missing data" label since this is about summing iterators and is not specific to missing in any way.)

stevengj avatar Dec 20 '18 03:12 stevengj

Note that you can also use the more accurate sum_kbn from KahanSummation, which used to be in Base. It's uniformly slower than sum (by a pretty significant amount) but has the same performance for arrays and general iterators.

ararslan avatar Dec 20 '18 20:12 ararslan

I know this is an old bug, but can I just point out that this is particularly unexpected with the dims= keyword argument. For example, only the second result is correct:

julia> using Statistics

julia> xyz = rand(Float32, 3, 10_000_000) .+ [0,0,1f5]

julia> mean(xyz, dims=2)
3×1 Matrix{Float32}:
      0.4999468
      0.49997416
 112835.69

julia> mean(xyz[3,:])
100000.51f0

julia> sum(xyz, dims=2) / 10_000_000
3×1 Matrix{Float32}:
      0.4999468
      0.49997416
 112835.69

I'm on julia-1.7.3.

hsgg avatar Aug 08 '22 18:08 hsgg