
Wrong weighted sum and mean results for complex numbers

aplavin opened this issue on Sep 07 '19 • 63 comments

Weighted mean gives wrong (conjugated) result for complex arrays: mean([1 + 0.5im], weights([1])) == 1.0 - 0.5im.

aplavin avatar Sep 07 '19 09:09 aplavin

Indeed. That's because we use dot([1 + 0.5im], [1]) to compute the weighted sum (for performance reasons). I guess we could simply do dot([1], [1 + 0.5im]) since only the first argument is conjugated, and we know weights are real.
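(A quick REPL sketch of the asymmetry - dot conjugates only its first argument:)

julia> using LinearAlgebra

julia> dot([1 + 0.5im], [1])  # complex values first: spuriously conjugated
1.0 - 0.5im

julia> dot([1], [1 + 0.5im])  # real weights first: correct
1.0 + 0.5im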

Would you make a pull request to apply that change (in weights.jl) and add a test?

nalimilan avatar Sep 09 '19 13:09 nalimilan

Just ran into this in another context: #534 isn't quite the same, because it's about values that are vectors of real numbers, but the cause is the (recursive) behavior of dot(...) as well.

I think using dot in sum(A, w) is wrong, semantically. Could we change the definition of sum(A, w) to something like sum(A .* values(w)) or mapreduce(*, +, A, values(w))? I can do a PR, if there are no objections.

oschulz avatar Nov 06 '19 13:11 oschulz

The reason we use dot AFAICT is that it's much more efficient thanks to BLAS. It's not an issue to abuse it if we are sure it's correct in all cases in which we use it. What we can do is to have a method using it for arrays of numbers, and a fall back method using a slower but generally correct approach.

nalimilan avatar Nov 06 '19 16:11 nalimilan

Sounds good. Regarding where to draw the line - BLAS would only accelerate Real, right (and would do the wrong thing for Complex in this use case)?

So we'd do

Base.sum(v::AbstractArray{<:Real}, w::AbstractWeights) = dot(v, values(w))

Base.sum(v::AbstractArray, w::AbstractWeights) = ...

right? @nalimilan, what do you think would be best for the generic implementation:

# Option 1:
sum(A .* values(w))

# Option 2:
sum(Base.Broadcast.Broadcasted(*, (A, values(w))))

# Option 3:
mapreduce(*, +, A, values(w))

oschulz avatar Nov 06 '19 16:11 oschulz

I've been thinking - maybe we should drop use of dot altogether, though, even if it means a performance loss, since we'd gain precision. AFAIK, Julia's sum() uses pairwise summation, to keep sums stable when summing over values with large differences in magnitude. But I don't think BLAS does fancy summing:

julia> A = Float32.(vcat(1.0, fill(1E-10, 10^6)))
julia> w = Weights(fill(Float32(1.0), length(A)))

julia> sum(A)
1.0000999f0

julia> sum(A, w), dot(A, values(w))
(1.0000969f0, 1.0000969f0)

julia> mapreduce(*, +, A, values(w))
1.0000999f0

Though I just saw that we're using a simple sum in StatsBase._moment2 instead of Welford's algorithm or similar - is that intentional? Statistics._varm uses centralize_sumabs2 for stability.
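(As an aside, a minimal sketch of Welford's online algorithm for the unweighted sample variance, for comparison - illustrative code, not what StatsBase does:)

function welford_var(xs)
    n = 0
    m = 0.0   # running mean
    M2 = 0.0  # running sum of squared deviations from the mean
    for x in xs
        n += 1
        δ = x - m
        m += δ / n
        M2 += δ * (x - m)
    end
    return M2 / (n - 1)  # sample variance
end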

oschulz avatar Nov 06 '19 17:11 oschulz

Options 2 and 3 are the best ones. Not sure how to choose between them, I guess benchmarking is needed.

mapreduce(*, +, A, values(w)) doesn't use pairwise summation either, so I think using BLAS is better. Can you check whether sum(Base.Broadcast.Broadcasted(*, (A, values(w)))) uses pairwise summation?

IIRC we use Welford's algorithm when possible in var, but not always. Please file a separate issue if that's not the case.

nalimilan avatar Nov 06 '19 17:11 nalimilan

mapreduce(*, +, A, values(w)) doesn't use pairwise summation either [...]

Yes, it does (AFAIK the default Base.sum is actually defined via mapreduce, which operates pairwise). Here's a more extreme example with a significant difference in result:

julia> A = vcat([Float32(1E0)], fill(Float32(1E-8), 10^8));

julia> w = Weights(fill(Float32(1.0), length(A)));

julia> sum(A)
1.9999989f0

julia> mapreduce(*, +, A, w)
1.9999989f0

julia> sum(A, w)
1.9384409f0

julia> dot(A, values(w))
1.9384409f0

IIRC we use Welford's algorithm when possible in var, but not always. Please file a separate issue if that's not the case.

Oh, sure - I just stumbled over it while looking into sum and wasn't sure if it was intentional.

oschulz avatar Nov 06 '19 17:11 oschulz

Actually mapreduce(*, +, A, w) allocates a copy to store the result, so that's not an option.

nalimilan avatar Nov 06 '19 18:11 nalimilan

Actually mapreduce(*, +, A, w) allocates a copy to store the result, so that's not an option.

Oops, indeed - I naively thought it would be allocation-free. In that case, mapreduce is out, of course.

oschulz avatar Nov 06 '19 21:11 oschulz

sum(Base.Broadcast.Broadcasted(*, (A, values(w)))) seems to be allocation-free, but it's fairly slow.

oschulz avatar Nov 06 '19 21:11 oschulz

Ok, so maybe we do

Base.sum(v::AbstractArray{<:Real}, w::AbstractWeights) = dot(v, values(w))

Base.sum(v::AbstractArray, w::AbstractWeights) = sum(Base.Broadcast.Broadcasted(*, (v, values(w))))

for now, to get correct behavior in the general case without performance loss in the specific case (real numbers), and worry about precision improvement in a separate issue? Would that be fine with you, @nalimilan?

oschulz avatar Nov 06 '19 22:11 oschulz

As I noted in my first comment, I think this will work:

Base.sum(v::AbstractArray{<:Complex}, w::AbstractWeights) = dot(w, v)

For the general case, it looks like broadcasting is slower than the dot fallback:

julia> using LinearAlgebra, BenchmarkTools

julia> x = rand(10_000);

julia> z = Vector{Union{Float64, Missing}}(x);

julia> @btime dot(x, z);
  13.443 μs (1 allocation: 16 bytes)

julia> f(A, w) = sum(Base.Broadcast.Broadcasted(*, (A, w)))
f (generic function with 1 method)

julia> @btime f(x, z);
  22.258 μs (11 allocations: 240 bytes)

Since we don't actually need the broadcasting features, I guess we should simply use a loop similar to dot, but without calling dot recursively:

function dot(x::AbstractArray, y::AbstractArray)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    if lx == 0
        return dot(zero(eltype(x)), zero(eltype(y)))
    end
    s = zero(dot(first(x), first(y)))
    for (Ix, Iy) in zip(eachindex(x), eachindex(y))
        @inbounds s += dot(x[Ix], y[Iy])
    end
    s
end

nalimilan avatar Nov 07 '19 09:11 nalimilan

I guess we should simply use a loop similar to dot, but without calling dot recursively

You mean something like this?

function dotnr_loop(x::AbstractArray, y::AbstractArray)
    lx = length(x)
    if lx != length(y)
        throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
    end
    if lx == 0
        return zero(eltype(x)) * zero(eltype(y))
    end
    s = zero(first(x) * first(y))
    for (Ix, Iy) in zip(eachindex(x), eachindex(y))
        @inbounds s += x[Ix] * y[Iy]
    end
    s
end

For the general case, it looks like broadcasting is slower than the dot fallback

Sure, where dot can be used, we should use dot. But for the generic use case, it's not so clear to me that a hard-coded loop is superior to a broadcast-based implementation.

With my use case from #534 (computing the weighted center of a cluster of 3D points), dotnr_loop is indeed somewhat faster than sum over a lazy broadcast (equivalently, reduce(+, ...) over the broadcast), but not dramatically:

julia> using StatsBase, Statistics, LinearAlgebra, StaticArrays, BenchmarkTools

julia> x = rand(SVector{3,Float64}, 10^3);
julia> y = rand(Float64, length(x));

julia> @benchmark dotnr_loop($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.189 μs (0.00% GC)
  median time:      1.190 μs (0.00% GC)
  mean time:        1.203 μs (0.00% GC)
  maximum time:     2.948 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark sum(Base.Broadcast.Broadcasted(*, ($x, values($y))))
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     1.510 μs (0.00% GC)
  median time:      1.512 μs (0.00% GC)
  mean time:        1.518 μs (0.00% GC)
  maximum time:     3.178 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark reduce(+, Base.Broadcast.Broadcasted(*, ($x, values($y))))
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     1.511 μs (0.00% GC)
  median time:      1.519 μs (0.00% GC)
  mean time:        1.524 μs (0.00% GC)
  maximum time:     2.611 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

On the other hand, reduce + broadcast is simpler, more generic and will also run on a GPU (currently needs explicit init element):

dotnr_bcast(x::AbstractArray, y::AbstractArray) = reduce(
    +,
    Base.Broadcast.Broadcasted(*, (x, values(y))),
    init = zero(eltype(x)) * zero(eltype(y))
)

So when the data becomes larger

julia> using CuArrays
julia> x = rand(SVector{3,Float64}, 10^6);
julia> y = rand(Float64, length(x));
julia> cx = CuArray(x);
julia> cy = CuArray(y);
julia> CuArrays.allowscalar(false)


julia> @btime dotnr_loop($x, $y)
  1.813 ms (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 249588.56051677305
 250059.7208080707 
 249841.12107313547

julia> @btime dotnr_bcast($x, $y)
  2.492 ms (2 allocations: 48 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 249588.56051677305
 250059.7208080707 
 249841.12107313547

julia> @btime dotnr_bcast($cx, $cy)
  63.029 μs (91 allocations: 5.14 KiB)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 249588.56051677145
 250059.72080806308
 249841.12107313433

then dotnr_loop with the explicit loop is still a bit faster than dotnr_bcast - but on the other hand, dotnr_bcast can run on a GPU and blow dotnr_loop out of the water.

So I wonder if we shouldn't go with the broadcast-based implementation, even if it's a bit slower on a CPU (at least for this use case it is). For eltype(x) <: Number we'd keep using dot of course, for full BLAS performance.

oschulz avatar Nov 07 '19 22:11 oschulz

Long term, we should find an alternative to dot for the "simple numbers" case as well, though, and use a more stable summing algorithm (KBN or pairwise).
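(For reference, a minimal sketch of Kahan-Babuška-Neumaier (KBN) compensated summation, one of the stable algorithms mentioned above - illustrative only:)

function sum_kbn_sketch(xs)
    s = zero(eltype(xs))
    c = zero(eltype(xs))  # compensation term for lost low-order bits
    for x in xs
        t = s + x
        if abs(s) >= abs(x)
            c += (s - t) + x
        else
            c += (x - t) + s
        end
        s = t
    end
    return s + c
end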

oschulz avatar Nov 08 '19 08:11 oschulz

You mean something like this?

Yes.

On the other hand, reduce + broadcast is simpler, more generic and will also run on a GPU (currently needs explicit init element): ... then dotnr_loop with the explicit loop is still a bit faster than dotnr_bcast - but on the other hand, dotnr_bcast can run on a GPU and blow dotnr_loop out of the water.

Tough call. I guess we should try to figure out why broadcast is slower than a loop on CPU. A similar alternative is using a generator, but in my tests it's also slower than a loop. Can you investigate this? This is of interest for Julia in general, not only for weighted sums, especially if a custom syntax is added for lazy broadcasting (https://github.com/JuliaLang/julia/pull/31088).
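(The generator alternative mentioned above might look like this sketch - wsum_gen is a hypothetical name; like the lazy broadcast, it avoids materializing the products:)

wsum_gen(A, w) = sum(a * wi for (a, wi) in zip(A, w))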

Long term, we should find an alternative to dot for the "simple numbers" case as well, though, and use a more stable summing algorithm (KBN or pairwise).

Actually sum on Broadcasted objects could easily use the same pairwise algorithm as for arrays. AFAICT that's what https://github.com/JuliaLang/julia/pull/31020 implements. It would be interesting to check what's the performance on that branch, if you feel like benchmarking it.

nalimilan avatar Nov 08 '19 14:11 nalimilan

Can you investigate this? This is of interest for Julia in general, not only for weighted sum, especially if a custom syntax is added for lazy broadcasting

Well, I can try - but this may go fairly deep, we might need some core-dev experts there. :-) I assume some of the core (and not-so-core) developers are frequently thinking about how to improve broadcasting & friends anyway. :-)

Actually sum on Broadcasted objects could easily use the same pairwise algorithm as for arrays.

I thought it did ... but apparently not (yet):

julia> A = vcat([Float32(1E0)], fill(Float32(1E-8), 10^8));

julia> w = Weights(fill(Float32(1.0), length(A)));

julia> reduce(+, broadcast(*, A, values(w)))
1.9999989f0

julia> reduce(+, Base.Broadcast.Broadcasted(*, (A, values(w))))
1.0f0

Oops. But as you say, maybe that's coming soon?

oschulz avatar Nov 08 '19 17:11 oschulz

Oops. But as you say, maybe that's coming soon?

At least it's implemented in JuliaLang/julia#31020, so it's quite close.

nalimilan avatar Nov 08 '19 17:11 nalimilan

OK, I've tested that branch, and summing Broadcasted is still slower. It seems that it's due to the fact that IndexStyle(::Broadcasted) = IndexCartesian(). Indeed, with a hack to force them to use IndexLinear, the speed is the same as sum, and reasonably close to dot:

julia> using BenchmarkTools

julia> x = rand(100_000);

julia> y = rand(100_000);

# This isn't correct in general, but OK for this simple case
julia> Base.LinearIndices(bc::Base.Broadcast.Broadcasted{<:Any}) = axes(bc)[1]

julia> @btime sum(x);
  16.212 μs (1 allocation: 16 bytes)

julia> @btime Base._mapreduce(identity, Base.add_sum, IndexLinear(), Base.Broadcast.Broadcasted(identity, (x,)));
  16.942 μs (3 allocations: 48 bytes)

julia> @btime Base._mapreduce(identity, Base.add_sum, IndexLinear(), Base.Broadcast.Broadcasted(*, (x, y)));
  32.016 μs (3 allocations: 64 bytes)

julia> using LinearAlgebra

julia> @btime dot(x, y);
  17.722 μs (1 allocation: 16 bytes)

Unfortunately this hack cannot be applied as-is, but it seems worth investigating whether we can ensure that Broadcasted(*, (x, y)) can use IndexLinear when both inputs are also IndexLinear. Then, combined with #31020, we could stop using dot.

nalimilan avatar Nov 08 '19 19:11 nalimilan

The advantage of using broadcast + reduce, compared to a custom implementation, would be that we'll take advantage of future improvements automatically. :-)

oschulz avatar Nov 08 '19 19:11 oschulz

Indeed, with a hack to force them to use IndexLinear, the speed is the same as sum, and reasonably close to dot

Nice!!

oschulz avatar Nov 08 '19 19:11 oschulz

IndexStyle(::Broadcasted) = IndexCartesian()

I looked for that definition, but I can't find it - is it something implicit?

I tried to run your example, but I get

julia> @btime Base._mapreduce(identity, Base.add_sum, IndexLinear(), Base.Broadcast.Broadcasted(identity, (x,)));
ERROR: MethodError: no method matching _mapreduce(::typeof(identity), ::typeof(Base.add_sum), ::IndexLinear, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(identity),Tuple{Array{Float64,1}}})

oschulz avatar Nov 08 '19 20:11 oschulz

Ah, let me guess, I need JuliaLang/julia#31020, of course?

oschulz avatar Nov 08 '19 20:11 oschulz

Hm, we can't just override Base.LinearIndices(bc::Base.Broadcast.Broadcasted{<:Any}) though, can we? ;-) And using an internal function like _mapreduce should be avoided if possible, I guess.

But maybe we could use

Base.Broadcast.Broadcasted(*, (A, w))

instead of

Base.Broadcast.Broadcasted(*, (A, values(w)))

and then add a method to IndexStyle for the specific signature of Broadcasted with Weights in it (since StatsBase owns Weights)?

Or maybe it would be possible to do this in a generic way, so that IndexStyle(::Broadcasted{...}) where {...} = IndexLinear() in more cases (e.g. all args are AbstractVectors)?
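(A hypothetical sketch of such a method - the signature is illustrative and untested; Broadcasted's type parameters are {Style, Axes, F, Args}:)

Base.IndexStyle(::Type{<:Base.Broadcast.Broadcasted{<:Any,<:Any,typeof(*),<:Tuple{AbstractVector,AbstractVector}}}) =
    IndexLinear()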

oschulz avatar Nov 08 '19 20:11 oschulz

Actually after reading more carefully https://github.com/JuliaLang/julia/pull/31020/files#r256266019 I realized it already works, one just needs to call instantiate:

julia> @btime sum(Base.Broadcast.instantiate(Base.Broadcast.Broadcasted(*, (x, y))));
  31.200 μs (4 allocations: 96 bytes)

So we just need to wait for the PR to be merged in Julia (hopefully in time for 1.4), and we will be able to stop using dot. I guess if you want you can prepare a PR with a VERSION check, that will be needed anyway to avoid performance regression on older Julia versions.
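(A rough sketch of how the version-gated definitions might look - illustrative only, not the actual PR:)

if VERSION >= v"1.5.0-DEV"
    # reduction over a lazy broadcast uses pairwise summation (JuliaLang/julia#31020)
    Base.sum(v::AbstractArray, w::AbstractWeights) =
        sum(Base.Broadcast.instantiate(Base.Broadcast.Broadcasted(*, (v, values(w)))))
else
    # keep the fast BLAS path for real numbers on older Julia versions
    Base.sum(v::AbstractArray{<:Real}, w::AbstractWeights) = dot(v, values(w))
end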

nalimilan avatar Nov 08 '19 22:11 nalimilan

Sounds great! Regarding the PR, sure - I guess I'll wait until #31020 is merged though, so that CI tests can run.

oschulz avatar Nov 09 '19 07:11 oschulz

Actually I've just realized that another way of using pairwise summation would be to create a custom internal AbstractArray type for which getindex returns the product of the corresponding elements of the values and weights arrays. If you want to try that, it would be useful to support current Julia versions until we can simply use broadcasting (which should be strictly equivalent, but with less code).
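(Something along these lines, perhaps - the type name and details here are hypothetical:)

# Lazy array whose getindex multiplies the corresponding value and weight,
# so that sum() can reuse Base's pairwise summation unchanged.
struct WeightedValues{T,V<:AbstractVector,W<:AbstractVector} <: AbstractVector{T}
    values::V
    weights::W
end

WeightedValues(v::AbstractVector, w::AbstractVector) =
    WeightedValues{typeof(zero(eltype(v)) * zero(eltype(w))),typeof(v),typeof(w)}(v, w)

Base.size(a::WeightedValues) = size(a.values)
Base.IndexStyle(::Type{<:WeightedValues}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(a::WeightedValues, i::Int) =
    a.values[i] * a.weights[i]

# sum(WeightedValues(A, values(w))) then benefits from pairwise summation.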

nalimilan avatar Nov 09 '19 10:11 nalimilan

Oh yes - like a kind of specialized multi-input mapped-array, so to speak. Yes, that would work, I think. I'm a bit overloaded at the moment (some deadlines coming up), but I'll definitely put it on my to-do list.

oschulz avatar Nov 09 '19 20:11 oschulz

@oschulz https://github.com/JuliaLang/julia/pull/31020 has just been merged (at last!), so if you can give a try at a PR with a VERSION >= v"1.5" check that would be great.

nalimilan avatar May 02 '20 11:05 nalimilan

Oh, yes - will do!

oschulz avatar May 02 '20 12:05 oschulz

Will try to do it at the end of next week, I'm a bit overloaded until then.

oschulz avatar May 02 '20 12:05 oschulz