Wrong weighted sum and mean results for complex numbers
Weighted mean gives wrong (conjugated) result for complex arrays: mean([1 + 0.5im], weights([1])) == 1.0 - 0.5im.
Indeed. That's because we use dot([1 + 0.5im], [1]) to compute the weighted sum (for performance reasons). I guess we could simply do dot([1], [1 + 0.5im]) since only the first argument is conjugated, and we know weights are real.
Would you make a pull request to apply that change (in weights.jl) and add a test?
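For reference, the asymmetry can be checked directly in the REPL (LinearAlgebra's dot conjugates its first argument, so with real weights, putting them first leaves the values unconjugated):

```julia
using LinearAlgebra

v = [1 + 0.5im]
w = [1.0]

dot(v, w)  # first argument conjugated: 1.0 - 0.5im
dot(w, v)  # real weights first, values unconjugated: 1.0 + 0.5im
```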
Just ran into this in another context: #534 isn't quite the same, because it's about values that are vectors of real numbers, but the cause is the (recursive) behavior of dot(...) as well.
I think using dot in sum(A, w) is wrong, semantically. Could we change the definition of sum(A, w) to something like sum(A .* values(w)) or mapreduce(*, +, A, values(w))? I can do a PR, if there are no objections.
The reason we use dot AFAICT is that it's much more efficient thanks to BLAS. It's not an issue to abuse it if we are sure it's correct in all cases in which we use it. What we can do is to have a method using it for arrays of numbers, and a fall back method using a slower but generally correct approach.
Sounds good. Regarding where to draw the line: BLAS would only accelerate Real, right (and would do the wrong thing for Complex in this use case)?
So we'd do
Base.sum(v::AbstractArray{<:Real}, w::AbstractWeights) = dot(v, values(w))
Base.sum(v::AbstractArray, w::AbstractWeights) = ...
right? @nalimilan, what do you think would be best for the generic implementation:
# Option 1:
sum(A .* values(w))
# Option 2:
sum(Base.Broadcast.Broadcasted(*, (A, values(w))))
# Option 3:
mapreduce(*, +, A, values(w))
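On a small real-valued example (with a plain vector w standing in for values(w)), all three options agree; note that Option 2 relies on reduction over Broadcasted, which may need Base.Broadcast.instantiate on some Julia versions:

```julia
A = [1.0, 2.0, 3.0]
w = [0.5, 0.25, 0.25]

sum(A .* w)                                 # Option 1: allocates a temporary array
sum(Base.Broadcast.Broadcasted(*, (A, w)))  # Option 2: lazy, no intermediate array
mapreduce(*, +, A, w)                       # Option 3
# all three give 1.75
```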
I've been thinking - maybe we should drop use of dot altogether, though, even if it means a performance loss, since we'd gain precision. AFAIK, Julia's sum() uses pairwise summation, to keep sums stable when summing over values with large differences in magnitude. But I don't think BLAS does fancy summing:
julia> A = Float32.(vcat(1.0, fill(1E-10, 10^6)))
julia> w = Weights(fill(Float32(1.0), length(A)))
julia> sum(A)
1.0000999f0
julia> sum(A, w), dot(A, values(w))
(1.0000969f0, 1.0000969f0)
julia> mapreduce(*, +, A, values(w))
1.0000999f0
Though I just saw that we're using a simple sum in StatsBase._moment2 instead of Welford's algorithm or similar - is that intentional? Statistics._varm uses centralize_sumabs2 for stability.
Options 2 and 3 are the best ones. Not sure how to choose between them, I guess benchmarking is needed.
mapreduce(*, +, A, values(w)) doesn't use pairwise summation either, so I think using BLAS is better. Can you check whether sum(Base.Broadcast.Broadcasted(*, (A, values(w)))) uses pairwise summation?
IIRC we use Welford's algorithm when possible in var, but not always. Please file a separate issue if that's not the case.
mapreduce(*, +, A, values(w)) doesn't use pairwise summation either [...]
Yes, it does (AFAIK the default Base.sum is actually defined via mapreduce, which operates pairwise). Here's a more extreme example with a significant difference in result:
julia> A = vcat([Float32(1E0)], fill(Float32(1E-8), 10^8));
julia> w = Weights(fill(Float32(1.0), length(A)));
julia> sum(A)
1.9999989f0
julia> mapreduce(*, +, A, w)
1.9999989f0
julia> sum(A, w)
1.9384409f0
julia> dot(A, values(w))
1.9384409f0
IIRC we use Welford's algorithm when possible in var, but not always. Please file a separate issue if that's not the case.
Oh, sure - I just stumbled over it while looking into sum and wasn't sure if it was intentional.
Actually mapreduce(*, +, A, w) allocates a copy to store the result, so that's not an option.
Actually mapreduce(*, +, A, w) allocates a copy to store the result, so that's not an option.
Oops, indeed - I naively thought it would be allocation-free. In that case, mapreduce is out, of course.
sum(Base.Broadcast.Broadcasted(*, (A, values(w)))) seems to be allocation-free, but it's fairly slow.
Ok, so maybe we do
Base.sum(v::AbstractArray{<:Real}, w::AbstractWeights) = dot(v, values(w))
Base.sum(v::AbstractArray, w::AbstractWeights) = sum(Base.Broadcast.Broadcasted(*, (v, values(w))))
for now, to get correct behavior in the general case without performance loss in the specific case (real numbers), and worry about precision improvement in a separate issue? Would that be fine with you, @nalimilan?
As I noted in my first comment, I think this will work:
Base.sum(v::AbstractArray{<:Complex}, w::AbstractWeights) = dot(w, v)
For the general case, it looks like broadcasting is slower than the dot fallback:
julia> x = rand(10_000);
julia> z = Vector{Union{Float64, Missing}}(x);
julia> @btime dot(x, z);
13.443 μs (1 allocation: 16 bytes)
julia> f(A, w) = sum(Base.Broadcast.Broadcasted(*, (A, w)))
f (generic function with 1 method)
julia> @btime f(x, z);
22.258 μs (11 allocations: 240 bytes)
Since we don't actually need the broadcasting features, I guess we should simply use a loop similar to dot, but without calling dot recursively:
function dot(x::AbstractArray, y::AbstractArray)
lx = length(x)
if lx != length(y)
throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
end
if lx == 0
return dot(zero(eltype(x)), zero(eltype(y)))
end
s = zero(dot(first(x), first(y)))
for (Ix, Iy) in zip(eachindex(x), eachindex(y))
@inbounds s += dot(x[Ix], y[Iy])
end
s
end
I guess we should simply use a loop similar to dot, but without calling dot recursively
You mean something like this?
function dotnr_loop(x::AbstractArray, y::AbstractArray)
lx = length(x)
if lx != length(y)
throw(DimensionMismatch("first array has length $(lx) which does not match the length of the second, $(length(y))."))
end
if lx == 0
return zero(eltype(x)) * zero(eltype(y))
end
s = zero(first(x) * first(y))
for (Ix, Iy) in zip(eachindex(x), eachindex(y))
@inbounds s += x[Ix] * y[Iy]
end
s
end
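A compact, self-contained equivalent of that loop (hypothetical name wsum_loop, for illustration only) shows the point of not recursing into dot: complex values stay unconjugated:

```julia
# Hypothetical compact equivalent of dotnr_loop above, for illustration only.
wsum_loop(x, y) = isempty(x) ? zero(eltype(x)) * zero(eltype(y)) :
    sum(x[i] * y[j] for (i, j) in zip(eachindex(x), eachindex(y)))

wsum_loop([1 + 0.5im], [1.0])       # 1.0 + 0.5im (dot would give 1.0 - 0.5im)
wsum_loop([1.0, 2.0], [0.5, 0.25])  # 1.0
```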
For the general case, it looks like broadcasting is slower than the dot fallback
Sure, where dot can be used, we should use dot. But for the generic use case, it's not so clear to me that a hard-coded loop is superior to a broadcast-based implementation.
With my use case from #534 (computing the weighted center of a cluster of 3D-points), dotnr_loop is indeed somewhat faster than sum over a broadcast (equivalent to reduce(+, ...) over a broadcast), but not dramatically:
julia> using StatsBase, Statistics, LinearAlgebra, StaticArrays, BenchmarkTools
julia> x = rand(SVector{3,Float64}, 10^3);
julia> y = rand(Float64, length(x));
julia> @benchmark dotnr_loop($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.189 μs (0.00% GC)
median time: 1.190 μs (0.00% GC)
mean time: 1.203 μs (0.00% GC)
maximum time: 2.948 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark sum(Base.Broadcast.Broadcasted(*, ($x, values($y))))
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 2
--------------
minimum time: 1.510 μs (0.00% GC)
median time: 1.512 μs (0.00% GC)
mean time: 1.518 μs (0.00% GC)
maximum time: 3.178 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark reduce(+, Base.Broadcast.Broadcasted(*, ($x, values($y))))
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 2
--------------
minimum time: 1.511 μs (0.00% GC)
median time: 1.519 μs (0.00% GC)
mean time: 1.524 μs (0.00% GC)
maximum time: 2.611 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
On the other hand, reduce + broadcast is simpler, more generic and will also run on a GPU (currently needs explicit init element):
dotnr_bcast(x::AbstractArray, y::AbstractArray) = reduce(
+,
Base.Broadcast.Broadcasted(*, (x, values(y))),
init = zero(eltype(x)) * zero(eltype(y))
)
So when the data becomes larger
julia> using CuArrays
julia> x = rand(SVector{3,Float64}, 10^6);
julia> y = rand(Float64, length(x));
julia> cx = CuArray(x);
julia> cy = CuArray(y);
julia> CuArrays.allowscalar(false)
julia> @btime dotnr_loop($x, $y)
1.813 ms (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
249588.56051677305
250059.7208080707
249841.12107313547
julia> @btime dotnr_bcast($x, $y)
2.492 ms (2 allocations: 48 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
249588.56051677305
250059.7208080707
249841.12107313547
julia> @btime dotnr_bcast($cx, $cy)
63.029 μs (91 allocations: 5.14 KiB)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
249588.56051677145
250059.72080806308
249841.12107313433
then dotnr_loop with the explicit loop is still a bit faster than dotnr_bcast - but on the other hand, dotnr_bcast can run on a GPU and blow dotnr_loop out of the water.
So I wonder if we shouldn't go with the broadcast-based implementation, even if it's a bit slower on a CPU (at least for this use case it is). For eltype(x) <: Number we'd keep using dot of course, for full BLAS performance.
Long term, we should find an alternative to dot for the "simple numbers" case as well, though, and use a more stable summing algorithm (KBN or pairwise).
You mean something like this?
Yes.
On the other hand, reduce + broadcast is simpler, more generic and will also run on a GPU (currently needs explicit init element): ... then
dotnr_loop with the explicit loop is still a bit faster than dotnr_bcast - but on the other hand, dotnr_bcast can run on a GPU and blow dotnr_loop out of the water.
Tough call. I guess we should try to figure out why broadcast is slower than a loop on CPU. A similar alternative is using a generator, but in my tests it's also slower than a loop. Can you investigate this? This is of interest for Julia in general, not only for weighted sums, especially if a custom syntax is added for lazy broadcasting (https://github.com/JuliaLang/julia/pull/31088).
Long term, we should find an alternative to dot for the "simple numbers" case as well, though, and use a more stable summing algorithm (KBN or pairwise).
Actually sum on Broadcasted objects could easily use the same pairwise algorithm as for arrays. AFAICT that's what https://github.com/JuliaLang/julia/pull/31020 implements. It would be interesting to check what's the performance on that branch, if you feel like benchmarking it.
Can you investigate this? This is of interest for Julia in general, not only for weighted sum, especially if a custom syntax is added for lazy broadcasting
Well, I can try - but this may go fairly deep; we might need some core-dev expertise there. :-) I assume some of the core (and not so core) developers are frequently thinking about how to improve broadcasting & friends anyway. :-)
Actually sum on Broadcasted objects could easily use the same pairwise algorithm as for arrays.
I thought it did ... but apparently not (yet):
julia> A = vcat([Float32(1E0)], fill(Float32(1E-8), 10^8));
julia> w = Weights(fill(Float32(1.0), length(A)));
julia> reduce(+, broadcast(*, A, values(w)))
1.9999989f0
julia> reduce(+, Base.Broadcast.Broadcasted(*, (A, values(w))))
1.0f0
Oops. But as you say, maybe that's coming soon?
Oops. But as you say, maybe that's coming soon?
At least it's implemented in JuliaLang/julia#31020, so it's quite close.
OK, I've tested that branch, and summing Broadcasted is still slower. It seems that it's due to the fact that IndexStyle(::Broadcasted) = IndexCartesian(). Indeed, with a hack to force them to use IndexLinear, the speed is the same as sum, and reasonably close to dot:
julia> using BenchmarkTools
julia> x = rand(100_000);
julia> y = rand(100_000);
# This isn't correct in general, but OK for this simple case
julia> Base.LinearIndices(bc::Base.Broadcast.Broadcasted{<:Any}) = axes(bc)[1]
julia> @btime sum(x);
16.212 μs (1 allocation: 16 bytes)
julia> @btime Base._mapreduce(identity, Base.add_sum, IndexLinear(), Base.Broadcast.Broadcasted(identity, (x,)));
16.942 μs (3 allocations: 48 bytes)
julia> @btime Base._mapreduce(identity, Base.add_sum, IndexLinear(), Base.Broadcast.Broadcasted(*, (x, y)));
32.016 μs (3 allocations: 64 bytes)
julia> using LinearAlgebra
julia> @btime dot(x, y);
17.722 μs (1 allocation: 16 bytes)
Unfortunately this hack cannot be applied as-is, but it seems worth investigating whether we can ensure that Broadcasted(*, (x, y)) can use IndexLinear when both inputs are also IndexLinear. Then, combined with #31020, we could stop using dot.
The advantage of using broadcast + reduce, compared to a custom implementation, would be that we'll take advantage of future improvements automatically. :-)
Indeed, with a hack to force them to use IndexLinear, the speed is the same as sum, and reasonably close to dot
Nice!!
IndexStyle(::Broadcasted) = IndexCartesian()
I looked for that definition, but I can't find it - is it something implicit?
I tried to run your example, but I get
julia> @btime Base._mapreduce(identity, Base.add_sum, IndexLinear(), Base.Broadcast.Broadcasted(identity, (x,)));
ERROR: MethodError: no method matching _mapreduce(::typeof(identity), ::typeof(Base.add_sum), ::IndexLinear, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(identity),Tuple{Array{Float64,1}}})
Ah, let me guess, I need JuliaLang/julia#31020, of course?
Hm, we can't just override Base.LinearIndices(bc::Base.Broadcast.Broadcasted{<:Any}) though, can we? ;-) And using an internal function like _mapreduce should be avoided if possible, I guess.
But maybe we could use
Base.Broadcast.Broadcasted(*, (A, w))
instead of
Base.Broadcast.Broadcasted(*, (A, values(w)))
and then add a method to IndexStyle for the specific signature of Broadcasted with Weights in it (since StatsBase owns Weights)?
Or maybe it would be possible to do this in a generic way, so that IndexStyle(::Broadcasted{...}) where {...} = IndexLinear() in more cases (e.g. when all args are AbstractVectors)?
Actually after reading more carefully https://github.com/JuliaLang/julia/pull/31020/files#r256266019 I realized it already works, one just needs to call instantiate:
julia> @btime sum(Base.Broadcast.instantiate(Base.Broadcast.Broadcasted(*, (x, y))));
31.200 μs (4 allocations: 96 bytes)
So we just need to wait for the PR to be merged in Julia (hopefully in time for 1.4), and we will be able to stop using dot. I guess if you want you can prepare a PR with a VERSION check, that will be needed anyway to avoid performance regression on older Julia versions.
Sounds great! Regarding the PR, sure - I guess I'll wait until #31020 is merged though, so that CI tests can run.
Actually I've just realized that another way of using pairwise summation would be to create a custom internal AbstractArray type for which getindex returns the product of the corresponding elements of the values and weights arrays. If you want to try that, it would be useful to support current Julia versions until we can simply use broadcasting (which should be strictly equivalent, but with less code).
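A minimal sketch of that idea (with a hypothetical ProductArray name, not actual StatsBase code): a lazy AbstractVector whose getindex multiplies the corresponding elements of the two inputs, so Base's pairwise sum can be reused directly:

```julia
# Lazy elementwise-product vector: getindex computes a[i] * b[i] on the fly.
struct ProductArray{T,A<:AbstractVector,B<:AbstractVector} <: AbstractVector{T}
    a::A
    b::B
end

function ProductArray(a::AbstractVector, b::AbstractVector)
    length(a) == length(b) || throw(DimensionMismatch("arrays must have equal length"))
    T = typeof(zero(eltype(a)) * zero(eltype(b)))
    ProductArray{T,typeof(a),typeof(b)}(a, b)
end

Base.size(p::ProductArray) = size(p.a)
Base.IndexStyle(::Type{<:ProductArray}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(p::ProductArray, i::Int) = p.a[i] * p.b[i]

# sum over the lazy array uses Base's pairwise mapreduce, with no conjugation:
sum(ProductArray([1.0, 2.0, 3.0], [0.5, 0.25, 0.25]))  # 1.75
```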
Oh yes - like a kind of specialized multi-input mapped-array, so to speak. Yes, that would work, I think. I'm a bit overloaded at the moment (some deadlines coming up), but I'll definitely put it on my to-do list.
@oschulz https://github.com/JuliaLang/julia/pull/31020 has just been merged (at last!), so if you can give a try at a PR with if VERSION >= 1.5 that would be great.
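Such a VERSION-gated definition might look roughly like this (a sketch with a hypothetical _wsum helper, not the actual PR):

```julia
using LinearAlgebra

if VERSION >= v"1.5"
    # JuliaLang/julia#31020 available: pairwise sum over a lazy broadcast
    _wsum(v::AbstractArray, w::AbstractVector) =
        sum(Base.Broadcast.instantiate(Base.Broadcast.Broadcasted(*, (v, w))))
else
    # Fall back to dot with weights first, so the values are not conjugated
    _wsum(v::AbstractArray, w::AbstractVector) = dot(w, v)
end
```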
Oh, yes - will do!
Will try to do it end of next week; I'm a bit overloaded until then.