Statistics.jl
Statistics.jl copied to clipboard
Statistic.mean(f, A) calls f length(A)+1 times
Was expecting an iterator to only call the function once per element.
"""
Testing mean calls F length+1 times
"""
function meantest()
aa = [1, 3, 2, 7, 11]
bb = [1, 2, 3, 4, 5]
function meanf(a)
b = 1
d = mean(a) do v
r = v+b
println("$b $r $v")
b += 1
return r
end
return d
end
avg = sum(aa .+ bb) / length(aa)
avg2 = meanf(aa)
println(" $avg != $avg2 ")
end
julia> versioninfo() Julia Version 1.5.0-rc2.0 Commit 7f0ee122d7 (2020-07-27 15:24 UTC) Platform Info: OS: macOS (x86_64-apple-darwin19.5.0) CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-9.0.1 (ORCJIT, ivybridge) Environment: JULIA_NUM_THREADS = 8
Right, f
is called twice on the first element because of this line:
https://github.com/JuliaLang/Statistics.jl/blob/b384104d35ff0e7cf311485607b177223ed72b9a/src/Statistics.jl#L176
This behavior is new in Julia 1.5.0 (#25). But I'm not sure it's a problem: AFAIK we don't make any guarantees about how many times f
will be called. Also note that this happens only for arrays, not for general iterators.
FWIW, the previous code was this: https://github.com/JuliaLang/Statistics.jl/blob/a2203d3b67f7413701be5de251622cb85c9cc69d/src/Statistics.jl#L159
I'm not sure we can retain the new promotion behavior and at the same time avoid calling f
twice. Cc: @stevengj @kagalenko-m-b
We could do something like:
x1 = f(first(A)) / 1
result = x1 + sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end])
in the common case where A
is a 1d array and dims = 1
.
That looks reasonable, thanks. It's easy to work around but was a bit surprising. Guess I view an iterator with a visiting function to have an invariant of only calling the function once per element.
Also ran into this as I was surprised why my very expensive function was getting called N+1 times instead of N (which really mattered, since my N was small). Having that special case or some other workaround would be great.
Should be a straightforward PR to implement something like the suggestion above, if someone wants to take a crack at it?
x1 = f(first(A)) / 1
Unclear how to define first() when there's dims
argument.
I wrote:
in the common case where A is a 1d array and dims = 1.
It would be nice to at least fix this simple common case.
I must have overlooked the qualification because it feels unsatisfactory to me.
Interesting rabbit hole...
-
current code behaves like mapreduce with no expectation that f() will be called n times see mapreduce documentation. This was a surprise as I was expecting interation not a map reduce algorithm. Document or use first case of code below.
-
Union missing and abstract float arrays with dims selection seems like a case that should work. mean([1.f0 3.f1 2; 1.0 3 missing],dims=1) issue #7
-
@test mean(x) == sum(x) / length(x)
? shouldn't this be approx floating point compare?
@test (@inferred mean(Int[])) === 0/0
@test (@inferred mean(Float32[])) === 0.f0/0
@test (@inferred mean(Float64[])) === 0/0
seems like
typeof(mean(Float32[]) == typeof(0.f0/0)
is good enough I have no clue why the compiler is returning Any
Julia> @inferred (mean([1.f0 3.f1 2; 1.0 3 missing],dims=1))
ERROR: return type Matrix{Union{Missing, Float64}} does not match inferred return type Any
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] top-level scope
@ REPL[27]:1
julia> mean([1.f0 3.f1 2; 1.0 3 missing],dims=1)
1×3 Matrix{Union{Missing, Float64}}:
1.0 16.5 missing
Recommend case two to assist with issue #7. The first case this bug can be handled with documentation change.
Code: first case handles call f() once per element second case for : mean([1.f0 3.f1 2; 1.0 3 missing],dims=1) third case is original and handles type conversion from non-abstract floats but calls N+1 times
function _mean(f, A::AbstractArray, dims::Dims=:) where Dims
isempty(A) && return sum(f, A, dims=dims)/0
if dims === (:)
x1 = f(first(A)) / 1
n = length(A)
if n == 1
return x1
end
AA = @view A[2:end]
result = x1 + sum(x -> _mean_promote(x1, f(x)), AA, dims=dims)
return result / n
elseif typeof(first(A)) <: AbstractFloat
x1 = convert(typeof(first(A)), 0.f0)
n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims)
return result ./= n
else # note: f() will be called n+1 times
x1 = f(first(A)) / 1
n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims)
return result ./= n
end
end
(PS. please quote code in the future, @stefan911 … I edited your post to fix this.)
Thanks.
Since f() can change the return type it is difficult to remove the third case without unrolling into for loops.
x1 = f(first(A)) / 1
result = x1 + sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end])
That also fails on vectors with one element, so it should be
x1 = f(first(A)) / 1
result = sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end], init=x1)
Created PR
I think Base.first() should be amended to accept dims
argument. On the other hand, in multidimensional case performance impact of an extra call to f() is proportionally smaller.
Opinion: The general case of an iterator is to apply the function exactly once per element, to do otherwise should be documented in the help. foldr, foldl vs mapreduce. The expectation that f() is a pure function shouldn't be implied.
Did issue #7 get addressed as well?
As said elsewhere, it would be nice if Julia would develop a standard design pattern for this. In accumulation where there is no sensible typestable default value, you end up splitting off the first iteration, duplicate the entire loop body, and make a tiny change to it (eg += becomes =) and continue with a loop over the rest of the iterator.
Actually I think I've found a super simple solution which works in general and luckily doesn't reduce performance. See https://github.com/JuliaStats/Statistics.jl/pull/151.
EDIT: that's more complex than I thought. I don't have a good solution.
Not sure if it helps you but DNF gave me me a quirky snipped which initialises an accumulator in the loop in a way Julia can handle
function myfsum(f, xs)
isfirst = true
local y
for x in xs
if isfirst
isfirst, y = false, f(x)
else
y += f(x)
end
end
return y
end
julia> @time sum(sin, 1:10000)
0.000155 seconds
1.633891021792461
julia> @time myfsum(sin, 1:10000)
0.000130 seconds
1.6338910217924516
Unfortunately the difficulty comes from the fact that we do not control the loop as we want to use pairwise summation just like sum
. So we have to hack the mapreduce
machinery (or copy it but that's a lot of code duplication).