Statistics.jl icon indicating copy to clipboard operation
Statistics.jl copied to clipboard

Statistic.mean(f, A) calls f length(A)+1 times

Open stefan911 opened this issue 4 years ago • 19 comments

Was expecting an iterator to only call the function once per element.

"""
Testing mean calls F length+1 times
"""
function meantest()
    aa = [1, 3, 2, 7, 11]
    bb = [1, 2, 3, 4, 5]


    function meanf(a)
        b = 1
        d = mean(a) do v
            r = v+b
            println("$b $r $v")
            b += 1
            return r
        end
        return d
    end

    avg = sum(aa .+ bb) / length(aa)
    avg2 = meanf(aa)
    println(" $avg != $avg2 ")
end


julia> versioninfo() Julia Version 1.5.0-rc2.0 Commit 7f0ee122d7 (2020-07-27 15:24 UTC) Platform Info: OS: macOS (x86_64-apple-darwin19.5.0) CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-9.0.1 (ORCJIT, ivybridge) Environment: JULIA_NUM_THREADS = 8

stefan911 avatar Aug 05 '20 16:08 stefan911

Right, f is called twice on the first element because of this line: https://github.com/JuliaLang/Statistics.jl/blob/b384104d35ff0e7cf311485607b177223ed72b9a/src/Statistics.jl#L176

This behavior is new in Julia 1.5.0 (#25). But I'm not sure it's a problem: AFAIK we don't make any guarantees about how many times f will be called. Also note that this happens only for arrays, not for general iterators.

FWIW, the previous code was this: https://github.com/JuliaLang/Statistics.jl/blob/a2203d3b67f7413701be5de251622cb85c9cc69d/src/Statistics.jl#L159

I'm not sure we can retain the new promotion behavior and at the same time avoid calling f twice. Cc: @stevengj @kagalenko-m-b

nalimilan avatar Aug 09 '20 11:08 nalimilan

We could do something like:

x1 = f(first(A)) / 1
result = x1 + sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end])

in the common case where A is a 1d array and dims = 1.

stevengj avatar Aug 09 '20 12:08 stevengj

That looks reasonable, thanks. It's easy to work around but was a bit surprising. Guess I view an iterator with a visiting function to have an invariant of only calling the function once per element.

stefan911 avatar Aug 12 '20 19:08 stefan911

Also ran into this as I was surprised why my very expensive function was getting called N+1 times instead of N (which really mattered, since my N was small). Having that special case or some other workaround would be great.

marius311 avatar Jul 04 '21 02:07 marius311

Should be a straightforward PR to implement something like the suggestion above, if someone wants to take a crack at it?

stevengj avatar Jul 05 '21 16:07 stevengj

x1 = f(first(A)) / 1

Unclear how to define first() when there's dims argument.

kagalenko-m-b avatar Jul 06 '21 16:07 kagalenko-m-b

I wrote:

in the common case where A is a 1d array and dims = 1.

It would be nice to at least fix this simple common case.

stevengj avatar Jul 06 '21 17:07 stevengj

I must have overlooked the qualification because it feels unsatisfactory to me.

kagalenko-m-b avatar Jul 06 '21 17:07 kagalenko-m-b

Interesting rabbit hole...

  1. current code behaves like mapreduce with no expectation that f() will be called n times see mapreduce documentation. This was a surprise as I was expecting interation not a map reduce algorithm. Document or use first case of code below.

  2. Union missing and abstract float arrays with dims selection seems like a case that should work. mean([1.f0 3.f1 2; 1.0 3 missing],dims=1) issue #7

  3. @test mean(x) == sum(x) / length(x) ? shouldn't this be approx floating point compare?

@test (@inferred mean(Int[])) === 0/0
    @test (@inferred mean(Float32[])) === 0.f0/0    
    @test (@inferred mean(Float64[])) === 0/0

seems like

typeof(mean(Float32[]) == typeof(0.f0/0)

is good enough I have no clue why the compiler is returning Any

Julia> @inferred (mean([1.f0 3.f1 2; 1.0 3 missing],dims=1))
ERROR: return type Matrix{Union{Missing, Float64}} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[27]:1

julia> mean([1.f0 3.f1 2; 1.0 3 missing],dims=1)
1×3 Matrix{Union{Missing, Float64}}:
 1.0  16.5  missing

Recommend case two to assist with issue #7. The first case this bug can be handled with documentation change.

Code: first case handles call f() once per element second case for : mean([1.f0 3.f1 2; 1.0 3 missing],dims=1) third case is original and handles type conversion from non-abstract floats but calls N+1 times

function _mean(f, A::AbstractArray, dims::Dims=:) where Dims
    isempty(A) && return sum(f, A, dims=dims)/0

    if dims === (:)
        x1 = f(first(A)) / 1
        n = length(A)
        if n == 1
            return x1
        end
        AA = @view A[2:end]
        result = x1 + sum(x -> _mean_promote(x1, f(x)), AA, dims=dims)
        return result / n
    elseif typeof(first(A)) <: AbstractFloat
        x1 = convert(typeof(first(A)), 0.f0)
        n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
        result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims)
        return result ./= n
    else # note: f() will be called n+1 times
        x1 = f(first(A)) / 1
        n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
        result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims)
        return result ./= n
    end
end

stefan911 avatar Jul 06 '21 18:07 stefan911

(PS. please quote code in the future, @stefan911 … I edited your post to fix this.)

stevengj avatar Jul 06 '21 18:07 stevengj

Thanks.

Since f() can change the return type it is difficult to remove the third case without unrolling into for loops.

stefan911 avatar Jul 06 '21 18:07 stefan911

x1 = f(first(A)) / 1 result = x1 + sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end])

That also fails on vectors with one element, so it should be

 x1 = f(first(A)) / 1
result = sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end], init=x1)

kagalenko-m-b avatar Jul 06 '21 19:07 kagalenko-m-b

Created PR

I think Base.first() should be amended to accept dims argument. On the other hand, in multidimensional case performance impact of an extra call to f() is proportionally smaller.

kagalenko-m-b avatar Jul 07 '21 10:07 kagalenko-m-b

Opinion: The general case of an iterator is to apply the function exactly once per element, to do otherwise should be documented in the help. foldr, foldl vs mapreduce. The expectation that f() is a pure function shouldn't be implied.

Did issue #7 get addressed as well?

stefan911 avatar Jul 07 '21 13:07 stefan911

Did issue #7 get addressed as well?

That probably should be fixed in Base.

kagalenko-m-b avatar Jul 07 '21 13:07 kagalenko-m-b

As said elsewhere, it would be nice if Julia would develop a standard design pattern for this. In accumulation where there is no sensible typestable default value, you end up splitting off the first iteration, duplicate the entire loop body, and make a tiny change to it (eg += becomes =) and continue with a loop over the rest of the iterator.

mschauer avatar Aug 11 '23 07:08 mschauer

Actually I think I've found a super simple solution which works in general and luckily doesn't reduce performance. See https://github.com/JuliaStats/Statistics.jl/pull/151.

EDIT: that's more complex than I thought. I don't have a good solution.

nalimilan avatar Sep 09 '23 13:09 nalimilan

Not sure if it helps you but DNF gave me me a quirky snipped which initialises an accumulator in the loop in a way Julia can handle

function myfsum(f, xs)
   isfirst = true
   local y
   for x in xs
        if isfirst
            isfirst, y = false, f(x)
        else
            y += f(x)
        end
    end
    return y
end
julia> @time sum(sin, 1:10000)
  0.000155 seconds
1.633891021792461

julia> @time myfsum(sin, 1:10000)
  0.000130 seconds
1.6338910217924516

mschauer avatar Sep 09 '23 15:09 mschauer

Unfortunately the difficulty comes from the fact that we do not control the loop as we want to use pairwise summation just like sum. So we have to hack the mapreduce machinery (or copy it but that's a lot of code duplication).

nalimilan avatar Sep 09 '23 15:09 nalimilan