Statistics.jl
Statistics.jl copied to clipboard
Mean of an array with missing values does not work if the dims argument is provided
julia> using Statistics
julia> x = [(i==3 && j==3) ? missing : i*j for i in 1:3, j in 1:4]
3×4 Array{Union{Missing, Int64},2}:
1 2 3 4
2 4 6 8
3 6 missing 12
julia> mean(x)
missing
julia> mean(x; dims=1)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
# SNIP
julia> mean(x; dims=2)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
# SNIP
Expected is:
julia> mean(x; dims=1)
1×4 Array{Union{Missing,Float64},2}:
2.0 4.0 missing 8.0
julia> mean(x; dims=2)
3×1 Array{Union{Missing,Float64},2}:
2.5
5.0
missing
(credit to @nickrobinson251 who found this)
It appears to be because of this: https://github.com/JuliaLang/julia/blob/master/stdlib/Statistics/src/Statistics.jl#L134
julia> x = [(i == j == 3) ? missing : i * j for i in 1:3, j in 1:4]
3×4 Array{Union{Missing, Int64},2}:
1 2 3 4
2 4 6 8
3 6 missing 12
julia> Base.reducedim_init(t->t/2, +, x, 1)
1×4 Array{Float64,2}:
0.0 0.0 0.0 0.0
That gets passed to mean!, which tries to store the result in that array, but it can't because of the missing.
Going further, we get that result from reducedim_init because of this line: https://github.com/JuliaLang/julia/blob/master/base/reducedim.jl#L117, since zero(Union{Missing,Int}) === 0.
sum handles this just fine, so I think we should change mean to go though more of the sum machinery.
julia> using Statistics
julia> a = [1 2;
missing 3]
2×2 Array{Union{Missing, Int64},2}:
1 2
missing 3
julia> sum(skipmissing(a), dims=1)
ERROR: MethodError: no method matching sum(::Base.SkipMissing{Array{Union{Missing, Int64},2}}; dims=1)
Closest candidates are:
sum(::Any) at reduce.jl:503 got unsupported keyword argument "dims"
sum(::Any, ::AbstractArray; dims) at reducedim.jl:653
sum(::Any, ::Any) at reduce.jl:486 got unsupported keyword argument "dims"
...
Stacktrace:
[1] top-level scope at REPL[35]:1
sum also doesn't work now?
@Moelf sum(skipmissing(...), dims=...) has never worked. https://github.com/JuliaLang/julia/pull/28027/ implements it but it's not been merged (yet?).
This failure is also interesting:
julia> mean([missing 1; 3 4]; dims=1)
ERROR: InexactError: Int64(2.5)
It happens because we use the type of first(x)/1 to choose the eltype of the result, which in this case is Missing.
sumhandles this just fine, so I think we should changemeanto go though more of thesummachinery.
Actually, no, sum only handles this correctly in cases where Base.add_sum returns the same type as its inputs. But it fails when the type differs (which is precisely what mean does for integer inputs):
julia> sum([missing true; false false], dims=1)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
The only way to fix this is to redesign the reducedim code, as _reducedim_init is just a hack when the type isn't concrete . AFAICT the only 100% correct solution is to do as with mapand broadcast: allocate an array using the type of the first element, and widen its eltype if necessary.
Any news/updates/new ideas on this?