SplitApplyCombine.jl
SplitApplyCombine.jl copied to clipboard
Not sure how best to handle `missing`
I'm not actually sure where the right place on the stack is to fix this, because it seems to cut across several layers.
Here's an example - Say I want to get the mean of each group, ignoring missings. Notice that for the foo=2
group, both elements are missing
.
using SplitApplyCombine
table = [(foo=1, bar=rand()),
(foo=2, bar=missing),
(foo=3, bar=rand()),
(foo=1, bar=missing),
(foo=2, bar=missing),
(foo=3, bar=rand())]
map(mean ∘ skipmissing, group(r->r.foo, r->r.bar, table))
This throws the error MethodError: no method matching zero(::Type{Any})
This is the result of a cascade of things, most of which seem pretty reasonable in isolation, which is why it's not clear (to me anyways) what the right fix is
-
mean
doesn't know how to handle an empty arrayAny[]
. I don't think there's anything more reasonable formean
to do here -
table
doesn't have usful type information (see https://github.com/JuliaLang/julia/issues/31077) -
group
seems to set the type of the dictionary elements based on theeltype
oftable
.
I'm not sure if there's a good resolution to this. Even if group
built up the groups iteratively rather than pre-allocating, for a group with only missings it would end up with an Array{Missing}
, which still doesn't help mean
figure out what a reasonable answer is.
My current workaround is to re-inject the type information, but it took some digging to figure out what the actual problem was, and is not pretty:
map(c->mean(Vector{Float64}(collect(skipmissing(c)))), group(r->r.foo, r->r.bar, table))
Another workaround is setting the type of table
explicitly:
table = NamedTuple{(:foo, :bar), Tuple{Int64, Union{Missing,Float64}}}[
(foo=1, bar=rand()),
...
But that gets pretty verbose.
Any thoughts as the the best way to handle this?
Yes - you are completely right. This is difficult. And the difficulty arises from the interaction of lots of little different aspects of the system, from the types, inference, standard library/Base.missing
, not paying special attention to missing
in SplitApplyCombine
, lack of composiblity Statistics.mean
when you want to skip missing values, etc.
If you are looking for a quick solution, you can try:
julia> (((count, sum),) -> sum / count).(groupreduce(r->r.foo, r->coalesce(r.bar, 0.0), ((count, sum), bar) -> (count + 1, sum + bar), table, init = (0, 0.0)))
3-element Dictionaries.HashDictionary{Any,Any}
2 │ 0.0
3 │ 0.567196074236143
1 │ 0.04591996082297756
It's not particularly elegant though! And it's still slow due to type instability of table
. You might be able to fix that with:
julia> table = NamedTuple{(:foo, :bar), Tuple{Int, Union{Missing, Float64}}}[(foo=1, bar=rand()),
(foo=2, bar=missing),
(foo=3, bar=rand()),
(foo=1, bar=missing),
(foo=2, bar=missing),
(foo=3, bar=rand())]
6-element Array{NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}},1}:
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((1, 0.07490415420165197))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((2, missing))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((3, 0.1651018454906743))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((1, missing))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((2, missing))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((3, 0.23729296064354855))
but that's not reasonable and it's very much not recommended for large named tuples of with more than one or two Union{T, missing}
fields.
Any thoughts as the the best way to handle this?
I'm not certain the best way we can improve this on the library side.
Well, I suppose that SplitApplyCombine
functions could implement type widening when growing collections. group
and so-on could have an in-built skipmissing
flag. I dunno - there's still multiple challenges outside of SplitApplyCombine
that could be resolved, too, so that this can remain a simple, composable library.
Well, I suppose that
SplitApplyCombine
functions could implement type widening when growing collections.
I think it should be easy once you have mutate-or-widen interface for Dictionaries.jl. Here is an example with plain Dict
:
using BangBang
using BangBang.Experimental: modify!!
using BangBang.NoBang: SingletonVector
using InitialValues
using InitialValues: InitialValue
function groupreduce_bb(by, f, op, itr; init = Init(op))
acc = foldl(itr; init = Dict{Union{},Union{}}()) do acc, x
acc, _ = modify!!(acc, by(x)) do iacc
Some(op(something(iacc, init), f(x)))
end
return acc
end
if init isa InitialValue && InitialValue <: valtype(acc)
return Dict(k => v for (k, v) in acc if v !== init)
else
return acc
end
end
group_bb(by, f, itr) =
groupreduce_bb(by, x -> SingletonVector(f(x)), append!!, table; init = Init(append!!))
julia> group_bb(r->r.foo, r->r.bar, table)
Dict{Int64,AbstractArray{T,1} where T} with 3 entries:
2 => [missing, missing]
3 => [0.0756208, 0.745847]
1 => Union{Missing, Float64}[0.37734, missing]
(OK, "easy" in the sense it's possible now that I fixed a bug https://github.com/tkf/BangBang.jl/pull/145)
Transducers.jl uses a similar strategy but in a more thread-friendly way.
group
and so-on could have an in-builtskipmissing
flag.
I don't think adding a flag is the best strategy in terms of composability. I think this is where you really need transducers. SplitApplyCombine.groupreduce
kind of already works with Transducers.jl but it's a bit tricky to do this for now:
julia> using SplitApplyCombine
using Transducers
using OnlineStats: Mean
julia> rf0 = reducingfunction(Mean())
rf = reducingfunction(NotA(Missing), rf0)
groupreduce(r->r.foo, r->r.bar, rf, table; init = Init(rf0))
3-element Dictionaries.HashDictionary{Any,Union{InitialValues.InitialValueOf{Transducers.OnlineStatReducingFunction{Mean{Float64,OnlineStatsBase.EqualWeight}}}, Mean{Float64,OnlineStatsBase.EqualWeight}}}
2 │ Init(::Transducers.OnlineStatReducingFunction{Mean{Float64,OnlineStatsBase.EqualWeight}})
3 │ Mean: n=2 | value=0.410734
1 │ Mean: n=1 | value=0.37734
I think it should be possible to make it work more easily, without adding Transducers.jl-aware code in SplitApplyCombine.jl, if groupreduce
is implemented with the widening strategy. The integration can be much smoother if you use InitialValues.jl, though.