Missings.jl
Missings.jl copied to clipboard
Add `spreadmissings`
Adds behavior described in #121
Implementation is easy enough to follow
function (f::SpreadMissings{F})(xs...) where {F}
@assert all(x -> x isa AbstractVector, xs)
x1 = first(xs)
@assert all(x -> length(x) == length(x1), xs)
if any(x -> x isa AbstractVector{>:Missing}, xs)
nonmissinginds = collect(eachindex(skipmissings(xs...)))
# TODO: narrow the type of the `view` to exclude `Missing`
res = f.f((view(x, nonmissinginds) for x in xs)...)
out = Vector{Union{Missing, eltype(res)}}(missing, length(x1))
out[nonmissinginds] .= res
return out
else
return f.f(xs...)
end
end
I just did some re-factoring. New implementation is
function (f::SpreadMissings{F})(xs...) where {F}
if any(x -> x isa AbstractVector{>:Missing}, xs)
vecs = Base.filter(x -> x isa AbstractVector, xs)
s = skipmissings(vecs...)
vecs_counter = 1
newargs = ntuple(length(xs)) do i
if xs[i] isa AbstractVector
t = s[vecs_counter]
vecs_counter += 1
else
t = xs[i]
end
t
end
res = f.f(newargs...)
if res isa AbstractVector
out = Vector{Union{Missing, eltype(res)}}(missing, length(first(vecs)))
out[collect(eachindex(first(s)))] .= res
else
out = Vector{Union{Missing, typeof(res)}}(missing, length(first(vecs)))
out[collect(eachindex(first(s)))] .= Ref(res)
end
return out
else
return f.f(xs...)
end
end
Miraculously, this infers function infers.
Okay after playing around with this here's what I think an API should be. Consider the call
f(v1::Vector, v2::Vector, s::Scalar)
Recall that skipmissings(x, y) returns a tuple of iterators sx, sy that iterate through x and y where both indices are not missing. We need
skipfun(f)such thatskipfun(f)(v1, v2, s)is
sv1, sv2 = skipmissings(v1, v2)
f(sv1, sv2, x)
It automatically finds the vector values and calls skipmissings on those. Maybe we can dig into the broadcasting infrastructure to make this work more intelligently?
spreadmissings, defined in this PR.spreadmisings(f)(v1, v2, s)is equivalent to
sv1, sv2 = skipmissings(v1, v2)
res = f(sv1, sv2, x)
if res isa AbstractVector
out = Vector{Union{Missing, eltype(res)}}(missing, length(v1))
out[collect(eachindex(sv1))] .= res
else
out = Vector{Union{Missing, typeof(res)}}(missing, length(v1))
out[collect(eachindex(sv1))] .= Ref(res)
end
out
The problem with spreadmissings is that it is restricted to Vectors and it's hard to figure out all the types that really need to be "spread". Consider, for example
function maketable(v1, v2, s)
# stuff which errors on missing values
(n1 = v1 .+ v2 .* s, n2 = v1 .- v2 .* s)
end
If we use spreadmissings logic, we know that this should be a named tuple of vectors each of length v1. Unfortunately we can't just special case a named tuple of vectors to be the kind of thing that "spreads". I mean, maybe we can, but it's easy to see how that kind of logic can get more and more complicated.
We also need
passmissingfalsewhich returnsfalseinstead ofmissing, for use in filtering and&&etc.spreadmissingfalsewhich fills infalseinstead ofmissingin the missing indices.
Maybe it would be more elegant to restrict it to functions for which all arguments are AbstractVector, i.e. @assert all(x -> x isa AbstractVector, xs).
AFAIU, this does not hurt usability in DataFrames since users need to create anonymous function with columns anyway. Same thing for DataFramesMeta.
Thats what I had initially. Then I thought about broadcasting and how it automatically figures out what should be broadcasted. So I updated this PR to emulate that behavior a little.
Another thing is that spreadmissing always returns a Vector, but some functions may return AbstractVectors that are not Vectors (e.g. . cut returns a CategoricalVector)
So it may be better to use in spreadmissing a function like extend_missings which extends an AbstractVector to accept missings, i.e.
function extend_missings(v::AbstractVector{T}, esample::BitVector)
out = Vector{Union{Missing, T}(missing, length(esample))
out[esample] = v
return out
end
This would allow external packages to define specialized methods of extend_missings for their particular type.
That's a good point. Maybe I will try to formalize all of this behavior into a separate package first and see how it goes.
It'd be great to have such a function! It's just hard to find the right abstraction, so it's good to experiment.
Requiring packages to extend extend_missings won't really work as they would all need to add a dependency on Missings, which most won't want to do. Why not just call allowmissing(res), which uses similar and should therefore return the right array type? for the case where a scalar is returned, similar(vecs[1], Union{Missing, typeof(res)}) shouldn't be too bad (at least it will return a CategroicalArray if res isa CategoricalValue).
which uses
similarand should therefore return the right array type? for the case where a scalar is returned,similar(vecs[1], Union{Missing, typeof(res)})shouldn't be too bad (at least it will return aCategroicalArrayifres isa CategoricalValue).
I changed it, but I don't think the implementation is quite correct. In particular, I'm worried about the scalar case. I'm not sure what the importance of the first argument in similar(x, eltype, length) is. What's the purpose of vecs[1] in similar(vec[1], Union{typeof(res), Missing})?
I think it would be totally awesome if @nalimilan or @quinnj took over this PR. It's over my head to deal with all the performance considerations and necessary benchmarking.
If possible maybe we can merge it as an undocumented feature and see if it works?
@nalimilan - please ping me when the first round of comments is resolved. Then I will review. Thank you!
Okay after the merging of @rtransform etc. in DataFramesMeta, this will be my next focus.
just to confirm, do we want this new feature to live in Missings.jl for now? Or do we want me to move all this stuff to DataFramesMeta.
Can you please summarize briefly the consensus how things should work? Thank you!
I think it would be good to have spreadmissings as an internal function in DataFramesMeta for a while until we are sure the API is what we need. Development can continue in this PR so that it can be merged later.
I think you forgot pushing. :-)
I think I pushed, then made comments after, so it looks like I forgot to push but I did. At least, the current state of the PR matches what's on my local computer.
Would love someone to take this over though.
I will have a look at it in April and then try to discuss with @nalimilan how to proceed. Thank you!
I went through the PR and thought a bit about it. I would propose - to make sure we can make some constructive progress on it:
- can we list functions which we think
spreadmissingswould be used with? - My intuition is that kwargs should be left untouched (they are usually passed as supplementary information to functions, eg. in plotting you can get a vector there that you do not want to be touched)
- for now I would assume that all arguments must be
AbstractVector. I know it is restrictive, but this makes sure that anything we add in the future to this function will be non-breaking. Also it seems for me (see point 1. and some discussion above) that this would be the most common use-case.
1. can we list functions which we think
spreadmissingswould be used with?
I think this function is well used with Statistical functions, like cor and cov. I know @nalimilan put a lot of effort into pairwise, but there are still some gaps, in particular using multiple arrays and getting cov to return a scalar rather than a matrix.
It is also useful for plotting functions. There was someone on Discourse yesterday who couldn't use UnicodePlots.jl because their vectors had missing values in them.
2. My intuition is that kwargs should be left untouched (they are usually passed as supplementary information to functions, eg. in plotting you can get a vector there that you do not want to be touched)
I disagree. For example, a weights keyword argument will need to be synced with the positional arguments. Besides, users can always use Ref to make them ignored.
3. for now I would assume that all arguments must be
AbstractVector. I know it is restrictive, but this makes sure that anything we add in the future to this function will be non-breaking. Also it seems for me (see point 1. and some discussion above) that this would be the most common use-case.
Maybe. I would say I'm okay with an initial version that left keyword arguments untouched and required vectors, but that would induce a breaking change if we were to modify keyword arguments too. I'm inclined to use implement current behavior and mark it as experimental.
Another issue is the "spread"-ing of the result. Currently, we always return an array, which would not work for cor and for plotting. I propose a keyword argument to spreadmissing(f; :spread = ...) where spread is
:vector: Spread out missing values if the returned object is anAbstractVectorand don't spread otherwise:none, never spread, this is useful if you have, saymodel_fit(x, y)which returns a Parameter vector which is not the same size asxandy.:all:Spread if there is a vector, and for any non-vector outputs, put them into avectorwithmissings filled in.
I have re-read the docstring:
will find the indices which corresond to
missingvalues in bothxandz. Then applyfon theviews ofxandzwhich contain non-missing values.
Given this I think we can only accept AbstractVector for skipping missing values. The reason is that you require the collection to be lineary indexable with a view and this is essentially AbstractVector.
Second is protection from spreading missing values - Ref is mentioned in your comment, but I think we need a different type for this, as user might pass Ref value and we could not distinguish it.
Finally I do not fully understand the spread argument of spreadmissing. I thought that we did not want spreadmissing to affect the output of the passed function (or am I missing something?). Do you mean that if I do spreadmissing(+) and you pass :vector then doing spreadmissing(+, spread=:vector)([1, 2, missing], [3, missing, 4]) would produce [4, missing, missing] and with :none would produce just [4]?
Given this I think we can only accept
AbstractVectorfor skipping missing values. The reason is that you require the collection to be lineary indexable with aviewand this is essentiallyAbstractVector.
Yes, we only apply the skipping to vectors. But other arguments are left untouched. Let's say there is a function normalize where the author unfortunately used too-strict a type signature
julia> function normalize(
m::Int,
x1::AbstractVector{<:Real},
x2::AbstractVector{<:Real})
min.(m, x1, x2)
end;
we should be able to do
julia> x = [1, 2, missing, 4]; y = [5, 6, 7, missing];
julia> spreadmissings(normalize)(1, x, y)
Note: Currently this doesn't work :( because we don't change the eltype of the view... Maybe @nalimilan knows if there is anything we can do to get around this.
Second is protection from spreading missing values -
Refis mentioned in your comment, but I think we need a different type for this, as user might passRefvalue and we could not distinguish it.
Okay. Not sure the best way to do this, then. defining a new protection type seems like overkill. Is there a standard way to do this?
Finally I do not fully understand the
spreadargument ofspreadmissing. I thought that we did not wantspreadmissingto affect the output of the passed function (or am I missing something?). Do you mean that if I dospreadmissing(+)and you pass:vectorthen doingspreadmissing(+, spread=:vector)([1, 2, missing], [3, missing, 4])would produce[4, missing, missing]and with:nonewould produce just[4]?
Yes, this is the behavior we want. It's really useful in tabular applications. Think of the normalize function above. If we didn't allow spreading, then you wouldn't be able to use it to work with data frames or any other table.
if there is anything we can do to get around this.
You would have to create your custom view type. This is what skipmissing does.
Is there a standard way to do this?
Standard way is to create a new type. The alternative would be to use a macro instead of a function and then we can protect with whatever we like (probably best to be consistent with DataFramesMeta.jl).
if there is anything we can do to get around this.
You would have to create your custom view type. This is what
skipmissingdoes.
I think there should be a way to circumvent the constructor of a SubArray to make the eltype do what we want. I will ask Slack.
I think there should be a way to circumvent the constructor of a SubArray to make the eltype do what we want.
Indeed maybe there would be a workaround. But then you need to decide if the resulting type should have continuous indexing, or like with skipmissing discontinuous indexing.
I want to use this function internally inside spreadmissings (or as something exposed to users if we think it's useful)
julia> using Missings
julia> function unsafe_nonmissingview(x::Vector, inds...)
SubArray{nonmissingtype(eltype(x)), 1, typeof(x), typeof(inds), false}(x, inds, 0, 0)
end;
julia> x = [100, 200, missing, 400];
julia> foo(x::AbstractVector{<:Real}) = 1;
julia> t = unsafe_nonmissingview(x, [1, 2, 4]);
julia> foo(t)
1
It's unsafe because you have to know that your indices don't have any missing values. It would be great to get a Base dev's opinion on this.
I think that for internal purposes it is OK.
Looking at the "spreadmissing problem" from a slightly different angle, this works and is more general in some directions:
julia> using Skipper, Accessors
# vector with missings:
julia> x = [1., 2, missing, 3, missing, 4]
# function that doesn't work with missings:
julia> normalize(x) = x ./ sum(x)
# apply normalize() to skip(ismissing, x), and write its results back:
julia> y = @modify(normalize, x |> skip(ismissing, _))
6-element Vector{Union{Missing, Float64}}:
0.1
0.2
missing
0.3
missing
0.4
If that's of interest, @modify(normalize, skipmissing(x)) can also be made to work in Accessors.
Alternatively, update the x vector inplace:
julia> sx = skip(ismissing, x)
julia> sx .= normalize(sx)
julia> x
6-element Vector{Union{Missing, Float64}}:
0.1
0.2
missing
0.3
missing
0.4
@aplavin - nice. How would it support multiple vectors? (also - if I understand this correctly @pdeffebach wants a function wrapper like passmissing)