Missings.jl icon indicating copy to clipboard operation
Missings.jl copied to clipboard

Add `spreadmissings`

Open pdeffebach opened this issue 5 years ago • 69 comments

Adds behavior described in #121

Implementation is easy enough to follow

function (f::SpreadMissings{F})(xs...) where {F}
    @assert all(x -> x isa AbstractVector, xs)

    x1 = first(xs)
    @assert all(x -> length(x) == length(x1), xs)

    if any(x -> x isa AbstractVector{>:Missing}, xs)
        nonmissinginds = collect(eachindex(skipmissings(xs...)))
        # TODO: narrow the type of the `view` to exclude `Missing`
        res = f.f((view(x, nonmissinginds) for x in xs)...)

        out = Vector{Union{Missing, eltype(res)}}(missing, length(x1))
        out[nonmissinginds] .= res

        return out
    else
        return f.f(xs...)
    end
end

pdeffebach avatar Oct 31 '20 17:10 pdeffebach

I just did some re-factoring. New implementation is

function (f::SpreadMissings{F})(xs...) where {F}
    if any(x -> x isa AbstractVector{>:Missing}, xs)
        vecs = Base.filter(x -> x isa AbstractVector, xs)
        s = skipmissings(vecs...)
        vecs_counter = 1
        newargs = ntuple(length(xs)) do i
            if xs[i] isa AbstractVector
                t = s[vecs_counter]
                vecs_counter += 1
            else
                t = xs[i]
            end
            t
        end

        res = f.f(newargs...)

        if res isa AbstractVector
            out = Vector{Union{Missing, eltype(res)}}(missing, length(first(vecs)))
            out[collect(eachindex(first(s)))] .= res
        else
            out = Vector{Union{Missing, typeof(res)}}(missing, length(first(vecs)))
            out[collect(eachindex(first(s)))] .= Ref(res)
        end

        return out
    else
        return f.f(xs...)
    end
end

Miraculously, this infers function infers.

pdeffebach avatar Nov 05 '20 01:11 pdeffebach

Okay after playing around with this here's what I think an API should be. Consider the call

f(v1::Vector, v2::Vector, s::Scalar)

Recall that skipmissings(x, y) returns a tuple of iterators sx, sy that iterate through x and y where both indices are not missing. We need

  1. skipfun(f) such that skipfun(f)(v1, v2, s) is
sv1, sv2 = skipmissings(v1, v2)
f(sv1, sv2, x)

It automatically finds the vector values and calls skipmissings on those. Maybe we can dig into the broadcasting infrastructure to make this work more intelligently?

  1. spreadmissings, defined in this PR. spreadmisings(f)(v1, v2, s) is equivalent to
sv1, sv2 = skipmissings(v1, v2)
res = f(sv1, sv2, x)
if res isa AbstractVector
    out = Vector{Union{Missing, eltype(res)}}(missing, length(v1))
    out[collect(eachindex(sv1))] .= res
else
    out = Vector{Union{Missing, typeof(res)}}(missing, length(v1))
    out[collect(eachindex(sv1))] .= Ref(res)
end
out

The problem with spreadmissings is that it is restricted to Vectors and it's hard to figure out all the types that really need to be "spread". Consider, for example

function maketable(v1, v2, s)
    # stuff which errors on missing values
    (n1 = v1 .+ v2 .* s, n2 = v1 .- v2 .* s)
end

If we use spreadmissings logic, we know that this should be a named tuple of vectors each of length v1. Unfortunately we can't just special case a named tuple of vectors to be the kind of thing that "spreads". I mean, maybe we can, but it's easy to see how that kind of logic can get more and more complicated.

pdeffebach avatar Nov 05 '20 02:11 pdeffebach

We also need

  1. passmissingfalse which returns false instead of missing, for use in filtering and && etc.
  2. spreadmissingfalse which fills in false instead of missing in the missing indices.

pdeffebach avatar Nov 05 '20 20:11 pdeffebach

Maybe it would be more elegant to restrict it to functions for which all arguments are AbstractVector, i.e. @assert all(x -> x isa AbstractVector, xs). AFAIU, this does not hurt usability in DataFrames since users need to create anonymous function with columns anyway. Same thing for DataFramesMeta.

matthieugomez avatar Nov 07 '20 18:11 matthieugomez

Thats what I had initially. Then I thought about broadcasting and how it automatically figures out what should be broadcasted. So I updated this PR to emulate that behavior a little.

pdeffebach avatar Nov 07 '20 18:11 pdeffebach

Another thing is that spreadmissing always returns a Vector, but some functions may return AbstractVectors that are not Vectors (e.g. . cut returns a CategoricalVector)

So it may be better to use in spreadmissing a function like extend_missings which extends an AbstractVector to accept missings, i.e.

function extend_missings(v::AbstractVector{T}, esample::BitVector)
   out = Vector{Union{Missing, T}(missing, length(esample))
   out[esample] = v
   return out
end

This would allow external packages to define specialized methods of extend_missings for their particular type.

matthieugomez avatar Nov 07 '20 18:11 matthieugomez

That's a good point. Maybe I will try to formalize all of this behavior into a separate package first and see how it goes.

pdeffebach avatar Nov 07 '20 19:11 pdeffebach

It'd be great to have such a function! It's just hard to find the right abstraction, so it's good to experiment.

matthieugomez avatar Nov 07 '20 19:11 matthieugomez

Requiring packages to extend extend_missings won't really work as they would all need to add a dependency on Missings, which most won't want to do. Why not just call allowmissing(res), which uses similar and should therefore return the right array type? for the case where a scalar is returned, similar(vecs[1], Union{Missing, typeof(res)}) shouldn't be too bad (at least it will return a CategroicalArray if res isa CategoricalValue).

nalimilan avatar Nov 09 '20 09:11 nalimilan

which uses similar and should therefore return the right array type? for the case where a scalar is returned, similar(vecs[1], Union{Missing, typeof(res)}) shouldn't be too bad (at least it will return a CategroicalArray if res isa CategoricalValue).

I changed it, but I don't think the implementation is quite correct. In particular, I'm worried about the scalar case. I'm not sure what the importance of the first argument in similar(x, eltype, length) is. What's the purpose of vecs[1] in similar(vec[1], Union{typeof(res), Missing})?

pdeffebach avatar Nov 30 '20 01:11 pdeffebach

I think it would be totally awesome if @nalimilan or @quinnj took over this PR. It's over my head to deal with all the performance considerations and necessary benchmarking.

If possible maybe we can merge it as an undocumented feature and see if it works?

pdeffebach avatar Apr 25 '21 14:04 pdeffebach

@nalimilan - please ping me when the first round of comments is resolved. Then I will review. Thank you!

bkamins avatar Jun 28 '21 07:06 bkamins

Okay after the merging of @rtransform etc. in DataFramesMeta, this will be my next focus.

just to confirm, do we want this new feature to live in Missings.jl for now? Or do we want me to move all this stuff to DataFramesMeta.

pdeffebach avatar Jul 25 '21 14:07 pdeffebach

Can you please summarize briefly the consensus how things should work? Thank you!

bkamins avatar Jul 25 '21 18:07 bkamins

I think it would be good to have spreadmissings as an internal function in DataFramesMeta for a while until we are sure the API is what we need. Development can continue in this PR so that it can be merged later.

nalimilan avatar Jul 25 '21 19:07 nalimilan

I think you forgot pushing. :-)

nalimilan avatar Dec 10 '21 08:12 nalimilan

I think I pushed, then made comments after, so it looks like I forgot to push but I did. At least, the current state of the PR matches what's on my local computer.

Would love someone to take this over though.

pdeffebach avatar Apr 06 '22 13:04 pdeffebach

I will have a look at it in April and then try to discuss with @nalimilan how to proceed. Thank you!

bkamins avatar Apr 06 '22 21:04 bkamins

I went through the PR and thought a bit about it. I would propose - to make sure we can make some constructive progress on it:

  1. can we list functions which we think spreadmissings would be used with?
  2. My intuition is that kwargs should be left untouched (they are usually passed as supplementary information to functions, eg. in plotting you can get a vector there that you do not want to be touched)
  3. for now I would assume that all arguments must be AbstractVector. I know it is restrictive, but this makes sure that anything we add in the future to this function will be non-breaking. Also it seems for me (see point 1. and some discussion above) that this would be the most common use-case.

bkamins avatar Jun 15 '22 21:06 bkamins

1. can we list functions which we think spreadmissings would be used with?

I think this function is well used with Statistical functions, like cor and cov. I know @nalimilan put a lot of effort into pairwise, but there are still some gaps, in particular using multiple arrays and getting cov to return a scalar rather than a matrix.

It is also useful for plotting functions. There was someone on Discourse yesterday who couldn't use UnicodePlots.jl because their vectors had missing values in them.

2. My intuition is that kwargs should be left untouched (they are usually passed as supplementary information to functions, eg. in plotting you can get a vector there that you do not want to be touched)

I disagree. For example, a weights keyword argument will need to be synced with the positional arguments. Besides, users can always use Ref to make them ignored.

3. for now I would assume that all arguments must be AbstractVector. I know it is restrictive, but this makes sure that anything we add in the future to this function will be non-breaking. Also it seems for me (see point 1. and some discussion above) that this would be the most common use-case.

Maybe. I would say I'm okay with an initial version that left keyword arguments untouched and required vectors, but that would induce a breaking change if we were to modify keyword arguments too. I'm inclined to use implement current behavior and mark it as experimental.

Another issue is the "spread"-ing of the result. Currently, we always return an array, which would not work for cor and for plotting. I propose a keyword argument to spreadmissing(f; :spread = ...) where spread is

  • :vector: Spread out missing values if the returned object is an AbstractVector and don't spread otherwise
  • :none, never spread, this is useful if you have, say model_fit(x, y) which returns a Parameter vector which is not the same size as x and y.
  • :all: Spread if there is a vector, and for any non-vector outputs, put them into a vector with missings filled in.

pdeffebach avatar Feb 09 '23 20:02 pdeffebach

I have re-read the docstring:

will find the indices which corresond to missing values in both x and z. Then apply f on the views of x and z which contain non-missing values.

Given this I think we can only accept AbstractVector for skipping missing values. The reason is that you require the collection to be lineary indexable with a view and this is essentially AbstractVector.

Second is protection from spreading missing values - Ref is mentioned in your comment, but I think we need a different type for this, as user might pass Ref value and we could not distinguish it.

Finally I do not fully understand the spread argument of spreadmissing. I thought that we did not want spreadmissing to affect the output of the passed function (or am I missing something?). Do you mean that if I do spreadmissing(+) and you pass :vector then doing spreadmissing(+, spread=:vector)([1, 2, missing], [3, missing, 4]) would produce [4, missing, missing] and with :none would produce just [4]?

bkamins avatar Feb 11 '23 10:02 bkamins

Given this I think we can only accept AbstractVector for skipping missing values. The reason is that you require the collection to be lineary indexable with a view and this is essentially AbstractVector.

Yes, we only apply the skipping to vectors. But other arguments are left untouched. Let's say there is a function normalize where the author unfortunately used too-strict a type signature

julia> function normalize(
           m::Int, 
           x1::AbstractVector{<:Real}, 
           x2::AbstractVector{<:Real})
           
           min.(m, x1, x2)
       end;

we should be able to do

julia> x = [1, 2, missing, 4]; y = [5, 6, 7, missing];

julia> spreadmissings(normalize)(1, x, y)

Note: Currently this doesn't work :( because we don't change the eltype of the view... Maybe @nalimilan knows if there is anything we can do to get around this.

Second is protection from spreading missing values - Ref is mentioned in your comment, but I think we need a different type for this, as user might pass Ref value and we could not distinguish it.

Okay. Not sure the best way to do this, then. defining a new protection type seems like overkill. Is there a standard way to do this?

Finally I do not fully understand the spread argument of spreadmissing. I thought that we did not want spreadmissing to affect the output of the passed function (or am I missing something?). Do you mean that if I do spreadmissing(+) and you pass :vector then doing spreadmissing(+, spread=:vector)([1, 2, missing], [3, missing, 4]) would produce [4, missing, missing] and with :none would produce just [4]?

Yes, this is the behavior we want. It's really useful in tabular applications. Think of the normalize function above. If we didn't allow spreading, then you wouldn't be able to use it to work with data frames or any other table.

pdeffebach avatar Feb 12 '23 17:02 pdeffebach

if there is anything we can do to get around this.

You would have to create your custom view type. This is what skipmissing does.

bkamins avatar Feb 12 '23 18:02 bkamins

Is there a standard way to do this?

Standard way is to create a new type. The alternative would be to use a macro instead of a function and then we can protect with whatever we like (probably best to be consistent with DataFramesMeta.jl).

bkamins avatar Feb 12 '23 18:02 bkamins

if there is anything we can do to get around this.

You would have to create your custom view type. This is what skipmissing does.

I think there should be a way to circumvent the constructor of a SubArray to make the eltype do what we want. I will ask Slack.

pdeffebach avatar Feb 12 '23 18:02 pdeffebach

I think there should be a way to circumvent the constructor of a SubArray to make the eltype do what we want.

Indeed maybe there would be a workaround. But then you need to decide if the resulting type should have continuous indexing, or like with skipmissing discontinuous indexing.

bkamins avatar Feb 12 '23 18:02 bkamins

I want to use this function internally inside spreadmissings (or as something exposed to users if we think it's useful)

julia> using Missings

julia> function unsafe_nonmissingview(x::Vector, inds...)
           SubArray{nonmissingtype(eltype(x)), 1, typeof(x), typeof(inds), false}(x, inds, 0, 0)
       end;

julia> x = [100, 200, missing, 400];

julia> foo(x::AbstractVector{<:Real}) = 1;

julia> t = unsafe_nonmissingview(x, [1, 2, 4]);

julia> foo(t)
1

It's unsafe because you have to know that your indices don't have any missing values. It would be great to get a Base dev's opinion on this.

pdeffebach avatar Feb 13 '23 01:02 pdeffebach

I think that for internal purposes it is OK.

bkamins avatar Feb 13 '23 08:02 bkamins

Looking at the "spreadmissing problem" from a slightly different angle, this works and is more general in some directions:

julia> using Skipper, Accessors

# vector with missings:
julia> x = [1., 2, missing, 3, missing, 4]

# function that doesn't work with missings:
julia> normalize(x) = x ./ sum(x)

# apply normalize() to skip(ismissing, x), and write its results back:
julia> y = @modify(normalize, x |> skip(ismissing, _))
6-element Vector{Union{Missing, Float64}}:
 0.1
 0.2
  missing
 0.3
  missing
 0.4

If that's of interest, @modify(normalize, skipmissing(x)) can also be made to work in Accessors.

Alternatively, update the x vector inplace:

julia> sx = skip(ismissing, x)
julia> sx .= normalize(sx)
julia> x
6-element Vector{Union{Missing, Float64}}:
 0.1
 0.2
  missing
 0.3
  missing
 0.4

aplavin avatar Feb 13 '23 08:02 aplavin

@aplavin - nice. How would it support multiple vectors? (also - if I understand this correctly @pdeffebach wants a function wrapper like passmissing)

bkamins avatar Feb 13 '23 08:02 bkamins