DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Add a skipmissing kwarg to select/transform/combine

Open matthieugomez opened this issue 4 years ago • 19 comments

I was thinking that it would be nice to have a skipmissing keyword argument in transform/select/combine

The idea is that it would only use rows where none of the observations for the variables used in the call are missing. (It is possible because user must already specify which variables are used in a call to transform/select/combine).

That would make it easier to skip missing without, and, in particular, to handle the case of several variables (or weights).

transform(df, [:x, :y] => (x,y) -> corr(x,y), skipmissing = true)

matthieugomez avatar May 15 '20 13:05 matthieugomez

This would essentially be a shorthand for transform!(dropmissing(df, cols), ...).

The benefit would be that you would have automatically resolved cols argument to dropmissing.

The problem is that it would break the invariant of transform that we do not change the number of rows in the returned value.

Given the functionality is easy enough to be obtained via dropmissing I am not 100% sure if it is worth to add it. But let us wait for other users to comment on this.

bkamins avatar May 15 '20 13:05 bkamins

In my mind it would not change the number of rows. It would just compute the function using rows where none of the cols is missing (and return a missing value for rows where one of the cols is missing).

matthieugomez avatar May 15 '20 13:05 matthieugomez

Something like

function wrapper(fun, x, y)
    sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
    sub_out = fun(sx, sy)
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases

    return full_out
end

Note that currently corr(sx, sy) doesn't work... see here.

pdeffebach avatar May 15 '20 16:05 pdeffebach

Yes I guess this would be quite convenient. In practice most people who work with data would almost never have to use the skipmissing function directly.

But to be efficient, the implementation cannot always collect the result. The only way to be really fast in all cases is to wrap the columns in skipmissing or skipmissings (depending on the number of columns). So for cor to work we still need something like https://github.com/JuliaLang/Statistics.jl/pull/34 (which collects the result because it's relatively fast, but that's not the case for all functions).

nalimilan avatar May 18 '20 10:05 nalimilan

Alternatively, it could apply the function on a view of columns no?

matthieugomez avatar May 18 '20 13:05 matthieugomez

No, since you would need to go over all values first just to find out whether they are missing or not. That's quite costly if all you want to do is e.g. compute the mean. That's why we have skipmissing/skipmissings: it's basically a lazy view.

nalimilan avatar May 18 '20 15:05 nalimilan

I just looked over the select and transform code. Thanks to @bkamins's hard work, this would actually not be too hard to implement. It would require more if statements for sure, but it's not crazy.

pdeffebach avatar May 18 '20 16:05 pdeffebach

Above we took the case of a function returning a scalar, in which case the value can just be broadcasted to match the expected number of rows. But we haven't discussed the situation where you would pass a function returning the same number of rows as the input, like + or ByRow(f). In that case, @matthieugomez noted that we should avoid calling the function on rows with missing values, and use missing for them. This is equivalent to passmissing.

Implementing this is easy in the ByRow case, as we can just wrap the function in passmissing. But it's more tricky for the general case: if we wrap the columns in skipmissings as I suggested above, the function will can then return either a scalar (as in the case discussed above) or a vector with as many elements as there are non-missing values in the input. In the latter case, we have to copy values to a new vector with one entry per row, inserting missing values in the right places. This is trivial to do e.g. using a SubArray view of non-missing rows. But that requires computing the indices of these rows, which skipmissings doesn't do (for performance reasons).

So we would have to lose a bit of performance to be able to support this. Maybe it's not the end of the world given that it would make things simpler: we would be able to pass the user-provided function a SubArray of listwise complete observations instead of a SkipMissings object, so everything would work like for plain Vector (e.g. +). People who use functions that return a scalar (which isn't the most common case for select and transform) and want something faster can use skipmissings manually.

Finally, note that this approach won't work for combine, as we don't know the expected number of rows. There, skipmissing=true would just skip rows with missing, without filling the corresponding entries with missing. That's probably OK but it's worth highlighting. To stress the difference, we could call the argument to select/transform passmissing instead of skipmissing, though that would make things harder to grasp for users.

nalimilan avatar Jul 07 '20 20:07 nalimilan

I answer to https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-670060388 here, per suggestion of @nalimilan.

With select, skipmissing would apply the functions on row for which none of the argument is missing

This is not what select does by default. It does not apply functions to rows. It passes the whole vectors to the functions. So I am not clear what would be passed to such functions.

With combine, skipmissing would apply the functions on rows for which none of the argument is missing.

Again combine by default passes whole columns. Now what you propose is problematic in e.g. the following case:

julia> df = DataFrame(x=[1,missing,2], y=1:3)
3×2 DataFrame
│ Row │ x       │ y     │
│     │ Int64?  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 1     │
│ 2   │ missing │ 2     │
│ 3   │ 2       │ 3     │

julia> combine(df, :x => identity, :y => identity)
3×2 DataFrame
│ Row │ x_identity │ y_identity │
│     │ Int64?     │ Int64      │
├─────┼────────────┼────────────┤
│ 1   │ 1          │ 1          │
│ 2   │ missing    │ 2          │
│ 3   │ 2          │ 3          │

as with skipmissing it would be not clear how many rows should be retained in such a function.

By the way, I am not sure why combine(df, :x) or combine(df, :x => ByRow(.))

Well they are accepted because they follow standard rules:

  • combine(df, :x) gets expanded to combine(df, :x => identity => :x) and all is consistent what happens
  • combine(df, :x => ByRow(f)) is just a shorthand for combine(df, :x => x -> f.(x)) and again all is consistent

(so in short - please remember that we ALWAYS transform any expression to a canonical form src => fun => dst before evaluation, the shorthand forms are helpers that just make things easier to write - but every such expression has a well defined interpretation using src => fun => dst minilanguage)

bkamins avatar Aug 06 '20 17:08 bkamins

I understood your question as asking the desired behavior, not how to implement it with the current implementation of these functions in DataFrames.

As for combine(df, :x => identity, :y => identity, skipmissing = true), it would report "ERROR: ArgumentError: New columns must have the same length as old columns". Exactly the same as what happens when doing combine(df, :x => collect ∘ skipmissing, :y => collect ∘ skipmissing).

matthieugomez avatar Aug 06 '20 17:08 matthieugomez

Then for combine this can be the rule you propose and it is consistent. I understand that for ByRow you want combine to behave like select described below (i.e. if row contains missing store missing in the result).

However, for select I would need to understand more exactly what you want. So you are saying the following for e.g. select(df, :x => fun => :y, skipmissing=true) should happen (not that it has to be implemented this way but to make it precise what should happen):

  • allocate a new column y initially holding missing
  • find all indices of non-missing rows in :x, call it idxs
  • pass df.x[idxs] to fun and collect the result res
  • perform pseudo-broadcasting of res into view(y, idxs) (following the standard rules of pseudo-broadcasting in DataFrames.jl)
  • store y in the variable :y in the target data frame

Did I understand correctly what you would like to have?

bkamins avatar Aug 06 '20 17:08 bkamins

Then for combine this can be the rule you propose and it is consistent. I understand that for ByRow you want combine to behave like select described below (i.e. if row contains missing store missing in the result).

No, I think combine should just return the result on the function applied on collect(skipmissing(x)). I don't really think that anyone would use combine with ByRow anyway, so it does not really matter. But the rule is just more consistent this way.

However, for select I would need to understand more exactly what you want. So you are saying the following for e.g. select(df, :x => fun => :y, skipmissing=true) should happen (not that it has to be implemented this way but to make it precise what should happen):

  • allocate a new column y initially holding missing
  • find all indices of non-missing rows in :x, call it idxs
  • pass df.x[idxs] to fun and collect the result res
  • perform pseudo-broadcasting of res into view(y, idxs) (following the standard rules of pseudo-broadcasting in DataFrames.jl)
  • store y in the variable :y in the target data frame

Yes exactly.

matthieugomez avatar Aug 06 '20 18:08 matthieugomez

No, I think combine should just return the result on the function applied on collect(skipmissing(x)).

Ah - OK, so then combine and select would be significantly different indeed.

I don't really think that anyone would use combine with ByRow anyway

Actually if we define the rule you propose it will make sense 😄 - depending on what you want in the output you will use combine or select.

bkamins avatar Aug 06 '20 18:08 bkamins

Ah - OK, so then combine and select would be significantly different indeed.

I think collect(skipmissing(x)) was meant to be an approximation. The skipmissing part of both functions should be the same. But I agree that we need to be stricter about the length of the returned columns in combine because we don't have a reference length we know everything should be and fill in missing accordingly.

However see #2211. Are we over-thinking this? Perhaps we can just operate on a SubDataFrame and overload those functions instead of having a skipmissing keyword argument.

pdeffebach avatar Aug 06 '20 22:08 pdeffebach

I think the difference is relevant with piping. When we make filter produce views and then allow adding columns to SubDataFrame that is not subsetted on columns you will be able to write something like:

@pipe df |> filter(:x => !ismissing, _, view=true) |> transform(_, :x => abs) |> parent |> ... some more piping ...

and the issue is that you have to use parent to unwrap the filter you have applied.

Having skipmissing argument for e.g. transform would not require this wrapping/unwrappng, and you would just write:

@pipe df |> |> transform(_, :x => abs, skipmissing=true) |> ... some more piping ...

so I would be ok with adding this skipmissing kwarg.

But for me (as you probably know 😄) the crucial thing is to precisely understand the desired functionality and the dimensions here are:

  • filter (this is a separate case for sure - that is why I opened a separate issue for it)
  • combine vs select/transform combined with "whole column" vs ByRow transformation (and here we have 4 possibilities and it would be great if we all had a clear vision and consensus how each of them should work)

bkamins avatar Aug 06 '20 22:08 bkamins

I agree with @matthieugomez 's interpretation. select and transform return as many rows as the input so it's natural that we require the user function to return as many (non-missing rows) as we passed to it, and fill others with missing. OTC combine can return anything, so we should just pass the user function a view of complete rows and take whatever it returns. ByRow shouldn't get any special treatment, it's just the application of the general rules to a broadcasted function.

nalimilan avatar Aug 07 '20 08:08 nalimilan

Having skipmissing argument for e.g. transform would not require this wrapping/unwrappng, and you would just write:

@pipe df |> |> transform(_, :x => abs, skipmissing=true) |> ... some more piping ...

Sorry to have this discussion across 2 issues, but I think it would be good to "un-filter" automatically in a transform call since that is what we do with a grouped data frame.

What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal SubDataFrame

pdeffebach avatar Aug 07 '20 13:08 pdeffebach

What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal SubDataFrame.

Yes

bkamins avatar Aug 07 '20 14:08 bkamins

SkipMissing([:a,:b]) => or => SkipMissing(corr) => would be more flexible than skipmissing=true because it can be applied to individual operations rather than the whole transform call.

jtrakk avatar Sep 12 '21 20:09 jtrakk