DataFrames.jl
DataFrames.jl copied to clipboard
Add a skipmissing kwarg to select/transform/combine
I was thinking that it would be nice to have a skipmissing keyword argument in transform/select/combine
The idea is that it would only use rows where none of the observations for the variables used in the call are missing. (It is possible because user must already specify which variables are used in a call to transform/select/combine).
That would make it easier to skip missing without, and, in particular, to handle the case of several variables (or weights).
transform(df, [:x, :y] => (x,y) -> corr(x,y), skipmissing = true)
This would essentially be a shorthand for transform!(dropmissing(df, cols), ...)
.
The benefit would be that you would have automatically resolved cols
argument to dropmissing
.
The problem is that it would break the invariant of transform
that we do not change the number of rows in the returned value.
Given the functionality is easy enough to be obtained via dropmissing
I am not 100% sure if it is worth to add it. But let us wait for other users to comment on this.
In my mind it would not change the number of rows. It would just compute the function using rows where none of the cols is missing (and return a missing value for rows where one of the cols is missing).
Something like
function wrapper(fun, x, y)
sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
sub_out = fun(sx, sy)
full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases
return full_out
end
Note that currently corr(sx, sy)
doesn't work... see here.
Yes I guess this would be quite convenient. In practice most people who work with data would almost never have to use the skipmissing
function directly.
But to be efficient, the implementation cannot always collect the result. The only way to be really fast in all cases is to wrap the columns in skipmissing
or skipmissings
(depending on the number of columns). So for cor
to work we still need something like https://github.com/JuliaLang/Statistics.jl/pull/34 (which collects the result because it's relatively fast, but that's not the case for all functions).
Alternatively, it could apply the function on a view of columns no?
No, since you would need to go over all values first just to find out whether they are missing or not. That's quite costly if all you want to do is e.g. compute the mean. That's why we have skipmissing
/skipmissings
: it's basically a lazy view.
I just looked over the select
and transform
code. Thanks to @bkamins's hard work, this would actually not be too hard to implement. It would require more if
statements for sure, but it's not crazy.
Above we took the case of a function returning a scalar, in which case the value can just be broadcasted to match the expected number of rows. But we haven't discussed the situation where you would pass a function returning the same number of rows as the input, like +
or ByRow(f)
. In that case, @matthieugomez noted that we should avoid calling the function on rows with missing values, and use missing
for them. This is equivalent to passmissing
.
Implementing this is easy in the ByRow
case, as we can just wrap the function in passmissing
. But it's more tricky for the general case: if we wrap the columns in skipmissings
as I suggested above, the function will can then return either a scalar (as in the case discussed above) or a vector with as many elements as there are non-missing values in the input. In the latter case, we have to copy values to a new vector with one entry per row, inserting missing values in the right places. This is trivial to do e.g. using a SubArray
view of non-missing rows. But that requires computing the indices of these rows, which skipmissings
doesn't do (for performance reasons).
So we would have to lose a bit of performance to be able to support this. Maybe it's not the end of the world given that it would make things simpler: we would be able to pass the user-provided function a SubArray
of listwise complete observations instead of a SkipMissings
object, so everything would work like for plain Vector
(e.g. +
). People who use functions that return a scalar (which isn't the most common case for select
and transform
) and want something faster can use skipmissings
manually.
Finally, note that this approach won't work for combine
, as we don't know the expected number of rows. There, skipmissing=true
would just skip rows with missing, without filling the corresponding entries with missing
. That's probably OK but it's worth highlighting. To stress the difference, we could call the argument to select
/transform
passmissing
instead of skipmissing
, though that would make things harder to grasp for users.
I answer to https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-670060388 here, per suggestion of @nalimilan.
With
select
,skipmissing
would apply the functions on row for which none of the argument is missing
This is not what select
does by default. It does not apply functions to rows. It passes the whole vectors to the functions. So I am not clear what would be passed to such functions.
With
combine
,skipmissing
would apply the functions on rows for which none of the argument is missing.
Again combine
by default passes whole columns. Now what you propose is problematic in e.g. the following case:
julia> df = DataFrame(x=[1,missing,2], y=1:3)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64? │ Int64 │
├─────┼─────────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
│ 3 │ 2 │ 3 │
julia> combine(df, :x => identity, :y => identity)
3×2 DataFrame
│ Row │ x_identity │ y_identity │
│ │ Int64? │ Int64 │
├─────┼────────────┼────────────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ 2 │
│ 3 │ 2 │ 3 │
as with skipmissing
it would be not clear how many rows should be retained in such a function.
By the way, I am not sure why
combine(df, :x)
orcombine(df, :x => ByRow(.))
Well they are accepted because they follow standard rules:
-
combine(df, :x)
gets expanded tocombine(df, :x => identity => :x)
and all is consistent what happens -
combine(df, :x => ByRow(f))
is just a shorthand forcombine(df, :x => x -> f.(x))
and again all is consistent
(so in short - please remember that we ALWAYS transform any expression to a canonical form src => fun => dst
before evaluation, the shorthand forms are helpers that just make things easier to write - but every such expression has a well defined interpretation using src => fun => dst
minilanguage)
I understood your question as asking the desired behavior, not how to implement it with the current implementation of these functions in DataFrames.
As for combine(df, :x => identity, :y => identity, skipmissing = true)
, it would report "ERROR: ArgumentError: New columns must have the same length as old columns". Exactly the same as what happens when doing combine(df, :x => collect ∘ skipmissing, :y => collect ∘ skipmissing)
.
Then for combine
this can be the rule you propose and it is consistent. I understand that for ByRow
you want combine
to behave like select
described below (i.e. if row contains missing
store missing
in the result).
However, for select
I would need to understand more exactly what you want. So you are saying the following for e.g. select(df, :x => fun => :y, skipmissing=true)
should happen (not that it has to be implemented this way but to make it precise what should happen):
- allocate a new column
y
initially holdingmissing
- find all indices of non-missing rows in
:x
, call itidxs
- pass
df.x[idxs]
tofun
and collect the resultres
- perform pseudo-broadcasting of
res
intoview(y, idxs)
(following the standard rules of pseudo-broadcasting in DataFrames.jl) - store
y
in the variable:y
in the target data frame
Did I understand correctly what you would like to have?
Then for
combine
this can be the rule you propose and it is consistent. I understand that forByRow
you wantcombine
to behave likeselect
described below (i.e. if row containsmissing
storemissing
in the result).
No, I think combine
should just return the result on the function applied on collect(skipmissing(x))
. I don't really think that anyone would use combine
with ByRow
anyway, so it does not really matter. But the rule is just more consistent this way.
However, for
select
I would need to understand more exactly what you want. So you are saying the following for e.g.select(df, :x => fun => :y, skipmissing=true)
should happen (not that it has to be implemented this way but to make it precise what should happen):
- allocate a new column
y
initially holdingmissing
- find all indices of non-missing rows in
:x
, call itidxs
- pass
df.x[idxs]
tofun
and collect the resultres
- perform pseudo-broadcasting of
res
intoview(y, idxs)
(following the standard rules of pseudo-broadcasting in DataFrames.jl)- store
y
in the variable:y
in the target data frame
Yes exactly.
No, I think combine should just return the result on the function applied on
collect(skipmissing(x))
.
Ah - OK, so then combine
and select
would be significantly different indeed.
I don't really think that anyone would use combine with
ByRow
anyway
Actually if we define the rule you propose it will make sense 😄 - depending on what you want in the output you will use combine
or select
.
Ah - OK, so then
combine
andselect
would be significantly different indeed.
I think collect(skipmissing(x))
was meant to be an approximation. The skipmissing
part of both functions should be the same. But I agree that we need to be stricter about the length of the returned columns in combine
because we don't have a reference length we know everything should be and fill in missing
accordingly.
However see #2211. Are we over-thinking this? Perhaps we can just operate on a SubDataFrame
and overload those functions instead of having a skipmissing
keyword argument.
I think the difference is relevant with piping. When we make filter
produce views and then allow adding columns to SubDataFrame
that is not subsetted on columns you will be able to write something like:
@pipe df |> filter(:x => !ismissing, _, view=true) |> transform(_, :x => abs) |> parent |> ... some more piping ...
and the issue is that you have to use parent
to unwrap the filter
you have applied.
Having skipmissing
argument for e.g. transform
would not require this wrapping/unwrappng, and you would just write:
@pipe df |> |> transform(_, :x => abs, skipmissing=true) |> ... some more piping ...
so I would be ok with adding this skipmissing
kwarg.
But for me (as you probably know 😄) the crucial thing is to precisely understand the desired functionality and the dimensions here are:
-
filter
(this is a separate case for sure - that is why I opened a separate issue for it) -
combine
vsselect
/transform
combined with "whole column" vsByRow
transformation (and here we have 4 possibilities and it would be great if we all had a clear vision and consensus how each of them should work)
I agree with @matthieugomez 's interpretation. select
and transform
return as many rows as the input so it's natural that we require the user function to return as many (non-missing rows) as we passed to it, and fill others with missing
. OTC combine
can return anything, so we should just pass the user function a view of complete rows and take whatever it returns. ByRow
shouldn't get any special treatment, it's just the application of the general rules to a broadcasted function.
Having
skipmissing
argument for e.g.transform
would not require this wrapping/unwrappng, and you would just write:@pipe df |> |> transform(_, :x => abs, skipmissing=true) |> ... some more piping ...
Sorry to have this discussion across 2 issues, but I think it would be good to "un-filter" automatically in a transform
call since that is what we do with a grouped data frame.
What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal SubDataFrame
What I think you are saying, though, is that this would require a special type because it is a slightly different logic than what one would expect from a normal SubDataFrame.
Yes
SkipMissing([:a,:b]) =>
or => SkipMissing(corr) =>
would be more flexible than skipmissing=true
because it can be applied to individual operations rather than the whole transform
call.