DataFrames.jl
DataFrames.jl copied to clipboard
Skipping missing values more easily
It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df)
or combine(gd, :col => (x -> sum(skipmissing(x)))
isn't ideal. One proposal to alleviate this is https://github.com/JuliaData/DataFrames.jl/issues/2258: add a skipmissing
argument to functions like filter
, select
, transform
and combine
to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.
That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true
all the time when dealing with missing values. I figured two solutions could be considered to improve this:
- Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g.
@linqskipmissing
macro or a statement likeskipmissing
within a@linq
block that would automatically passskipmissing=true
to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though. - Have a field in
DataFrame
objects that would store the default value to use for theskipmissing
argument. By default it would befalse
, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to callskipmissing!(df, true)
once and then avoid repeating it.
Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at https://github.com/JuliaStats/DataArrays.jl/issues/39. I think it's fair to say that we know have enough experience now to make a decision. One argument against implementing this at the DataFrame
level is that it will have no effect on operations applied directly to column vectors, like sum(df.col)
. But that's better than nothing.
Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard
- Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g.
@linqskipmissing
macro or a statement likeskipmissing
within a@linq
block that would automatically passskipmissing=true
to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
This is a good idea, however I also like skipmissing = true
at the level of a transform
call or even at the level of argument because it's explicit.
Perhaps DataFramesMeta could provide a block-level skipmissing
option as well as a macro like
@transform(df, @sm y = :x1 .+ mean(:x2))
Where @sm
macro, does the necessary transformation described in #2258
function wrapper(fun, x, y)
sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
sub_out = fun(sx, sy)
full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases
return full_out
end
- Have a field in
DataFrame
objects that would store the default value to use for theskipmissing
argument. By default it would befalse
, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to callskipmissing!(df, true)
once and then avoid repeating it.
I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.
It's great you're thinking about how to make working with missing values easier! I 100% agree.
A macro may be good.
I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using)
does not retain the option?
A third possibility would be to allow users to change the default option for transform
, etc, say by writing SKIPMISSING = true
at the start of a script.
To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?
I've long been frustrated with the difficulty of working with missing
given that it infects all downstream operations and wished it worked differently in Base. (I acknowledge my frustration could be misguided — perhaps it's saved me from some horrible bugs without knowing it :-) )
Perhaps DataFramesMeta could provide a block-level
skipmissing
option as well as a macro like@transform(df, @sm y = :x1 .+ mean(:x2))
Where
@sm
macro, does the necessary transformation described in #2258
@pdeffebach Yes that was more or less what I had in mind. Though if you have to repeat this for each call it's not much better than passing skipmissing=true
. Being able to apply it to a chain of operations would already be more useful.
I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.
@pdeffebach Yes. OTOH it's not so different from e.g. creating a column: if you get an error because it doesn't exist or it's incorrect you have to find where it's been defined, which can be quite far from where the error happens.
I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like
df = merge(df, df_using)
does not retain the option?A third possibility would be to allow users to change the default option for
transform
, etc, say by writingSKIPMISSING = true
at the start of a script.
@matthieugomez Yes, losing the option after transformations could be annoying. Though it could be propagated across some operations which always preserve missing values: in these cases there's little point in forcing you to repeat that you know there are missing values. Where safety checks matter is when you could believe you got rid of missing values and for some reason it's not the case. But I admit that this option would decrease safety (even if not propagated automatically) as you could pass a data frame to a function which isn't prepared to handle missing values and it would silently skip them.
A global setting would only aggravate these issues IMHO since it would affect completely unrelated operations, possibly in packages, some of which may rely on the implicit skipmissing=false
.
To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?
@c42f My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:
using Cassette, Missings
Cassette.@context PassMissingCtx
Cassette.overdub(ctx::PassMissingCtx, f, args...) = passmissing(f)(args...)
# do not lift special functions which already handle missing
for f in (:ismissing, :(==), :isequal, :(===), :&, :|, :⊻)
@eval begin
Cassette.overdub(ctx::PassMissingCtx, ::typeof($f), args...) = $f(args...)
end
end
f(x) = x > 0 ? log(x) : -Inf
julia> Cassette.@overdub PassMissingCtx() f(missing)
missing
julia> Cassette.@overdub PassMissingCtx() f(1)
0.0
julia> Cassette.@overdub PassMissingCtx() ismissing(missing)
true
julia> Cassette.@overdub PassMissingCtx() missing | true
true
Skipping missing values in reductions will be a little harder, but it's doable if we only want to handle a known list of functions. For example this quick hack works:
Cassette.overdub(ctx::PassMissingCtx, ::typeof(sum), x) = sum(skipmissing(x))
julia> x = [1, missing];
julia> Cassette.@overdub PassMissingCtx() sum(x)
1
Maybe this kind of thing could be made simpler to use by providing a macro like @passmissing
as a shorthand for Cassette.@overdub PassMissingCtx()
. But it's important to measure all the implications of this approach: since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects. For example, Cassette.@overdub PassMissingCtx() [missing]
gives missing
, which will break lots of package code. A safer approach would be to only apply passmissing
to a whitelist of functions for which it makes sense (scalar functions, mainly), like DataValues does -- with the drawback that the list is kind of arbitrary.
In the end, maybe Cassette is too powerful for what we actually need in the context of DataFrames. In practice with select
/transform
/combine
it makes sense to only apply passmissing
to the top-level functions, rather than recursively. And for reductions, passing views of complete rows as proposed at https://github.com/JuliaData/DataFrames.jl/issues/2258 should be enough, when combined with a convenient DataFramesMeta syntax.
My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:
Very cool. IIUC people have started to use other libraries like IRTools which plug into the compiler in a similar way to Cassette but are not Cassette itself, but yes, that's roughly what I had in mind. One downside is that it's a pretty heavy weight tool to deploy for something like missing.
since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects
I think this is the bigger problem; it's not clear how deep to recurse and it could definitely have unintended consequences. It's a very similar problem with floating point rounding modes, where having the rounding mode as dynamic state really doesn't work well as it infects other calculations which weren't programmed to deal with a different rounding mode.
So I agree it does seem much safer and more sensible to have a macro which lowers only the syntax within the immediate expression to be more permissive with missing
. Actually I wonder whether you could do something more like broadcast lowering where all function calls within the expression are lifted in a certain way such that dispatch can be customized, so, eg, @linq f(x,y,z)
becomes linq_missing(f, x, y, z)
.
In terms of your examples,
@linq filter(:col =>(x -> x > 1), df)
# means
linq_missing(filter, :col => (x -> linq_missing(>, x, 1)), df)
@linq combine(gd, :col => (x -> sum(x))
# means
linq_missing(combine, gd, :col => (x -> linq_missing(sum, x)))
Something like this has very regular lowering rules and gives a measure of extensibility for user defined functions.
One alternative is that filter/transform/combine could always skip missing (i.e. kwarg skipmissing = true
is the default).
Can you please help me understanding this proposal in detail? What should be the outcome of the following operations if we added skipmissing
option (now I am showing what we get now and would like to understand what we should get if we add skipmissing=true
):
julia> df = DataFrame(x=[1,missing,2])
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> select(df, :x)
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> select(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> select(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> combine(df, :x)
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> combine(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
julia> combine(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ missing │
│ 3 │ 2 │
AFAICT this question regards #2258, better keep matters separate as discussions are already tricky enough.
With select
, skipmissing
would apply the functions on row for which
none of the argument is missing, return the output on these rows, and
missing otherwise (to maintain the same number of rows as the original
dataframe). So skipmissing = true
would not change anything in your
example. Similarly for transform
.
With combine
, skipmissing
would apply the functions on rows for which
none of the argument is missing.
julia> combine(df, :x)
3×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> combine(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> combine(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│ │ Int64? │
├─────┼────────────┤
│ 1 │ 1 │
│ 2 │ 2 │
That being said, I am not sure why combine(df, :x)
or combine(df, :x => ByRow(.))
are accepted and it sounds that maybe it would be clearer to
disallow them?
Ah - sorry - I thought this issue "overriden" that one. So actually this issue should be on hold till we resolve #2258, because it only asks to make #2258 simpler - right?
Could we think about making filter
/transform
/combine
/select
have the kwarg skipmissing = true
by default? It's not unheard of — that's what Stata and Panda (for combine
) do. Also, data.table
and dplyr
automatically skip missing in their versions of filter
.
I would not have a problem with this. We already have it in groupby
so it would be consistent. So I assume you want:
- for
filter
wrap a predicatep
inx -> coalesce(p(x), false)
- for
transform
/combine
/select
we have two decisions:- do we pass
view
s orskipmissing
wrappers? (given the second question probablyview
s, it is going to be a bit slow, but I thinkskipmissing=true
does not have to be super fast as it is a convenience wrapper) - what do we do if multiple columns are passed - skip rows that have at least one
missing
? (this is whatgroupby
does)
- do we pass
Regarding 2, note that as discussed at https://github.com/JuliaData/DataFrames.jl/issues/2258 we have to pass views for select
and transform
as we need to be able to reassign values to the non-missing rows in the input. When multiple columns are passed, we should only keep complete observations, otherwise something like [:x, :y] => +
wouldn't work due to different lengths.
I should have noted that before thinking about making it the default, we should first implement this keyword argument and see how it goes.
@nalimilan Yes starting with a kwarg and see how it goes is the right way. The only reason I was mentioning the default value is that 1.0 may mean that this kind of stuff won't be able to change later on — I hope it's not the case.
@bkamins I agree with 1 and 2.1 (views). I think for the case of [:x, :y] => +
it should skip missing on both (as @nalimilan points out), but for [:x, :y] .=> mean
, or for :x => mean, :y => mean
, it should skipmissing on x
and y
separately.
Yes - for [:x, :y] .=> mean
this is separate (which means that transform
/select
will throw an error in cases when there is no match and a vector is returned).
OK - so it seems we have a consensus here. I will propose a PR.
Just a maintenance question - when we add skipmissing
kwarg should both #2314 and #2258 be closed or something more should be done/kept track of?
That would be awesome, thanks! Jus to make sure I understand, what do you mean by `transform/select will throw an error in cases when there is no match and a vector is returned).'?
If this happens, https://github.com/JuliaData/DataFrames.jl/issues/2258 should definitely be closed, but this thread should be open if the default value is false, since it may still be cumbersome to add it at every command.
transform/select will throw an error in cases when there is no match and a vector is returned
I mean tat this would still error:
julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ missing │ missing │
│ 3 │ 2 │ missing │
julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.
(but I guess this is natural to do it this way)
this thread should be open if the default value is false
OK - we can keep it open then. For 1.0 release the default will be false to be non-breaking.
That would be awesome, thanks!
Started thinking about it :). The trickiest part will be fast aggregation functions for GroupedDataFrame
case (as usual), but also they should be doable.
transform/select will throw an error in cases when there is no match and a vector is returned
I mean tat this would still error:
julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing]) 3×2 DataFrame │ Row │ a │ b │ │ │ Int64? │ Int64? │ ├─────┼─────────┼─────────┤ │ 1 │ 1 │ 1 │ │ 2 │ missing │ missing │ │ 3 │ 2 │ missing │ julia> select(df, :a => collect∘skipmissing) ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.
(but I guess this is natural to do it this way)
Just to clarify, I think the following should work:
select(df, :a => collect, skipmissing = true)
and actually also
select(df, :a => collect∘skipmissing , skipmissing = true)
since the function collect∘skipmissing is one to one on the set of values for which a is not missing. In both cases, it just returns the same thing as a
.
However, the following should error
combine(df, :a => collect, :b => collect, skipmissing = true)
because the length of collect∘skipmissing(a)
is different from the length of collect∘skipmissing(b)
Just to clarify, I think the following should work:
select(df, :a => collect, skipmissing = true)
Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:
select(df, :a => collect , skipmissing = true)
to work but
select(df, :a => collect∘skipmissing)
will fail.
Which means to me that if we want to add such a kwarg it should not be called skipmissing
as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select
and transform
as in combine
you expect a different behaviour.
In particular:
select(df, :a => mean, skipmissing = true)
and
select(df, :a => mean∘skipmissing)
would both work but produce different results.
My proposal above about "spreading" missing values seems relevant here.
select(df, :a => collect, skipmissing = true)
This takes :a
, applies skipmissing
, collects the result, then loops through indices of :a
and, when not missing`, fills it in.
select(df, :a => collect∘skipmissing , skipmissing = true)
This takes :a
, applies skipmissing
, then applies it again, collect
s the result, and loops through indices of :a
as before.
select(df, :a => mean, skipmissing = true)
Takes :a
, applies skipmissing
, takes the mean, then loops through and fills in indices of :a
where indices of :a
are not missing. Perhaps we can make an exception for scalar values? If a scalar is returned, it gets spread across the entire vector, regardless of missing values? That way it would match
select(df, :a => mean∘skipmissing; skipmissing = false)
EDIT: After playing around with R, now I'm not so sure. I think the biggest annoyance is with filter
currently and we should start with that since everyone agrees on that behavior and we are confident it won't result in unpredictable / inconsistent behavior.
Just to clarify, I think the following should work: select(df, :a => collect, skipmissing = true)
Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:
select(df, :a => collect , skipmissing = true)
to work but
select(df, :a => collect∘skipmissing)
will fail.
Which means to me that if we want to add such a kwarg it should not be called
skipmissing
as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved forselect
andtransform
as incombine
you expect a different behaviour.In particular:
select(df, :a => mean, skipmissing = true)
and
select(df, :a => mean∘skipmissing)
would both work but produce different results.
Exactly. Even though I don’t like different keyword argument, I see the potential confusion for select/transform.
And I see the uses of both functionalities :smile:, simply they should have a different name. Actually we can have both, i.e. skipmissing
doing "hard" skipmissing
everywhere and something else which passes missings, and the name could be passmissing
as passmissing
in Missing.jl does exactly this kind of thing.
Yes, passmissing
would be a more accurate name. OTOH there would be reasons to use skipmissing
:
- Using the same name everywhere makes it easier to remember (as noted above in the discussion)
- In
combine
, since there's no correspondence between input and output rows, we cannot fill non-missing entries, so the only behavior we can implement isskipmissing
, notpassmissing
-- except for the special-case ofByRow
which isn't very useful withcombine
. Using consistent names betweencombine
andselect
/transform
sounds like a good thing, even if the behavior differs a bit.
Overall I'm not sure which choice is best. As suggested by @pdeffebach, we could have skipmissing
with a special case for functions that return a single scalar, in which case we would fill even rows with only missing values: this is justified because there's no correspondence between input and output rows in that case (like with combine
). We could also just throw an error telling to use mean∘skipmissing
for now.
I think I've found a possible solution. We could add a passmissing=true
argument to select
and transform
(as described above). The default to true
would be fine for 99% of use cases and would probably not break a lot of code (if any). It would also be consistent with the approach Julia takes regarding missing values: they either propagate or throw an error, but are never skipped silently by default.
Then we would also add a skipmissing=false
argument to combine
and filter
, consistent with groupby
. People would have to pass skipmissing=true
or use skipmissing(...)
manually. That's a bit annoying, but it's safe as missing values will never be dropped silently even if they were propagated silently due to the passmissing=true
default. Since passmissing=true
by default, users won't have to use that argument often (or never), so they won't be confused by the existence of both passmissing
and skipmissing
.
How does that sound? The main drawback of this approach is the inconsistency between select
/transform
and combine
, but that's probably OK as these are quite different operations anyway.
That’s interesting.
I was leaning on the opposite solution: skipmissing = true default for filter and combine, passmissing = false default for transform.
Skipmissing = true for filter is consistent with other languages (at least stata/dplyr/data.table), and that is what people want in 100% of cases.
Skipmissing = true for combine is consistent with Stata, panda. It is also very easy to do comprehend.
In contrast, passmissing = true in transform/select would be unique to Julia. It means that stuff like :x => ismissing will give missing instead of true for missing values. I still think it is worth having it but it may be harder to grasp conceptually.
So, in conclusion, my preferred solution would be skipmissing = true for combine and filter, and passmissing = true for select/transform. I get your point about compatibility with the spirit of missing in Base but I don’t see it as that important.
Sorry what is passmissing
? the concept of passmissing
only makes sense to me in ByRow
where you could conceivably have missing
as an input instead of a vector.
@pdeffebach I was using for the behavior I was describing for select/transform, whether it is actually called passmissing or skipmissing.
Skipmissing = true for combine is consistent with Stata, panda. It is also very easy to do comprehend.
I'm not sure that's true. The problem Milan is describing is
combine(groupby(df, :a), :x => t -> first, :y => t -> first, skipmissing = true)
In Stata you would just get missing
s if the first observation is missing
. But you would be confident that this was truly the first observation for every group.
Under your proposal, the functions would expand to be
combine(groupby(df, :a), :x => t -> first(skipmissing(t)), :y => t -> first(skipmissing(t))
If :x
has missing
as the first element, then you are going to get the second element of :x
and the first element of :y
. This will definitely lead to bugs.
That’s right, I did not think about that.
Note that, the way I understand it, first
in select, with passmissing =
true, would suffer for the same problem (ie it would give the first non
missing).
Can you think of other examples that may give surprising results in combine or select/transform? It would be nice to do a list (beyond first and ismissing).
On Tue, Aug 25, 2020 at 7:03 AM pdeffebach [email protected] wrote:
Skipmissing = true for combine is consistent with Stata, panda. It is also very easy to do comprehend.
I'm not sure that's true. The problem Milan is describing is
combine(groupby(df, :a), :x => t -> first, :y => t -> first, skipmissing = true)
In Stata you would just get missings if the first observation is missing. But you would be confident that this was truly the first observation for every group.
Under your proposal, the functions would expand to be
combine(groupby(df, :a), :x => t -> first(skipmissing(t)), :y => t -> first(skipmissing(t))
If :x has missing as the first element, then you are going to get the second element of :x and the first element of :y. This will definitely lead to bugs.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-680044858, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNJZKXBPXFC4LORTWDSCPAB7ANCNFSM4OTTLDUA .
Whether skipmissing=true
take into account missings in all columns or only to those passed to each function is an interesting design decision to make. But AFAICT it's orthogonal to whether we should use passmissing=true
and skipmissing=false
by default or not. ~And it doesn't concern passmissing=true
, which would not apply to combine
~ EDIT: it does if we allow broadcasting a scalar result to all non-missing rows.
The main question for choosing the defaults is whether we favor safety or convenience. I'm always torn between these two goals, which is why I tried to proposed in the OP various solutions that don't require silently ignoring missing values by default.
Sorry what is
passmissing
? the concept ofpassmissing
only makes sense to me inByRow
where you could conceivably havemissing
as an input instead of a vector.
@pdeffebach It also works for any function passed to select
and transform
: we just pass a view of non-missing rows, and copy the resulting vector to the non-missing rows, filling with missing
other entries. This is what you proposed above.
I think @pdeffebach was just disagreeing with my claim that combine with skipmissing = true is what Stata does (while mean ignores missing values, first gets the first element of the column, whether it’s missing or not)
We could also define the kwarg maskmissing, and use it for filter/combine/transform, rather than using a combination of passmissing/skipmissing kwarg.
The definition of maskmissing is that it applies the operation on rows where columns are non missing. This explanation is consistent with what we want for filter/combine/transform, and so this would allow to get the same kwarg for all of these functions. (of course, for combine, it is exactly the same as skipmissing).
The thing is, people don't really complain about na.rm = TRUE
though, right? I feel like most of the problems we see about missing
are really easily fixed by pointing people to skipmissing
and passmissing
.
combine(gd, :col => (x -> sum(skipmissing(x)))
Is really ugly, yes, but in the future you will be able to write
using DataFramesMeta
@combine(gd, y = sum(skipmissing(col))
which is not that bad.
For me the problem is that for now it requires different strategies for filter (which requires coalesce), combine (which requires skipmissing). Another issue with combine is the whole thing about reduction other several vectors with different missing rows such as weighted mean, correlation etc )
Finally, transform requires sometimes passmissing, sometimes skipmissing, coalesce for ifelse calls, something
for tryparse
, etc. Also, there’s no good way currently to handle functions with vector input and vector output such as cumsum
.
.
That's fair. It would be hard to tackle each of these individually,
-
filter
should applycoalesce
by default. -
skipmissing = true
should applypassmissing
toByRow
functions - We need a lazy broadcasted
ifelse
that skips bad elements - Better
tryparse
or a wholedestring
package which would take a lot of work
Then there’s the whole thing about reduction other several vectors with different missing rows such as weighted mean, correlation etc
Also an issue, skipmissings
helps weighted mean
, but unfortunately cor
didn't make it into Statistics
.
Okay that's fair. I'm on board with a keyword argument with a variety of behaviors depending on the context.
I will take a defensive stance here.
Before I state what I think I will explain the principles of my judgement:
- DataFrames.jl should provide primitives that are flexible enough to perform what user needs in a performant way
- Other packages (like DataFramesMeta.jl or DataFramesMacros.jl) can add convenience wrappers; this does not mean that we do not want to be convenient in DataFrames.jl, but this is not a priority
- We take "safety first" and "consistence with Julia Base" policy
- We do not want to be breaking unless it is really justified
Given this:
filter
should apply coalesce by default.
I think we can add skipmissing
kwarg with false
as default. This is what Base does. If we can convince Julia Base to change the behaviour here then we will sync with that behaviour.
skipmissing = true
should applypassmissing
toByRow
functions
Do you mean it for select
/transform
or also for combine
? (also see my comment below)
- We need a lazy broadcasted
ifelse
that skips bad elements- Better
tryparse
or a wholedestring
package which would take a lot of work
Agreed, but it is out of scope of DataFrames.jl. I would also add x -> coalesce(x, false)
to the list with some convenient name, I think coalescefalse
is explicit but maybe you judge it to be too long. I would add this to Missings.jl.
but unfortunately
cor
didn't make it intoStatistics
.
I think it is an open issue how to do it rather than closed issue as there are many ways to compute cor
on a matrix with missings.
(x -> sum(skipmissing(x)))
this is normally written as sum∘skipmissing
and it is not that bad I think.
Now regarding the design the more I think about it the more I want do the following that gives a fine grained control of what is going on in the function in a single transformation:
For skipmissing
would not be a kwarg but it would be an extension to the minilanguage we have for select
, transform
and combine
.
If in the form:
source => fun => destination
if you wrap source
in skipmissing
then fun
in combine
, select
and transform
gets view
s of passed columns with rows containing missing
s skipped.
In general the idea is that source => fun∘skipmissing
is the same as skipmissing(source) => fun => destination
if source
is a single column, but we add this extension to allow easy handling of skipping rows in multiple-column scenarios
For filter
we do not add anything but just write users to use coalescefalse∘fun
if we agree to add such a function to Missings.jl.
For passmissing
It is not needed. Just tell users to write passmissing(their_function)
- this is exactly what you want in ByRow
and if I understand things correctly this is a main use case - is this correct?
I think this is a good proposal.
We might not even need skipmissing
in the meta-language. We can add a skipmissingsfun
wrapper similar to passmissing
that wraps all <:AbstractVector
inputs in skipmissings
. Though since it applies to the source I think a SkipMissing
is more logical.
@bkamins I am trying to understand your proposal:
- What are you proposing for filter? It sounds like you mention two different solutions.
- how would you handle transform with a function like cumsum?
filter
My comment in the first part was just that I do not want skipmissing=true
by default (no matter what syntax we apply).
My proposal is not to change filter
at all but define coalescefalse(x) = coalesce(x, false)
(the name can be discussed as this one is a bit long) and then just write:
filter(coalescefalse∘fun, df)
The beauty is that it would also work in Julia Base then which is a big benefit.
cumsum
I understand you want this:
julia> df = DataFrame(x=[1,2,missing,3])
4×1 DataFrame
│ Row │ x │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ missing │
│ 4 │ 3 │
julia> transform(df, :x => (x -> cumsum(coalesce.(x, 0)) + x .* 0) => :y)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 3 │
│ 3 │ missing │ missing │
│ 4 │ 3 │ 6 │
Then I agree - I would not handle this in my proposal. I assume that passmissing
is mostly for ByRow
as noted above.
The point is that the use cases like cumsum
I think are quite rare (maybe I am wrong here).
The general idea is to try using function composition and higher order functions rather than keyword arguments as this seems more flexible for the future.
We can add a
skipmissingsfun
wrapper similar topassmissing
that wraps all<:AbstractVector
inputs inskipmissings
.
I was thinking about it, and initially judged that this is not possible as fun
in source => fun => destination
can get three things:
- a list of vectors as positional arguments
- a
NamedTuple
of vectors - an
AbstractDataFrame
But now as I think about it actually it is not a problem as this higher order function could just handle these three cases separately in if
-else
blocks. It is better than my original proposal as it is more composable (select
/transform
/combine
do not even need to know what it does - and this is good because it orthogonality in design makes it much easier to maintain in the long term) The question is what name could be used here. The issue is that skipmissing
is defined in Julia Base and adding skipmissing(::Base.Callable)
would be type piracy (though it is something that is invalid in Base currently - not sure what other name would be good though - maybe dropmissing
?)
@bkamins pointed me here, but unfortunately this thread is a bit longer than I've currently got time for. The reason I was pointed here was that I asked for comments on a proposal that touches on the topic of this thread: I am actually mostly in favour of a more cumbersome, force-the-user-to-think-carefully-and-be-explicit approach for the most part, my main gripe with missing
is when I filter a DataFrame (i.e. use ==
, >
, <
). In those cases, the additional verbosity required to get things to work is quite large, and to me at least the possibility of accidentally doing something unintended is low - I've never encountered a use case where in doing df.col .> 5
I wanted rows for which col
is missing
to be included in the result.
With these two considerations in mind, in my own code I started defining:
⋘(x, y) = (ismissing(x) | ismissing(y)) ? false : (x < y)
⋙(x, y) = (ismissing(x) | ismissing(y)) ? false : (x > y)
⩸(x, y) = (ismissing(x) | ismissing(y)) ? false : (x == y)
I clearly didn't spend much time choosing the opearators, I just wanted infix operators that look similar to but different from the regular comparison operators (one issue with my current selection is that their names are completely different: \verymuchless
, \ggg
and \equivDD
aren't exactly easy to memorize as a group of operators...)
In any case this relatively simple case for me solves the main usability issue I encounter in day-to-day work.
@bkamins you mentioned there might be a way to have ifelse, filter work with missing values in base. You really think so? Maybe that’s one way to go.
Maybe that’s one way to go.
I meant that this issue could be opened in Julia Base and discussed there. I feel that if it would get support it would be 2.0 release the earliest. Therefore I think we should have a solution for now in DataFrames.jl that does not rely on Julia Base.
(still - if you feel it is worth discussing please open an issue on Julia Base; I just recommend to be patient - core devs tend to be conservative and require quite a lot of justification for changes)
EDIT Actually - given our discussions apart from ifeslse
and filter
the third big thing is getindex
with Union{Missing, Bool}}
eltype selector
I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.
Adding skipmissing(f::Callable)
to Base sounds doable. I'm less sure about filter
and getindex
skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.
The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x))
, where as.character
propagates missing values, but isna
does not. R/dplyr also allows mutate(df, y = x - mean(x))
since -
is vectorized, but we require .-
in Julia -- which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).
Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x))
. Then it seems to me that the most intuitive behavior is this:
- all columns accessed as vectors (with
$
) are wrapped in a view of non-missing values by default (i.e.skipmissing=true
) - if some columns are accessed row-wise (no
$
), setpassmissing=true
(since we are in a case equivalent toByRow
) - if a function applied to a column vector (accessed using
$
) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, setpassmissing=true
AFAICT this would do "the right thing" in the following situations:
-
@transform(df, y = mean($x))
: assign to all rows (including those with missing values) the mean of non-missing values -
@transform(df, y = x - mean($x))
: subtract to each non-missing row the mean of non-missing values -
@transform(df, y = cumsum($x))
: assign to rows with non-missing values the cumulative sum of values for these rows -
@transform(df, y = s - cumsum($x))
: subtract to each non-missing row the cumulative sum of values for these rows
Cases that wouldn't do "the right thing" by default are uses of ismissing
(as missing values would always be skipped by default). We decide to be smart and set automatically passmissing=false
or skipmissing=false
when an expression contains ismissing
, or just throw an error with an informative message. A more tricky case are logical operators which implement three-valued logic: with passmissing=true
, true | missing
would be assigned missing
since |
would not actually be called. Maybe that's not a big deal.
I was hoping that DataFramesMeta would just be DataFrames + a macro to
create Pairs expressions (e.g. DataFramesMacros), but maybe that's the
wrong way to look at it. My concern is that having a barebone DataFrames
effectively creates 3 different syntaxes: the simple df.a
syntax, the
transform syntax, and the @transform
syntax.
I need to think more about these proposals. In the meantime, I just wanted
to point out that Stata has two different commands for row-wise vs,
column-wise transforms (gen v.s. egen). That is something to think about if
that would simplify thing, e.g. there could be a row version of
transform
, called, say, alter
.
On Wed, Aug 26, 2020 at 9:41 AM Milan Bouchet-Valat < [email protected]> wrote:
I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.
Adding skipmissing(f::Callable) to Base sounds doable. I'm less sure about filter and getindex skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.
The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x)), where as.character propagates missing values, but isna does not. R/dplyr also allows mutate(df, y = x - mean(x)) since - is vectorized, but we require .- in Julia -- which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).
Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x)). Then it seems to me that the most intuitive behavior is this:
- all columns accessed as vectors (with $) are wrapped in a view of non-missing values by default (i.e. skipmissing=true)
- if some columns are accessed row-wise (no $), set passmissing=true (since we are in a case equivalent to ByRow)
- if a function applied to a column vector (accessed using $) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, set passmissing=true
AFAICT this would do "the right thing" in the following situations:
- @transform(df, y = mean($x)): assign to all rows (including those with missing values) the mean of non-missing values
- @transform(df, y = x - mean($x)): subtract to each non-missing row the mean of non-missing values
- @transform(df, y = cumsum($x)): assign to rows with non-missing values the cumulative sum of values for these rows
- @transform(df, y = s - cumsum($x)): subtract to each non-missing row the cumulative sum of values for these rows
Cases that wouldn't do "the right thing" by default are uses of ismissing (as missing values would always be skipped by default). We decide to be smart and set automatically passmissing=false or skipmissing=false when an expression contains ismissing, or just throw an error with an informative message. A more tricky case are logical operators which implement three-valued logic: with passmissing=true, true | missing would be assigned missing since | would not actually be called. Maybe that's not a big deal.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-680993848, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXI2TMYOSLOB3XBVAWTSCU3KHANCNFSM4OTTLDUA .
@nalimilan - thank you for commenting on this. I guess it is best to move the discussion to DataFramesMeta.jl. My questions would be twofold:
- how would you turn-off this behaviour in DataFramesMeta.jl if it is not wanted
- in
s - cumsum($x)
I was not clear what would happen in the scenarios=[missing, missing, 1]
andx=[1,2,3]
.
@matthieugomez - I agree that we could have DataFramesMacros.jl (when it matures) be just an addition to DataFrames.jl and this should work. But for DataFramesMeta.jl I accept it adds new things (just as e.g. queryverse adds) - the user simply has to choose what one wants to use.
However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt
, leq
, gt
, geq
, eq
defined like:
lt(x,y) = coalesce(x<y, false)
would be super convenient. I would propose to have them in Missings.jl. We would take hold of 5 very short names (bad thing) but would get flexibility in a place where it is much needed, also I think writing lt(x,y)
is not that much worse than x < y
. If we had these five functions I think that mostly we do not need coalescefalse
, as in other cases user can just write x -> coalesce(x, false)
since they will be rare anyway.
EDIT: if we went for this also neq
I agree with @matthieugomez, especially with upcoming multithreading in transform
I think it would be very hard to maintain feature parity if the two packages diverged too much.
I think @nalimilan's proposal is good, but is not functionally different from a DropMissing([:x1, :x2]) => fun
syntax that could go in the data frames mini-language.
I do agree with Milan that ByRow
being the default would solve a lot of these problems.
However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names
lt
,leq
,gt
,geq
,eq
defined like:
I think a very smart macro could fix this easier.
@missingfalse x > y
I think a very smart macro could fix this easier.
@missingfalse x > y
this is clearly doable as it is the same as:
coalesce(x>y, false)
but just a bit longer to write :smile:
My point is - and this is how I understood what @nilshg wanted is to have something that is easy to type. And I feel that writing e.g. lt.(df.col1, df.col2)
is not much worse than df.col1 .< df.col2
and much better than @missingfalse df.col1 .< df.col2
or coalesce.(df.col1 .< df.col2, false)
.
I just wanted to point out that Stata has two different commands for row-wise vs, column-wise transforms (gen v.s. egen). That is something to think about if that would simplify thing, e.g. there could be a row version of
transform
, called, say,alter
.
@matthieugomez Indeed. My proposal is basically a way to combine Stata's gen
and egen
in a single expression, using $
to distinguish them. Anyway I think I would prefer specifying something like ByRow
in @transform
/@select
(whatever the exact syntax) than a separate function, as it makes it easier to understand the API than adding functions with totally different names.
Not allowing mixing column-wise and row-wise operation like in my proposal would certainly be simpler. The main drawback would be that one would have to use broadcasted operations, e.g. x .- mean(x)
rather than x - mean($x)
. Maybe that's OK but that increases the complexity.
* how would you turn-off this behaviour in DataFramesMeta.jl if it is not wanted
@bkamins I think you could just pass passmissing
and skipmissing
arguments to the macro, a bit like we discussed for functions.
* in `s - cumsum($x)` I was not clear what would happen in the scenario `s=[missing, missing, 1]` and `x=[1,2,3]`.
cumsum($x)
would become cumsum([1]) == [1]
, and then we would call cs = cumsum([1]); x -> s - cs
on each non-missing row (i.e. the last one with x == 3
), filling the first two rows with missing
.
However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names
lt
,leq
,gt
,geq
,eq
I must say I'm not a fan of ad-hoc solutions like that. This would solve some of the issues, but not all of them and people will have to learn both the custom functions and the general mechanism using passmissing
. Also we would lose the nice x < y
syntax, which is really lousy for a high-level language. A macro like @missingfalse
would be better, though it's still a bit verbose for DataFramesMeta I would say.
I agree with @matthieugomez, especially with upcoming multithreading in
transform
I think it would be very hard to maintain feature parity if the two packages diverged too much.
@pdeffebach Yes that's a concern. Everything DataFramesMeta does should be translatable relatively directly to DataFrames commands so that we don't implement things twice.
I think @nalimilan's proposal is good, but is not functionally different from a
DropMissing([:x1, :x2]) => fun
syntax that could go in the data frames mini-language.I do agree with Milan that
ByRow
being the default would solve a lot of these problems.
Actually I think my proposal is a bit different, as you need to run first column-wise operations and then insert them in the row-wise function. I'm not sure how it could be translated into the select
/transform
syntax...
OK - given these comments can you explain why you think:
coalesce(x>y, false)
is bad?
I don't think it's bad as a low-level way of doing things, as it always works and is clear about what it does. I just think for DataFramesMeta we need a general solution which allows you to write the code as if there weren't any missing values (possibly adding an argument/macro to enable that behavior), so that you don't need to constantly adjust the code using various strategies which are hard to remember.
OK - for DataFramesMeta.jl this is fine. But then what for DataFrames.jl?
We started a discussion with adding skipmissing
and passmissing
keyword arguments. For me - from the discussion - it turns out that other options like coalesce(x>y, false)
are not that much longer and they are still clear what is going on as you noted. So is then your suggestion to leave things in DataFrames.jl as they are and just improve the documentation to teach users properly use:
-
skipmissing
-
pasmissing
-
coalesce
?
(maybe just adding the skipmissingfun
wrapper - probably with a shorter name - as @pdeffebach proposed as it cannot be handled using the options above when you pass multiple columns in source => fun => destination
)
proposed as it cannot be handled using the options above when you pass multiple columns in
source => fun => destination
)
It can! I wrote a PR for that in Missings that does this.
I know this PR, but it does not do what we need in DataFrames.jl:
get three things:
- a list of vectors as positional arguments
- a NamedTuple of vectors
- an AbstractDataFrame
and that solution does not handle them all unfortunately (otherwise this is exactly what we need :+1: )
What we could do is:
- add
NamedTuple
of vectors support in Missings.jl - @pdeffebach => would you agree to add it there - release Missings.jl
- add
AbstractDataFrame
support in DataFrames.jl - as this would not be a type piracy => @pdeffebach => also you are then welcome to add this method (it should be just aview
of this data frame removing rows with missings)
OK, so given also a short discussion with @pdeffebach on Slack:
- let us add methods to
skipmissings
and make a release of Missings.jl - I understand that @pdeffebach would start working on this. It would be good to do it relatively quickly, so that we can haveskipmissings
added also to DataFrames.jl in 0.22 release. - Independent of this a
SkipMissing
wrapper of the formSkipMissing(source) => fun => destination
can be added that will do the following:- take
source
columns - apply
skipmissings
to them - perform computation and pseudo broadcasting into length of the input after performing
skipmissings
- finally map the result of step 3 into original length of
source
leavingmissing
values in rows that were skipped byskipmissings
- take
The SkipMissing
wrapper would be supported only in select
and transform
and if GroupedDataFrame
was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).
Sounds like a good idea for DataFrames and probably for Tables.jl in general.
For DataFramesMeta, here's a simpler proposal than the one I made above, and which should be easy to implement using DataFrames functions. @transform
/@select
/@combine
would be either row-wise or column-wise, with some syntax like ByRow
/ByCol
, @byrow
/@bycol
or byrow=true
/bycol=true
to switch between these (not sure which one should be the default, that's a separate discussion). In some cases this is a bit annoying as you need to use broadcasting explicitly when operating column-wise, e.g. in @filter(df, x .> mean(x))
or @transform(df, y = x .- mean(x))
but maybe it's not the end of the world as these are not the most common operations.
Missing values would be propagated or skipped by default. For row-wise operation, (the equivalent of) passmissing=true
would be the default (and we could throw an error if ismissing
is used to help users). For column-wise operation, skipmissing=true
would be the default; additionally for @select
/@transform
, if the function returns a vector (of the same length as its input), the passmissing=true
behavior would be used. This will allow "doing the right thing" for both y = mean(x)
and y = x .- mean(x)
. So in practice ByRow(f)
and x -> f.(x)
would be equivalent, which is an important property. One would be able to pass passmissing
and skipmissing
manually as arguments to override the default. This would be needed notably when recoding missing values, using ismissing
or things like replace(x, missing => 0)
(which is actually the main drawback of propagating missing values).
Does that sound reasonable?
Independent of this a
SkipMissing
wrapper of the formSkipMissing(source) => fun => destination
can be added that will do the following:1. take `source` columns 2. apply `skipmissings` to them 3. perform computation and pseudo broadcasting into length of the input after performing `skipmissings` 4. finally map the result of step 3 into original length of `source` leaving `missing` values in rows that were skipped by `skipmissings`
@bkamins What you describe sounds useful in general, and in particular for my last DataFramesMeta proposal (though note I suggest filling all rows and not just non-missing rows when the function returns a scalar). But I think we should find another name for this, as I would expect SkipMissing(f)
to be equivalent to f ∘ skipmissing
. So this is unlikely to be accepted in Base. SkipAndFillMissing
would be accurate but a bit verbose. SkipFillMissing
? Or just FillMissing
?
The
SkipMissing
wrapper would be supported only inselect
andtransform
and ifGroupedDataFrame
was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).
The fast path only supports single-row reductions anyway, right? So that doesn't apply here.
@nalimilan - thank you for the comments.
I would expect
SkipMissing(f)
to be equivalent tof ∘ skipmissing
The point is that @pdeffebach proposed to have SkipMissing
not on fun
but on source
, so you would write SkipMissing([:x1, :x2]) => cor => :cor
for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)
I suggest filling all rows and not just non-missing rows when the function returns a scalar
I was afraid that it would be confusing that returning a vector and returning a scalar behave differently. But maybe this is OK (on purpose I excluded combine
from the list as in that case we would hit many more problems, with select
and transform
things are easier).
I think we need to decide here which operation is more useful: your proposal or my proposal. I feel that my proposal gives a more consistent design, but maybe the usefulness of your proposal outweighs this shortcoming (but let us deeply discuss this, not to go back to it later - like we have recurring discussions if select
should be by row or by whole column by default - as when we make a decision it will be fixed for a long time - 😄).
In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?
The fast path only supports single-row reductions anyway, right? So that doesn't apply here.
If we accept your proposal (i.e. fill scalars everywhere) then there is no difference. If we accept my proposal (i.e. broadcast scalar only to non-missing slots) then there is a difference between:
select(df, :x , :y => sum ∘ skipmissing)
and
select(df, :x , SkipMissing(:y) => sum)
The point is that @pdeffebach proposed to have
SkipMissing
not onfun
but onsource
, so you would writeSkipMissing([:x1, :x2]) => cor => :cor
for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)
OK. The problem is, SkipMissing(cols) => fun
already has a meaning, i.e. apply fun
to columns in vector cols
, excluding missing values if it contains some. Granted, it's a quite unlikely usage, but still...
In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?
In DataFramesMeta, you would just pass passmissing=true
explicitly. In DataFrames, you would use skipmissing
or skipmissings
manually instead of using SkipMissing(cols) => fun
/SkipFillMissing(cols) => fun
/whatever. In both, you could also return fill(value, length(x))
instead of just value
.
I agree this is an important point to settle. I suspect that in practice when doing reductions rows with missing values should get the same value as others. This is the case for example if you want to add a variable containing group means. That's what Stata's egen
does AFAICT (right @matthieugomez?). Can we find examples where we would like the opposite behavior (i.e. rows with missings are filled with missing
instead of with the returned scalar)?
SkipMissing(cols)
Ah - I have forgotten that it is in Base, but it is just unexported. Then yes - we should not change the meaning of this name.
In both, you could also return
fill(value, length(x))
instead of just value.
Fair point, except that in ByRow
you do not have an access to length(x)
, especially in GroupDataFrame
context. (but you can get it using other means anyway so it would be OK)
Can we find examples where we would like the opposite behavior
Thinking of it - you are probably right. As later if you use this reduction with missing
in the original column it will become missing
anyway.
The only reduction I can think of is:
transform(df, SkipMissing(:x) => x -> true)
where we get true
indicator for non-missing, but of course it does not make much sense to do such a transformation.
In Stata, mean of a variables creates a variable equal to the mean on non-missing rows, and missing values otherwise. Edit: no sorry it creases the mean for every rows.
I have also realized that, one issue with passmissing for functions that return vectors (as proposed by @nalimilan), is that lag will not work correctly.
is that lag will not work correctly.
can you please give an example what you exactly mean?
@matthieugomez I was trying to write a post comparing
gen y = mean(x)
gey y = mean(x) if !missing(x)
But I don't have a Stata installation at the moment. Can you confirm when the mean
is different?
I think the proposed behavior is fairly close to idiomatic stata. Enough that you can translate one-for-one, but I want to make sure .
@nalimilan Yes in the second case the mean is only given on rows that are not missing.
@bkamins sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]
sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]
That's fine, though. lag
doesn't error with missing
values so there is no need to do it (assuming we don't make DropMissing
the default, which I don't think we should yet).
lag
s are hard anyways, you have to remember grouping etc.
Yes, lag
/lead
/diff
are another case that would require opting-out explicitly of missing values propagation (if we make it the default). If we think this kind of situation (in addition to ismissing
, etc.) is too common or too confusing, we could avoid it by default and require people to use skipmissing=true
/ passmissing=true
explicitly. Anyway we could first try this system and later see whether it should be the default.
But for simplicity I think it would be better to have a single argument called skipmissing
that would do both, since we concluded that there are no major use cases in which passmissing
and skipmissing
would need to be separate. For row-wise operation, skipmissing
somewhat makes sense as a way to say "only apply the operation to non-missing rows (and fill others with missing
)". It's a slight abuse but probably OK. For column-wise operation, skipmissing
would be exactly what happens to input columns, and the passmissing
behavior would just be a consequence of that: if the function returns a single scalar, we can fill all rows, including missing ones (passmissing=false
in our terminology), but if it returns a vector of the same length as its input, we can only fill non-missing rows (passmissing=true
in our terminology).
(if we make it the default)
I understand you are talking about DataFramesMeta.jl here - right. In DataFrames.jl I do not think we will make it a default as it would be breaking.
But for simplicity I think it would be better to have a single argument called
Again - I understand you mean it for DataFramesMeta.jl - right? As for data frames, we want to have a wrapper around column selector in source => fun => destination
(we need a name as we cannot use SkipMissing
).
Regarding the discussion #2484.
I think that where
should have the following specification (I leave out some sanity checks but give a simplified idea of the impementation):
-
where(df::DataFrame, args...) = df[isequal.(.&(eachcol(select(df, args..., copycols=false))), true), :]
-
where(gdf::GroupedDataFrame, args...) = parent(df)[isequal.(.&(eachcol(select(gdf, args..., copycols=false, keepkeys=false))), true), :]
An element of args
can only be of the form cols => fun
and we transform it to cols => fun => x_i
(to make sure we have unique output column names).
Regarding the comment by @nalimilan:
One potential issue is that having
where
skipmissing
values but other functions likecombine
,select
andtransform
not skip them
I do not propose to skip missing values. What I propose is to use isequal
not ==
for testing (and this is a logically valid rule). Actually in my codes I normally use isequal
because of this for logical conditions as a rule.
We could add a errror_on_missing
(or something similar) kwarg to where
to choose if we use isequal
or ==
to do the test (but I would still make it default to false
, i.e. use isequal
by default)
You need gdf
above, as transformations are on the group level when given a grouped data frame above.
I think we can use isequal
without addressing transform
and select
. We aren't skipping missing during the transformation, just when we choose which rows to keep.
I don't think we are ready to address transform
and select
yet. There hasn't been consensus on behavior. better to implement something in Missings.jl here and give users a chance to see if they like it.
You need
gdf
above
fixed, it was a typo
+1. It would also be nice to have a where!
version, as well as a view
kwarg for where
.
Last thing is the name. I like where
. The only issue is that it is used
for special syntax in Julia. Is that a problem? In particular, I think it
creates some issues with using Lazy’s macro @>
, but maybe it can be
fixed.
On Mon, Oct 19, 2020 at 6:27 AM Bogumił Kamiński [email protected] wrote:
You need gdf above
fixed, it was a typo
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712156073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXPIIASGWALSXHZIEZ3SLQ5FXANCNFSM4OTTLDUA .
It would also be nice to have a
where!
version, as well as aview
kwarg forwhere
.
Sure - we would mirror that - thank you for raising this, as it is easy to forget.
I like
where
I think if we like it (and I guess we do not have a better name) then we should try to make sure that the "ecosystem" works with it. For sure this is not a problem for DataFramesMeta.jl. @pdeffebach - have you experimented with this name in DataFramesMeta.jl.
where
works with linq
, but the Lazy pipings all work with @where
so there hasn't been a conflict.
I do not propose to skip missing values. What I propose is to use
isequal
not==
for testing (and this is a logically valid rule). Actually in my codes I normally useisequal
because of this for logical conditions as a rule.We could add a
errror_on_missing
(or something similar) kwarg towhere
to choose if we useisequal
or==
to do the test (but I would still make it default tofalse
, i.e. useisequal
by default)
Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true)
is really skipping rows for which x
is missing
. And I would be inclined to call the argument skipmissing=true
instead of errror_on_missing=false
(that would be more standard given our terminology).
TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where
shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine
/select
/transform
should. Of course it's much simpler and obvious what to do to handle missing values with where
, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine
/select
/transform
in the presence of missing values.
Is not it what dplyr does already (ie skip missing in filter but not in mutate)?
On Mon, Oct 19, 2020 at 1:13 PM Milan Bouchet-Valat < [email protected]> wrote:
I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.
We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)
Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true) is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).
TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712415696, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNZOOAMIVCDP2VM3TLSLSMX3ANCNFSM4OTTLDUA .
Well I'd say you're playing with words. :-)
The point is that we are not skipping missing in the sense how e.g. select
would skip them. As for select
skipmissing
means REMOVE FROM THE COLLECTION or IGNORE THEM while in where
it is TREAT THEM AS FALSE. And this is the point of my reasoning.
This is a different operation that we perform (skipping missings in the sense of select
in where
would make no sense).
skipmissing=true
For the above reason I have not proposed skipmissing
kwarg but some other name (of course I am not attached to the proposed one as it is not nice obviously).
My point is that the decision what to do in select
and how to implement where
are logically orthogonal although they have very similar end effect.
Regarding where
for GroupedDataFrame
we have two options:
- it would produce rows in the order of the
parent
(i.e. it would internally rely onselect
; in this scenario subsetting groups inGroupedDataFrame
is not allowed) - it would produce rows in the order of groups (i.e. it would internally rely on
combine
; in this scenario subsetting groups inGroupedDataFrame
is allowed)
Which one do we prefer?
- As it's what I've already implemented in DataFramesMeta. But in general i think its good to not re-order things too much, and this is way closer to the mental model of
select
thenfilter
.