DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Skipping missing values more easily

Open nalimilan opened this issue 3 years ago • 83 comments

It seems that dealing with missing values is one of the most painful issues we have, which goes against the very powerful and convenient DataFrames API. Having to write things like filter(:col => x -> coalesce(x > 1, false), df) or combine(gd, :col => (x -> sum(skipmissing(x))) isn't ideal. One proposal to alleviate this is https://github.com/JuliaData/DataFrames.jl/issues/2258: add a skipmissing argument to functions like filter, select, transform and combine to unify the way one can skip missing values, instead of having to use different syntaxes which are hard to grasp for newcomers and make the code more complex to read.

That would be one step towards being more user-friendly, but one would still have to repeat skipmissing=true all the time when dealing with missing values. I figured two solutions could be considered to improve this:

  • Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.
  • Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.

Somewhat similar discussions have happened a long time ago (but at the array rather than the data frame level) at https://github.com/JuliaStats/DataArrays.jl/issues/39. I think it's fair to say that we know have enough experience now to make a decision. One argument against implementing this at the DataFrame level is that it will have no effect on operations applied directly to column vectors, like sum(df.col). But that's better than nothing.

Cc: @bkamins, @matthieugomez, @pdeffebach, @mkborregaard

nalimilan avatar Jul 07 '20 20:07 nalimilan

  • Have DataFramesMeta simplify the handling of missing values. This could be via a dedicated e.g. @linqskipmissing macro or a statement like skipmissing within a @linq block that would automatically pass skipmissing=true to all subsequent operations in a chain or block. This wouldn't really help with operations outside such blocks though.

This is a good idea, however I also like skipmissing = true at the level of a transform call or even at the level of argument because it's explicit.

Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like

@transform(df, @sm y = :x1 .+ mean(:x2))

Where @sm macro, does the necessary transformation described in #2258

function wrapper(fun, x, y)
    sx, sy = Missings.skipmissings(x, y) # need to be on Missings master
    sub_out = fun(sx, sy)
    full_out = Vector{Union{eltype(sub_out), Missing}}(missing, length(x))
    full_out[eachindex(sx)] .= sub_out # eachindex(sx) returns indices of complete cases

    return full_out
end
  • Have a field in DataFrame objects that would store the default value to use for the skipmissing argument. By default it would be false, so that you get the current behavior, which ensures safety via propagation of missing values. But when you know you are working with a data set with missing values, you would be able to call skipmissing!(df, true) once and then avoid repeating it.

I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.

pdeffebach avatar Jul 09 '20 16:07 pdeffebach

It's great you're thinking about how to make working with missing values easier! I 100% agree.

A macro may be good.

I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?

A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.

matthieugomez avatar Jul 09 '20 20:07 matthieugomez

To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?

I've long been frustrated with the difficulty of working with missing given that it infects all downstream operations and wished it worked differently in Base. (I acknowledge my frustration could be misguided — perhaps it's saved me from some horrible bugs without knowing it :-) )

c42f avatar Jul 10 '20 02:07 c42f

Perhaps DataFramesMeta could provide a block-level skipmissing option as well as a macro like

@transform(df, @sm y = :x1 .+ mean(:x2))

Where @sm macro, does the necessary transformation described in #2258

@pdeffebach Yes that was more or less what I had in mind. Though if you have to repeat this for each call it's not much better than passing skipmissing=true. Being able to apply it to a chain of operations would already be more useful.

I'm not a fan of this idea since it's behavior depending on a global state that could be set a long ways from where the transform call is. It seems like it could cause debugging to be a pain.

@pdeffebach Yes. OTOH it's not so different from e.g. creating a column: if you get an error because it doesn't exist or it's incorrect you have to find where it's been defined, which can be quite far from where the error happens.

I'm not sure I'm convinced by an option at the level of the dataframe. It sounds complicated to keep track of it. For instance, won't people be confused that stuff like df = merge(df, df_using) does not retain the option?

A third possibility would be to allow users to change the default option for transform, etc, say by writing SKIPMISSING = true at the start of a script.

@matthieugomez Yes, losing the option after transformations could be annoying. Though it could be propagated across some operations which always preserve missing values: in these cases there's little point in forcing you to repeat that you know there are missing values. Where safety checks matter is when you could believe you got rid of missing values and for some reason it's not the case. But I admit that this option would decrease safety (even if not propagated automatically) as you could pass a data frame to a function which isn't prepared to handle missing values and it would silently skip them.

A global setting would only aggravate these issues IMHO since it would affect completely unrelated operations, possibly in packages, some of which may rely on the implicit skipmissing=false.

To throw in a much more crazy idea: could we use contextual dispatch to override the behavior of all reductions and comparisons within a whole expression so that they are "missing-permissive"?

@c42f My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:

using Cassette, Missings

Cassette.@context PassMissingCtx

Cassette.overdub(ctx::PassMissingCtx, f, args...) = passmissing(f)(args...)

# do not lift special functions which already handle missing
for f in (:ismissing, :(==), :isequal, :(===), :&, :|, :⊻)
    @eval begin
        Cassette.overdub(ctx::PassMissingCtx, ::typeof($f), args...) = $f(args...)
    end
end

f(x) = x > 0 ? log(x) : -Inf

julia> Cassette.@overdub PassMissingCtx() f(missing)
missing

julia> Cassette.@overdub PassMissingCtx() f(1)
0.0

julia> Cassette.@overdub PassMissingCtx() ismissing(missing)
true

julia> Cassette.@overdub PassMissingCtx() missing | true
true

Skipping missing values in reductions will be a little harder, but it's doable if we only want to handle a known list of functions. For example this quick hack works:

Cassette.overdub(ctx::PassMissingCtx, ::typeof(sum), x) = sum(skipmissing(x))

julia> x = [1, missing];

julia> Cassette.@overdub PassMissingCtx() sum(x)
1

Maybe this kind of thing could be made simpler to use by providing a macro like @passmissing as a shorthand for Cassette.@overdub PassMissingCtx(). But it's important to measure all the implications of this approach: since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects. For example, Cassette.@overdub PassMissingCtx() [missing] gives missing, which will break lots of package code. A safer approach would be to only apply passmissing to a whitelist of functions for which it makes sense (scalar functions, mainly), like DataValues does -- with the drawback that the list is kind of arbitrary.

In the end, maybe Cassette is too powerful for what we actually need in the context of DataFrames. In practice with select/transform/combine it makes sense to only apply passmissing to the top-level functions, rather than recursively. And for reductions, passing views of complete rows as proposed at https://github.com/JuliaData/DataFrames.jl/issues/2258 should be enough, when combined with a convenient DataFramesMeta syntax.

nalimilan avatar Jul 10 '20 09:07 nalimilan

My suggestion about DataFramesMeta is a kind of limited way to change the behavior of a whole block. Using Cassette.jl (which I guess is what you mean by "contextual dispatch"?) would indeed be a more general approach and it has been mentioned several times. Actually I just tried that and it turns out to be very simple to make it propagate missing values:

Very cool. IIUC people have started to use other libraries like IRTools which plug into the compiler in a similar way to Cassette but are not Cassette itself, but yes, that's roughly what I had in mind. One downside is that it's a pretty heavy weight tool to deploy for something like missing.

since it will affect all function calls deep in the code that you didn't write yourself, it will have lots of unintended side effects

I think this is the bigger problem; it's not clear how deep to recurse and it could definitely have unintended consequences. It's a very similar problem with floating point rounding modes, where having the rounding mode as dynamic state really doesn't work well as it infects other calculations which weren't programmed to deal with a different rounding mode.

So I agree it does seem much safer and more sensible to have a macro which lowers only the syntax within the immediate expression to be more permissive with missing. Actually I wonder whether you could do something more like broadcast lowering where all function calls within the expression are lifted in a certain way such that dispatch can be customized, so, eg, @linq f(x,y,z) becomes linq_missing(f, x, y, z).

In terms of your examples,

@linq filter(:col =>(x -> x > 1), df)
# means
linq_missing(filter, :col => (x -> linq_missing(>, x, 1)), df)

@linq combine(gd, :col => (x -> sum(x))
# means
linq_missing(combine, gd, :col => (x -> linq_missing(sum, x)))

Something like this has very regular lowering rules and gives a measure of extensibility for user defined functions.

c42f avatar Jul 11 '20 12:07 c42f

One alternative is that filter/transform/combine could always skip missing (i.e. kwarg skipmissing = true is the default).

matthieugomez avatar Aug 02 '20 18:08 matthieugomez

Can you please help me understanding this proposal in detail? What should be the outcome of the following operations if we added skipmissing option (now I am showing what we get now and would like to understand what we should get if we add skipmissing=true):

julia> df = DataFrame(x=[1,missing,2])
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 2       │

julia> select(df, :x)
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 2       │

julia> select(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

julia> select(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

julia> combine(df, :x)
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 2       │

julia> combine(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

julia> combine(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ missing    │
│ 3   │ 2          │

bkamins avatar Aug 06 '20 16:08 bkamins

AFAICT this question regards #2258, better keep matters separate as discussions are already tricky enough.

nalimilan avatar Aug 06 '20 17:08 nalimilan

With select, skipmissing would apply the functions on row for which none of the argument is missing, return the output on these rows, and missing otherwise (to maintain the same number of rows as the original dataframe). So skipmissing = true would not change anything in your example. Similarly for transform.

With combine, skipmissing would apply the functions on rows for which none of the argument is missing.

julia> combine(df, :x)
3×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │

julia> combine(df, :x => identity)
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ 2          │

julia> combine(df, :x => ByRow(identity))
3×1 DataFrame
│ Row │ x_identity │
│     │ Int64?     │
├─────┼────────────┤
│ 1   │ 1          │
│ 2   │ 2          │

That being said, I am not sure why combine(df, :x) or combine(df, :x => ByRow(.)) are accepted and it sounds that maybe it would be clearer to disallow them?

matthieugomez avatar Aug 06 '20 17:08 matthieugomez

Ah - sorry - I thought this issue "overriden" that one. So actually this issue should be on hold till we resolve #2258, because it only asks to make #2258 simpler - right?

bkamins avatar Aug 06 '20 17:08 bkamins

Could we think about making filter/transform/combine/select have the kwarg skipmissing = true by default? It's not unheard of — that's what Stata and Panda (for combine) do. Also, data.table and dplyr automatically skip missing in their versions of filter.

matthieugomez avatar Aug 22 '20 16:08 matthieugomez

I would not have a problem with this. We already have it in groupby so it would be consistent. So I assume you want:

  1. for filter wrap a predicate p in x -> coalesce(p(x), false)
  2. for transform/combine/select we have two decisions:
    • do we pass views or skipmissing wrappers? (given the second question probably views, it is going to be a bit slow, but I think skipmissing=true does not have to be super fast as it is a convenience wrapper)
    • what do we do if multiple columns are passed - skip rows that have at least one missing? (this is what groupby does)

bkamins avatar Aug 22 '20 19:08 bkamins

Regarding 2, note that as discussed at https://github.com/JuliaData/DataFrames.jl/issues/2258 we have to pass views for select and transform as we need to be able to reassign values to the non-missing rows in the input. When multiple columns are passed, we should only keep complete observations, otherwise something like [:x, :y] => + wouldn't work due to different lengths.

nalimilan avatar Aug 22 '20 20:08 nalimilan

I should have noted that before thinking about making it the default, we should first implement this keyword argument and see how it goes.

nalimilan avatar Aug 22 '20 20:08 nalimilan

@nalimilan Yes starting with a kwarg and see how it goes is the right way. The only reason I was mentioning the default value is that 1.0 may mean that this kind of stuff won't be able to change later on — I hope it's not the case.

@bkamins I agree with 1 and 2.1 (views). I think for the case of [:x, :y] => + it should skip missing on both (as @nalimilan points out), but for [:x, :y] .=> mean, or for :x => mean, :y => mean, it should skipmissing on x and y separately.

matthieugomez avatar Aug 22 '20 21:08 matthieugomez

Yes - for [:x, :y] .=> mean this is separate (which means that transform/select will throw an error in cases when there is no match and a vector is returned).

OK - so it seems we have a consensus here. I will propose a PR.

bkamins avatar Aug 22 '20 22:08 bkamins

Just a maintenance question - when we add skipmissing kwarg should both #2314 and #2258 be closed or something more should be done/kept track of?

bkamins avatar Aug 22 '20 22:08 bkamins

That would be awesome, thanks! Jus to make sure I understand, what do you mean by `transform/select will throw an error in cases when there is no match and a vector is returned).'?

If this happens, https://github.com/JuliaData/DataFrames.jl/issues/2258 should definitely be closed, but this thread should be open if the default value is false, since it may still be cumbersome to add it at every command.

matthieugomez avatar Aug 22 '20 22:08 matthieugomez

transform/select will throw an error in cases when there is no match and a vector is returned

I mean tat this would still error:

julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a       │ b       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ missing │ missing │
│ 3   │ 2       │ missing │

julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.

(but I guess this is natural to do it this way)

this thread should be open if the default value is false

OK - we can keep it open then. For 1.0 release the default will be false to be non-breaking.

bkamins avatar Aug 22 '20 22:08 bkamins

That would be awesome, thanks!

Started thinking about it :). The trickiest part will be fast aggregation functions for GroupedDataFrame case (as usual), but also they should be doable.

bkamins avatar Aug 22 '20 22:08 bkamins

transform/select will throw an error in cases when there is no match and a vector is returned

I mean tat this would still error:

julia> df = DataFrame(a=[1,missing,2], b=[1,missing,missing])
3×2 DataFrame
│ Row │ a       │ b       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ missing │ missing │
│ 3   │ 2       │ missing │

julia> select(df, :a => collect∘skipmissing)
ERROR: ArgumentError: length 2 of vector returned from function #62 is different from number of rows 3 of the source data frame.

(but I guess this is natural to do it this way)

Just to clarify, I think the following should work:

select(df, :a => collect, skipmissing = true)

and actually also

select(df, :a => collect∘skipmissing , skipmissing = true)

since the function collect∘skipmissing is one to one on the set of values for which a is not missing. In both cases, it just returns the same thing as a.

However, the following should error

combine(df, :a => collect, :b => collect, skipmissing = true)

because the length of collect∘skipmissing(a) is different from the length of collect∘skipmissing(b)

matthieugomez avatar Aug 22 '20 22:08 matthieugomez

Just to clarify, I think the following should work:

select(df, :a => collect, skipmissing = true)

Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:

select(df, :a => collect , skipmissing = true)

to work but

select(df, :a => collect∘skipmissing)

will fail.

Which means to me that if we want to add such a kwarg it should not be called skipmissing as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select and transform as in combine you expect a different behaviour.

In particular:

select(df, :a => mean, skipmissing = true)

and

select(df, :a => mean∘skipmissing)

would both work but produce different results.

bkamins avatar Aug 23 '20 07:08 bkamins

My proposal above about "spreading" missing values seems relevant here.

select(df, :a => collect, skipmissing = true)

This takes :a, applies skipmissing, collects the result, then loops through indices of :a and, when not missing`, fills it in.

select(df, :a => collect∘skipmissing , skipmissing = true)

This takes :a, applies skipmissing, then applies it again, collects the result, and loops through indices of :a as before.

select(df, :a => mean, skipmissing = true)

Takes :a, applies skipmissing, takes the mean, then loops through and fills in indices of :a where indices of :a are not missing. Perhaps we can make an exception for scalar values? If a scalar is returned, it gets spread across the entire vector, regardless of missing values? That way it would match

select(df, :a => mean∘skipmissing; skipmissing = false)

EDIT: After playing around with R, now I'm not so sure. I think the biggest annoyance is with filter currently and we should start with that since everyone agrees on that behavior and we are confident it won't result in unpredictable / inconsistent behavior.

pdeffebach avatar Aug 23 '20 12:08 pdeffebach

Just to clarify, I think the following should work: select(df, :a => collect, skipmissing = true)

Thank you for commenting on this, as (sorry if I have forgotten some earlier discussions) you want to:

select(df, :a => collect , skipmissing = true)

to work but

select(df, :a => collect∘skipmissing)

will fail.

Which means to me that if we want to add such a kwarg it should not be called skipmissing as I am afraid it would lead to a confusion. Also I would even say that this kwarg should be reserved for select and transform as in combine you expect a different behaviour.

In particular:

select(df, :a => mean, skipmissing = true)

and

select(df, :a => mean∘skipmissing)

would both work but produce different results.

Exactly. Even though I don’t like different keyword argument, I see the potential confusion for select/transform.

matthieugomez avatar Aug 23 '20 16:08 matthieugomez

And I see the uses of both functionalities :smile:, simply they should have a different name. Actually we can have both, i.e. skipmissing doing "hard" skipmissing everywhere and something else which passes missings, and the name could be passmissing as passmissing in Missing.jl does exactly this kind of thing.

bkamins avatar Aug 23 '20 18:08 bkamins

Yes, passmissing would be a more accurate name. OTOH there would be reasons to use skipmissing:

  • Using the same name everywhere makes it easier to remember (as noted above in the discussion)
  • In combine, since there's no correspondence between input and output rows, we cannot fill non-missing entries, so the only behavior we can implement is skipmissing, not passmissing -- except for the special-case of ByRow which isn't very useful with combine. Using consistent names between combine and select/transform sounds like a good thing, even if the behavior differs a bit.

Overall I'm not sure which choice is best. As suggested by @pdeffebach, we could have skipmissing with a special case for functions that return a single scalar, in which case we would fill even rows with only missing values: this is justified because there's no correspondence between input and output rows in that case (like with combine). We could also just throw an error telling to use mean∘skipmissing for now.

nalimilan avatar Aug 23 '20 20:08 nalimilan

I think I've found a possible solution. We could add a passmissing=true argument to select and transform (as described above). The default to true would be fine for 99% of use cases and would probably not break a lot of code (if any). It would also be consistent with the approach Julia takes regarding missing values: they either propagate or throw an error, but are never skipped silently by default.

Then we would also add a skipmissing=false argument to combine and filter, consistent with groupby. People would have to pass skipmissing=true or use skipmissing(...) manually. That's a bit annoying, but it's safe as missing values will never be dropped silently even if they were propagated silently due to the passmissing=true default. Since passmissing=true by default, users won't have to use that argument often (or never), so they won't be confused by the existence of both passmissing and skipmissing.

How does that sound? The main drawback of this approach is the inconsistency between select/transform and combine, but that's probably OK as these are quite different operations anyway.

nalimilan avatar Aug 25 '20 08:08 nalimilan

That’s interesting.

I was leaning on the opposite solution: skipmissing = true default for filter and combine, passmissing = false default for transform.

Skipmissing = true for filter is consistent with other languages (at least stata/dplyr/data.table), and that is what people want in 100% of cases.

Skipmissing = true for combine is consistent with Stata, panda. It is also very easy to do comprehend.

In contrast, passmissing = true in transform/select would be unique to Julia. It means that stuff like :x => ismissing will give missing instead of true for missing values. I still think it is worth having it but it may be harder to grasp conceptually.

So, in conclusion, my preferred solution would be skipmissing = true for combine and filter, and passmissing = true for select/transform. I get your point about compatibility with the spirit of missing in Base but I don’t see it as that important.

matthieugomez avatar Aug 25 '20 13:08 matthieugomez

Sorry what is passmissing? the concept of passmissing only makes sense to me in ByRow where you could conceivably have missing as an input instead of a vector.

pdeffebach avatar Aug 25 '20 13:08 pdeffebach

@pdeffebach I was using for the behavior I was describing for select/transform, whether it is actually called passmissing or skipmissing.

matthieugomez avatar Aug 25 '20 13:08 matthieugomez

Skipmissing = true for combine is consistent with Stata, panda. It is also very easy to do comprehend.

I'm not sure that's true. The problem Milan is describing is

combine(groupby(df, :a), :x => t -> first, :y => t -> first, skipmissing = true)

In Stata you would just get missings if the first observation is missing. But you would be confident that this was truly the first observation for every group.

Under your proposal, the functions would expand to be

combine(groupby(df, :a), :x => t -> first(skipmissing(t)), :y => t -> first(skipmissing(t))

If :x has missing as the first element, then you are going to get the second element of :x and the first element of :y. This will definitely lead to bugs.

pdeffebach avatar Aug 25 '20 14:08 pdeffebach

That’s right, I did not think about that.

Note that, the way I understand it, first in select, with passmissing = true, would suffer for the same problem (ie it would give the first non missing).

Can you think of other examples that may give surprising results in combine or select/transform? It would be nice to do a list (beyond first and ismissing).

On Tue, Aug 25, 2020 at 7:03 AM pdeffebach [email protected] wrote:

Skipmissing = true for combine is consistent with Stata, panda. It is also very easy to do comprehend.

I'm not sure that's true. The problem Milan is describing is

combine(groupby(df, :a), :x => t -> first, :y => t -> first, skipmissing = true)

In Stata you would just get missings if the first observation is missing. But you would be confident that this was truly the first observation for every group.

Under your proposal, the functions would expand to be

combine(groupby(df, :a), :x => t -> first(skipmissing(t)), :y => t -> first(skipmissing(t))

If :x has missing as the first element, then you are going to get the second element of :x and the first element of :y. This will definitely lead to bugs.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-680044858, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNJZKXBPXFC4LORTWDSCPAB7ANCNFSM4OTTLDUA .

matthieugomez avatar Aug 25 '20 14:08 matthieugomez

Whether skipmissing=true take into account missings in all columns or only to those passed to each function is an interesting design decision to make. But AFAICT it's orthogonal to whether we should use passmissing=true and skipmissing=false by default or not. ~And it doesn't concern passmissing=true, which would not apply to combine~ EDIT: it does if we allow broadcasting a scalar result to all non-missing rows.

The main question for choosing the defaults is whether we favor safety or convenience. I'm always torn between these two goals, which is why I tried to proposed in the OP various solutions that don't require silently ignoring missing values by default.

Sorry what is passmissing? the concept of passmissing only makes sense to me in ByRow where you could conceivably have missing as an input instead of a vector.

@pdeffebach It also works for any function passed to select and transform: we just pass a view of non-missing rows, and copy the resulting vector to the non-missing rows, filling with missing other entries. This is what you proposed above.

nalimilan avatar Aug 25 '20 15:08 nalimilan

I think @pdeffebach was just disagreeing with my claim that combine with skipmissing = true is what Stata does (while mean ignores missing values, first gets the first element of the column, whether it’s missing or not)

matthieugomez avatar Aug 25 '20 15:08 matthieugomez

We could also define the kwarg maskmissing, and use it for filter/combine/transform, rather than using a combination of passmissing/skipmissing kwarg.

The definition of maskmissing is that it applies the operation on rows where columns are non missing. This explanation is consistent with what we want for filter/combine/transform, and so this would allow to get the same kwarg for all of these functions. (of course, for combine, it is exactly the same as skipmissing).

matthieugomez avatar Aug 25 '20 16:08 matthieugomez

The thing is, people don't really complain about na.rm = TRUE though, right? I feel like most of the problems we see about missing are really easily fixed by pointing people to skipmissing and passmissing.

combine(gd, :col => (x -> sum(skipmissing(x)))

Is really ugly, yes, but in the future you will be able to write

using DataFramesMeta
@combine(gd, y = sum(skipmissing(col))

which is not that bad.

pdeffebach avatar Aug 25 '20 16:08 pdeffebach

For me the problem is that for now it requires different strategies for filter (which requires coalesce), combine (which requires skipmissing). Another issue with combine is the whole thing about reduction other several vectors with different missing rows such as weighted mean, correlation etc )

Finally, transform requires sometimes passmissing, sometimes skipmissing, coalesce for ifelse calls, something for tryparse, etc. Also, there’s no good way currently to handle functions with vector input and vector output such as cumsum.

.

matthieugomez avatar Aug 25 '20 16:08 matthieugomez

That's fair. It would be hard to tackle each of these individually,

  1. filter should apply coalesce by default.
  2. skipmissing = true should apply passmissing to ByRow functions
  3. We need a lazy broadcasted ifelse that skips bad elements
  4. Better tryparse or a whole destring package which would take a lot of work

Then there’s the whole thing about reduction other several vectors with different missing rows such as weighted mean, correlation etc

Also an issue, skipmissings helps weighted mean, but unfortunately cor didn't make it into Statistics.

Okay that's fair. I'm on board with a keyword argument with a variety of behaviors depending on the context.

pdeffebach avatar Aug 25 '20 16:08 pdeffebach

I will take a defensive stance here.

Before I state what I think I will explain the principles of my judgement:

  1. DataFrames.jl should provide primitives that are flexible enough to perform what user needs in a performant way
  2. Other packages (like DataFramesMeta.jl or DataFramesMacros.jl) can add convenience wrappers; this does not mean that we do not want to be convenient in DataFrames.jl, but this is not a priority
  3. We take "safety first" and "consistence with Julia Base" policy
  4. We do not want to be breaking unless it is really justified

Given this:

  1. filter should apply coalesce by default.

I think we can add skipmissing kwarg with false as default. This is what Base does. If we can convince Julia Base to change the behaviour here then we will sync with that behaviour.

  1. skipmissing = true should apply passmissing to ByRow functions

Do you mean it for select/transform or also for combine? (also see my comment below)

  1. We need a lazy broadcasted ifelse that skips bad elements
  2. Better tryparse or a whole destring package which would take a lot of work

Agreed, but it is out of scope of DataFrames.jl. I would also add x -> coalesce(x, false) to the list with some convenient name, I think coalescefalse is explicit but maybe you judge it to be too long. I would add this to Missings.jl.

but unfortunately cor didn't make it into Statistics.

I think it is an open issue how to do it rather than closed issue as there are many ways to compute cor on a matrix with missings.

(x -> sum(skipmissing(x)))

this is normally written as sum∘skipmissing and it is not that bad I think.


Now regarding the design the more I think about it the more I want do the following that gives a fine grained control of what is going on in the function in a single transformation:

For skipmissing

would not be a kwarg but it would be an extension to the minilanguage we have for select, transform and combine. If in the form:

source => fun => destination

if you wrap source in skipmissing then fun in combine, select and transform gets views of passed columns with rows containing missings skipped.

In general the idea is that source => fun∘skipmissing is the same as skipmissing(source) => fun => destination if source is a single column, but we add this extension to allow easy handling of skipping rows in multiple-column scenarios

For filter we do not add anything but just write users to use coalescefalse∘fun if we agree to add such a function to Missings.jl.

For passmissing

It is not needed. Just tell users to write passmissing(their_function) - this is exactly what you want in ByRow and if I understand things correctly this is a main use case - is this correct?

bkamins avatar Aug 25 '20 18:08 bkamins

I think this is a good proposal.

We might not even need skipmissing in the meta-language. We can add a skipmissingsfun wrapper similar to passmissing that wraps all <:AbstractVector inputs in skipmissings. Though since it applies to the source I think a SkipMissing is more logical.

pdeffebach avatar Aug 25 '20 20:08 pdeffebach

@bkamins I am trying to understand your proposal:

  • What are you proposing for filter? It sounds like you mention two different solutions.
  • how would you handle transform with a function like cumsum?

matthieugomez avatar Aug 25 '20 20:08 matthieugomez

filter

My comment in the first part was just that I do not want skipmissing=true by default (no matter what syntax we apply).

My proposal is not to change filter at all but define coalescefalse(x) = coalesce(x, false) (the name can be discussed as this one is a bit long) and then just write:

filter(coalescefalse∘fun, df)

The beauty is that it would also work in Julia Base then which is a big benefit.

cumsum

I understand you want this:

julia> df = DataFrame(x=[1,2,missing,3])
4×1 DataFrame
│ Row │ x       │
│     │ Int64?  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │
│ 3   │ missing │
│ 4   │ 3       │

julia> transform(df, :x => (x -> cumsum(coalesce.(x, 0)) + x .* 0) => :y)
4×2 DataFrame
│ Row │ x       │ y       │
│     │ Int64?  │ Int64?  │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ 1       │
│ 2   │ 2       │ 3       │
│ 3   │ missing │ missing │
│ 4   │ 3       │ 6       │

Then I agree - I would not handle this in my proposal. I assume that passmissing is mostly for ByRow as noted above.

The point is that the use cases like cumsum I think are quite rare (maybe I am wrong here).


The general idea is to try using function composition and higher order functions rather than keyword arguments as this seems more flexible for the future.


We can add a skipmissingsfun wrapper similar to passmissing that wraps all <:AbstractVector inputs in skipmissings.

I was thinking about it, and initially judged that this is not possible as fun in source => fun => destination can get three things:

  • a list of vectors as positional arguments
  • a NamedTuple of vectors
  • an AbstractDataFrame

But now as I think about it actually it is not a problem as this higher order function could just handle these three cases separately in if-else blocks. It is better than my original proposal as it is more composable (select/transform/combine do not even need to know what it does - and this is good because it orthogonality in design makes it much easier to maintain in the long term) The question is what name could be used here. The issue is that skipmissing is defined in Julia Base and adding skipmissing(::Base.Callable) would be type piracy (though it is something that is invalid in Base currently - not sure what other name would be good though - maybe dropmissing?)

bkamins avatar Aug 25 '20 21:08 bkamins

@bkamins pointed me here, but unfortunately this thread is a bit longer than I've currently got time for. The reason I was pointed here was that I asked for comments on a proposal that touches on the topic of this thread: I am actually mostly in favour of a more cumbersome, force-the-user-to-think-carefully-and-be-explicit approach for the most part, my main gripe with missing is when I filter a DataFrame (i.e. use ==, >, <). In those cases, the additional verbosity required to get things to work is quite large, and to me at least the possibility of accidentally doing something unintended is low - I've never encountered a use case where in doing df.col .> 5 I wanted rows for which col is missing to be included in the result.

With these two considerations in mind, in my own code I started defining:

⋘(x, y) = (ismissing(x) | ismissing(y)) ? false : (x < y)
⋙(x, y) = (ismissing(x) | ismissing(y)) ? false : (x > y)
⩸(x, y) = (ismissing(x) | ismissing(y)) ? false : (x == y)

I clearly didn't spend much time choosing the opearators, I just wanted infix operators that look similar to but different from the regular comparison operators (one issue with my current selection is that their names are completely different: \verymuchless, \ggg and \equivDD aren't exactly easy to memorize as a group of operators...)

In any case this relatively simple case for me solves the main usability issue I encounter in day-to-day work.

nilshg avatar Aug 26 '20 12:08 nilshg

@bkamins you mentioned there might be a way to have ifelse, filter work with missing values in base. You really think so? Maybe that’s one way to go.

matthieugomez avatar Aug 26 '20 15:08 matthieugomez

Maybe that’s one way to go.

I meant that this issue could be opened in Julia Base and discussed there. I feel that if it would get support it would be 2.0 release the earliest. Therefore I think we should have a solution for now in DataFrames.jl that does not rely on Julia Base.

(still - if you feel it is worth discussing please open an issue on Julia Base; I just recommend to be patient - core devs tend to be conservative and require quite a lot of justification for changes)

EDIT Actually - given our discussions apart from ifeslse and filter the third big thing is getindex with Union{Missing, Bool}} eltype selector

bkamins avatar Aug 26 '20 16:08 bkamins

I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.

Adding skipmissing(f::Callable) to Base sounds doable. I'm less sure about filter and getindex skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.

The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x)), where as.character propagates missing values, but isna does not. R/dplyr also allows mutate(df, y = x - mean(x)) since - is vectorized, but we require .- in Julia -- which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).

Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x)). Then it seems to me that the most intuitive behavior is this:

  • all columns accessed as vectors (with $) are wrapped in a view of non-missing values by default (i.e. skipmissing=true)
  • if some columns are accessed row-wise (no $), set passmissing=true (since we are in a case equivalent to ByRow)
  • if a function applied to a column vector (accessed using $) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, set passmissing=true

AFAICT this would do "the right thing" in the following situations:

  • @transform(df, y = mean($x)): assign to all rows (including those with missing values) the mean of non-missing values
  • @transform(df, y = x - mean($x)): subtract to each non-missing row the mean of non-missing values
  • @transform(df, y = cumsum($x)): assign to rows with non-missing values the cumulative sum of values for these rows
  • @transform(df, y = s - cumsum($x)): subtract to each non-missing row the cumulative sum of values for these rows

Cases that wouldn't do "the right thing" by default are uses of ismissing (as missing values would always be skipped by default). We decide to be smart and set automatically passmissing=false or skipmissing=false when an expression contains ismissing, or just throw an error with an informative message. A more tricky case are logical operators which implement three-valued logic: with passmissing=true, true | missing would be assigned missing since | would not actually be called. Maybe that's not a big deal.

nalimilan avatar Aug 26 '20 16:08 nalimilan

I was hoping that DataFramesMeta would just be DataFrames + a macro to create Pairs expressions (e.g. DataFramesMacros), but maybe that's the wrong way to look at it. My concern is that having a barebone DataFrames effectively creates 3 different syntaxes: the simple df.a syntax, the transform syntax, and the @transform syntax.

I need to think more about these proposals. In the meantime, I just wanted to point out that Stata has two different commands for row-wise vs, column-wise transforms (gen v.s. egen). That is something to think about if that would simplify thing, e.g. there could be a row version of transform, called, say, alter.

On Wed, Aug 26, 2020 at 9:41 AM Milan Bouchet-Valat < [email protected]> wrote:

I agree that it would also be OK to keep DataFrames as a relatively low-level package which doesn't offer the most convenient syntax, as long as it provides the building blocks to do anything. Anyway it's clear that the "mini-language" will never be as intuitive as DataFramesMeta's macros. So maybe we should try to see at the same time how we want to make working with missing values easy in DataFramesMeta, and see what we need in DataFrames to allow implementing it.

Adding skipmissing(f::Callable) to Base sounds doable. I'm less sure about filter and getindex skipping missing values, as this goes against what happens everywhere else (missing values are never dropped silently). But we don't need that for DataFrames/DataFramesMeta AFAICT.

The difficulty with finding a good solution in DataFramesMeta is that the equivalent implementations like dplyr rely on the fact that functions are vectorized and/or propagate missing values in totally ad-hoc way. For example, R/dplyr allows mutate(df, y = isna(x) ? "0" : as.character(x)), where as.character propagates missing values, but isna does not. R/dplyr also allows mutate(df, y = x - mean(x)) since - is vectorized, but we require .- in Julia -- which is OK here but can get quite unwieldy with complex commands. The fact that Julia is much more systematic is a strength for advanced users, but it can also be a barrier for more basic use, as R usually does "the right thing" (but when it doesn't it's a pain).

Maybe we could handle this difficulty by having all macros in DataFramesMeta operate row-wise by default, and require a special syntax to indicate that a column should be accessed as a vector. For example, @transform(df, y = x - mean($x)). Then it seems to me that the most intuitive behavior is this:

  • all columns accessed as vectors (with $) are wrapped in a view of non-missing values by default (i.e. skipmissing=true)
  • if some columns are accessed row-wise (no $), set passmissing=true (since we are in a case equivalent to ByRow)
  • if a function applied to a column vector (accessed using $) returns a vector (with its length equal to the number of non-missing rows) rather than a scalar, set passmissing=true

AFAICT this would do "the right thing" in the following situations:

  • @transform(df, y = mean($x)): assign to all rows (including those with missing values) the mean of non-missing values
  • @transform(df, y = x - mean($x)): subtract to each non-missing row the mean of non-missing values
  • @transform(df, y = cumsum($x)): assign to rows with non-missing values the cumulative sum of values for these rows
  • @transform(df, y = s - cumsum($x)): subtract to each non-missing row the cumulative sum of values for these rows

Cases that wouldn't do "the right thing" by default are uses of ismissing (as missing values would always be skipped by default). We decide to be smart and set automatically passmissing=false or skipmissing=false when an expression contains ismissing, or just throw an error with an informative message. A more tricky case are logical operators which implement three-valued logic: with passmissing=true, true | missing would be assigned missing since | would not actually be called. Maybe that's not a big deal.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-680993848, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXI2TMYOSLOB3XBVAWTSCU3KHANCNFSM4OTTLDUA .

matthieugomez avatar Aug 26 '20 17:08 matthieugomez

@nalimilan - thank you for commenting on this. I guess it is best to move the discussion to DataFramesMeta.jl. My questions would be twofold:

  • how would you turn-off this behaviour in DataFramesMeta.jl if it is not wanted
  • in s - cumsum($x) I was not clear what would happen in the scenario s=[missing, missing, 1] and x=[1,2,3].

@matthieugomez - I agree that we could have DataFramesMacros.jl (when it matures) be just an addition to DataFrames.jl and this should work. But for DataFramesMeta.jl I accept it adds new things (just as e.g. queryverse adds) - the user simply has to choose what one wants to use.

However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq defined like:

lt(x,y) = coalesce(x<y, false)

would be super convenient. I would propose to have them in Missings.jl. We would take hold of 5 very short names (bad thing) but would get flexibility in a place where it is much needed, also I think writing lt(x,y) is not that much worse than x < y. If we had these five functions I think that mostly we do not need coalescefalse, as in other cases user can just write x -> coalesce(x, false) since they will be rare anyway.

EDIT: if we went for this also neq

bkamins avatar Aug 26 '20 17:08 bkamins

I agree with @matthieugomez, especially with upcoming multithreading in transform I think it would be very hard to maintain feature parity if the two packages diverged too much.

I think @nalimilan's proposal is good, but is not functionally different from a DropMissing([:x1, :x2]) => fun syntax that could go in the data frames mini-language.

I do agree with Milan that ByRow being the default would solve a lot of these problems.

However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq defined like:

I think a very smart macro could fix this easier.

@missingfalse x > y

pdeffebach avatar Aug 26 '20 17:08 pdeffebach

I think a very smart macro could fix this easier.

@missingfalse x > y

this is clearly doable as it is the same as:

coalesce(x>y, false)

but just a bit longer to write :smile:

My point is - and this is how I understood what @nilshg wanted is to have something that is easy to type. And I feel that writing e.g. lt.(df.col1, df.col2) is not much worse than df.col1 .< df.col2 and much better than @missingfalse df.col1 .< df.col2 or coalesce.(df.col1 .< df.col2, false).

bkamins avatar Aug 26 '20 17:08 bkamins

I just wanted to point out that Stata has two different commands for row-wise vs, column-wise transforms (gen v.s. egen). That is something to think about if that would simplify thing, e.g. there could be a row version of transform, called, say, alter.

@matthieugomez Indeed. My proposal is basically a way to combine Stata's gen and egen in a single expression, using $ to distinguish them. Anyway I think I would prefer specifying something like ByRow in @transform/@select (whatever the exact syntax) than a separate function, as it makes it easier to understand the API than adding functions with totally different names.

Not allowing mixing column-wise and row-wise operation like in my proposal would certainly be simpler. The main drawback would be that one would have to use broadcasted operations, e.g. x .- mean(x) rather than x - mean($x). Maybe that's OK but that increases the complexity.

* how would you turn-off this behaviour in DataFramesMeta.jl if it is not wanted

@bkamins I think you could just pass passmissing and skipmissing arguments to the macro, a bit like we discussed for functions.

* in `s - cumsum($x)` I was not clear what would happen in the scenario `s=[missing, missing, 1]` and `x=[1,2,3]`.

cumsum($x) would become cumsum([1]) == [1], and then we would call cs = cumsum([1]); x -> s - cs on each non-missing row (i.e. the last one with x == 3), filling the first two rows with missing.

However, I feel that with proper composition/higher order functions we can get quite far. E.g. if we find short forms of what @nilshg proposes we can already solve most of the problems. I was thinking about it and even defining normal functions with names lt, leq, gt, geq, eq

I must say I'm not a fan of ad-hoc solutions like that. This would solve some of the issues, but not all of them and people will have to learn both the custom functions and the general mechanism using passmissing. Also we would lose the nice x < y syntax, which is really lousy for a high-level language. A macro like @missingfalse would be better, though it's still a bit verbose for DataFramesMeta I would say.

I agree with @matthieugomez, especially with upcoming multithreading in transform I think it would be very hard to maintain feature parity if the two packages diverged too much.

@pdeffebach Yes that's a concern. Everything DataFramesMeta does should be translatable relatively directly to DataFrames commands so that we don't implement things twice.

I think @nalimilan's proposal is good, but is not functionally different from a DropMissing([:x1, :x2]) => fun syntax that could go in the data frames mini-language.

I do agree with Milan that ByRow being the default would solve a lot of these problems.

Actually I think my proposal is a bit different, as you need to run first column-wise operations and then insert them in the row-wise function. I'm not sure how it could be translated into the select/transform syntax...

nalimilan avatar Aug 26 '20 17:08 nalimilan

OK - given these comments can you explain why you think:

coalesce(x>y, false)

is bad?

bkamins avatar Aug 26 '20 17:08 bkamins

I don't think it's bad as a low-level way of doing things, as it always works and is clear about what it does. I just think for DataFramesMeta we need a general solution which allows you to write the code as if there weren't any missing values (possibly adding an argument/macro to enable that behavior), so that you don't need to constantly adjust the code using various strategies which are hard to remember.

nalimilan avatar Aug 26 '20 17:08 nalimilan

OK - for DataFramesMeta.jl this is fine. But then what for DataFrames.jl?

We started a discussion with adding skipmissing and passmissing keyword arguments. For me - from the discussion - it turns out that other options like coalesce(x>y, false) are not that much longer and they are still clear what is going on as you noted. So is then your suggestion to leave things in DataFrames.jl as they are and just improve the documentation to teach users properly use:

  • skipmissing
  • pasmissing
  • coalesce

? (maybe just adding the skipmissingfun wrapper - probably with a shorter name - as @pdeffebach proposed as it cannot be handled using the options above when you pass multiple columns in source => fun => destination)

bkamins avatar Aug 26 '20 17:08 bkamins

proposed as it cannot be handled using the options above when you pass multiple columns in source => fun => destination)

It can! I wrote a PR for that in Missings that does this.

pdeffebach avatar Aug 26 '20 17:08 pdeffebach

I know this PR, but it does not do what we need in DataFrames.jl:

get three things:

  • a list of vectors as positional arguments
  • a NamedTuple of vectors
  • an AbstractDataFrame

and that solution does not handle them all unfortunately (otherwise this is exactly what we need :+1: )

bkamins avatar Aug 26 '20 17:08 bkamins

What we could do is:

  1. add NamedTuple of vectors support in Missings.jl - @pdeffebach => would you agree to add it there
  2. release Missings.jl
  3. add AbstractDataFrame support in DataFrames.jl - as this would not be a type piracy => @pdeffebach => also you are then welcome to add this method (it should be just a view of this data frame removing rows with missings)

bkamins avatar Aug 26 '20 18:08 bkamins

OK, so given also a short discussion with @pdeffebach on Slack:

  • let us add methods to skipmissings and make a release of Missings.jl - I understand that @pdeffebach would start working on this. It would be good to do it relatively quickly, so that we can have skipmissings added also to DataFrames.jl in 0.22 release.
  • Independent of this a SkipMissing wrapper of the form SkipMissing(source) => fun => destination can be added that will do the following:
    1. take source columns
    2. apply skipmissings to them
    3. perform computation and pseudo broadcasting into length of the input after performing skipmissings
    4. finally map the result of step 3 into original length of source leaving missing values in rows that were skipped by skipmissings

The SkipMissing wrapper would be supported only in select and transform and if GroupedDataFrame was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).

bkamins avatar Aug 27 '20 06:08 bkamins

Sounds like a good idea for DataFrames and probably for Tables.jl in general.

For DataFramesMeta, here's a simpler proposal than the one I made above, and which should be easy to implement using DataFrames functions. @transform/@select/@combine would be either row-wise or column-wise, with some syntax like ByRow/ByCol, @byrow/@bycol or byrow=true/bycol=true to switch between these (not sure which one should be the default, that's a separate discussion). In some cases this is a bit annoying as you need to use broadcasting explicitly when operating column-wise, e.g. in @filter(df, x .> mean(x)) or @transform(df, y = x .- mean(x)) but maybe it's not the end of the world as these are not the most common operations.

Missing values would be propagated or skipped by default. For row-wise operation, (the equivalent of) passmissing=true would be the default (and we could throw an error if ismissing is used to help users). For column-wise operation, skipmissing=true would be the default; additionally for @select/@transform, if the function returns a vector (of the same length as its input), the passmissing=true behavior would be used. This will allow "doing the right thing" for both y = mean(x) and y = x .- mean(x). So in practice ByRow(f) and x -> f.(x) would be equivalent, which is an important property. One would be able to pass passmissing and skipmissing manually as arguments to override the default. This would be needed notably when recoding missing values, using ismissing or things like replace(x, missing => 0) (which is actually the main drawback of propagating missing values).

Does that sound reasonable?

nalimilan avatar Aug 27 '20 10:08 nalimilan

Independent of this a SkipMissing wrapper of the form SkipMissing(source) => fun => destination can be added that will do the following:

  1. take `source` columns
  2. apply `skipmissings` to them
  3. perform computation and pseudo broadcasting into length of the input after performing `skipmissings`
  4. finally map the result of step 3 into original length of `source` leaving `missing` values in rows that were skipped by `skipmissings`

@bkamins What you describe sounds useful in general, and in particular for my last DataFramesMeta proposal (though note I suggest filling all rows and not just non-missing rows when the function returns a scalar). But I think we should find another name for this, as I would expect SkipMissing(f) to be equivalent to f ∘ skipmissing. So this is unlikely to be accepted in Base. SkipAndFillMissing would be accurate but a bit verbose. SkipFillMissing? Or just FillMissing?

The SkipMissing wrapper would be supported only in select and transform and if GroupedDataFrame was passed to it it would not employ the fast path (but I guess this is most likely not a use case for fast path anyway - right?).

The fast path only supports single-row reductions anyway, right? So that doesn't apply here.

nalimilan avatar Aug 27 '20 10:08 nalimilan

@nalimilan - thank you for the comments.

I would expect SkipMissing(f) to be equivalent to f ∘ skipmissing

The point is that @pdeffebach proposed to have SkipMissing not on fun but on source, so you would write SkipMissing([:x1, :x2]) => cor => :cor for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)

I suggest filling all rows and not just non-missing rows when the function returns a scalar

I was afraid that it would be confusing that returning a vector and returning a scalar behave differently. But maybe this is OK (on purpose I excluded combine from the list as in that case we would hit many more problems, with select and transform things are easier).

I think we need to decide here which operation is more useful: your proposal or my proposal. I feel that my proposal gives a more consistent design, but maybe the usefulness of your proposal outweighs this shortcoming (but let us deeply discuss this, not to go back to it later - like we have recurring discussions if select should be by row or by whole column by default - as when we make a decision it will be fixed for a long time - 😄).

In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?

The fast path only supports single-row reductions anyway, right? So that doesn't apply here.

If we accept your proposal (i.e. fill scalars everywhere) then there is no difference. If we accept my proposal (i.e. broadcast scalar only to non-missing slots) then there is a difference between:

select(df, :x , :y => sum ∘ skipmissing)

and

select(df, :x , SkipMissing(:y) => sum)

bkamins avatar Aug 27 '20 10:08 bkamins

The point is that @pdeffebach proposed to have SkipMissing not on fun but on source, so you would write SkipMissing([:x1, :x2]) => cor => :cor for instance, and this allows us to define it more flexibly (but I agree we could also have a better name if we can think of any)

OK. The problem is, SkipMissing(cols) => fun already has a meaning, i.e. apply fun to columns in vector cols, excluding missing values if it contains some. Granted, it's a quite unlikely usage, but still...

In particular - given your semantics how would you achieve my result (i.e. fill only non-missing with the aggregate)?

In DataFramesMeta, you would just pass passmissing=true explicitly. In DataFrames, you would use skipmissing or skipmissings manually instead of using SkipMissing(cols) => fun/SkipFillMissing(cols) => fun/whatever. In both, you could also return fill(value, length(x)) instead of just value.

I agree this is an important point to settle. I suspect that in practice when doing reductions rows with missing values should get the same value as others. This is the case for example if you want to add a variable containing group means. That's what Stata's egen does AFAICT (right @matthieugomez?). Can we find examples where we would like the opposite behavior (i.e. rows with missings are filled with missing instead of with the returned scalar)?

nalimilan avatar Aug 27 '20 12:08 nalimilan

SkipMissing(cols)

Ah - I have forgotten that it is in Base, but it is just unexported. Then yes - we should not change the meaning of this name.

In both, you could also return fill(value, length(x)) instead of just value.

Fair point, except that in ByRow you do not have an access to length(x), especially in GroupDataFrame context. (but you can get it using other means anyway so it would be OK)

Can we find examples where we would like the opposite behavior

Thinking of it - you are probably right. As later if you use this reduction with missing in the original column it will become missing anyway.

The only reduction I can think of is:

transform(df, SkipMissing(:x) => x -> true)

where we get true indicator for non-missing, but of course it does not make much sense to do such a transformation.

bkamins avatar Aug 27 '20 13:08 bkamins

In Stata, mean of a variables creates a variable equal to the mean on non-missing rows, and missing values otherwise. Edit: no sorry it creases the mean for every rows.

I have also realized that, one issue with passmissing for functions that return vectors (as proposed by @nalimilan), is that lag will not work correctly.

matthieugomez avatar Aug 27 '20 14:08 matthieugomez

is that lag will not work correctly.

can you please give an example what you exactly mean?

bkamins avatar Aug 27 '20 14:08 bkamins

@matthieugomez I was trying to write a post comparing

gen y = mean(x)
gey y = mean(x) if !missing(x)

But I don't have a Stata installation at the moment. Can you confirm when the mean is different?

I think the proposed behavior is fairly close to idiomatic stata. Enough that you can translate one-for-one, but I want to make sure .

pdeffebach avatar Aug 27 '20 14:08 pdeffebach

@nalimilan Yes in the second case the mean is only given on rows that are not missing.

matthieugomez avatar Aug 27 '20 14:08 matthieugomez

@bkamins sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]

matthieugomez avatar Aug 27 '20 14:08 matthieugomez

sorry: I expect lag([1, missing]) to return [missing, 1]. I think with @nalimilan’s proposal and passmissing = true within a transform call, it would return [missing, missing]

That's fine, though. lag doesn't error with missing values so there is no need to do it (assuming we don't make DropMissing the default, which I don't think we should yet).

lags are hard anyways, you have to remember grouping etc.

pdeffebach avatar Aug 27 '20 15:08 pdeffebach

Yes, lag/lead/diff are another case that would require opting-out explicitly of missing values propagation (if we make it the default). If we think this kind of situation (in addition to ismissing, etc.) is too common or too confusing, we could avoid it by default and require people to use skipmissing=true/ passmissing=true explicitly. Anyway we could first try this system and later see whether it should be the default.

But for simplicity I think it would be better to have a single argument called skipmissing that would do both, since we concluded that there are no major use cases in which passmissing and skipmissing would need to be separate. For row-wise operation, skipmissing somewhat makes sense as a way to say "only apply the operation to non-missing rows (and fill others with missing)". It's a slight abuse but probably OK. For column-wise operation, skipmissing would be exactly what happens to input columns, and the passmissing behavior would just be a consequence of that: if the function returns a single scalar, we can fill all rows, including missing ones (passmissing=false in our terminology), but if it returns a vector of the same length as its input, we can only fill non-missing rows (passmissing=true in our terminology).

nalimilan avatar Aug 27 '20 15:08 nalimilan

(if we make it the default)

I understand you are talking about DataFramesMeta.jl here - right. In DataFrames.jl I do not think we will make it a default as it would be breaking.

But for simplicity I think it would be better to have a single argument called

Again - I understand you mean it for DataFramesMeta.jl - right? As for data frames, we want to have a wrapper around column selector in source => fun => destination (we need a name as we cannot use SkipMissing).

bkamins avatar Aug 27 '20 15:08 bkamins

Regarding the discussion #2484.

I think that where should have the following specification (I leave out some sanity checks but give a simplified idea of the impementation):

  • where(df::DataFrame, args...) = df[isequal.(.&(eachcol(select(df, args..., copycols=false))), true), :]
  • where(gdf::GroupedDataFrame, args...) = parent(df)[isequal.(.&(eachcol(select(gdf, args..., copycols=false, keepkeys=false))), true), :]

An element of args can only be of the form cols => fun and we transform it to cols => fun => x_i (to make sure we have unique output column names).

Regarding the comment by @nalimilan:

One potential issue is that having where skip missing values but other functions like combine, select and transform not skip them

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

bkamins avatar Oct 19 '20 10:10 bkamins

You need gdf above, as transformations are on the group level when given a grouped data frame above.

I think we can use isequal without addressing transform and select. We aren't skipping missing during the transformation, just when we choose which rows to keep.

pdeffebach avatar Oct 19 '20 13:10 pdeffebach

I don't think we are ready to address transform and select yet. There hasn't been consensus on behavior. better to implement something in Missings.jl here and give users a chance to see if they like it.

pdeffebach avatar Oct 19 '20 13:10 pdeffebach

You need gdf above

fixed, it was a typo

bkamins avatar Oct 19 '20 13:10 bkamins

+1. It would also be nice to have a where! version, as well as a view kwarg for where.

Last thing is the name. I like where. The only issue is that it is used for special syntax in Julia. Is that a problem? In particular, I think it creates some issues with using Lazy’s macro @>, but maybe it can be fixed.

On Mon, Oct 19, 2020 at 6:27 AM Bogumił Kamiński [email protected] wrote:

You need gdf above

fixed, it was a typo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712156073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXPIIASGWALSXHZIEZ3SLQ5FXANCNFSM4OTTLDUA .

matthieugomez avatar Oct 19 '20 15:10 matthieugomez

It would also be nice to have a where! version, as well as a view kwarg for where.

Sure - we would mirror that - thank you for raising this, as it is easy to forget.

I like where

I think if we like it (and I guess we do not have a better name) then we should try to make sure that the "ecosystem" works with it. For sure this is not a problem for DataFramesMeta.jl. @pdeffebach - have you experimented with this name in DataFramesMeta.jl.

bkamins avatar Oct 19 '20 15:10 bkamins

where works with linq, but the Lazy pipings all work with @where so there hasn't been a conflict.

pdeffebach avatar Oct 19 '20 16:10 pdeffebach

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true) is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).

TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.

nalimilan avatar Oct 19 '20 20:10 nalimilan

Is not it what dplyr does already (ie skip missing in filter but not in mutate)?

On Mon, Oct 19, 2020 at 1:13 PM Milan Bouchet-Valat < [email protected]> wrote:

I do not propose to skip missing values. What I propose is to use isequal not == for testing (and this is a logically valid rule). Actually in my codes I normally use isequal because of this for logical conditions as a rule.

We could add a errror_on_missing (or something similar) kwarg to where to choose if we use isequal or == to do the test (but I would still make it default to false, i.e. use isequal by default)

Well I'd say you're playing with words. :-) Selecting rows for with isequal(x, true) is really skipping rows for which x is missing. And I would be inclined to call the argument skipmissing=true instead of errror_on_missing=false (that would be more standard given our terminology).

TBH I'm not really sure I'm opposed to doing that, but I feel it's a bit inconsistent to consider that where shouldn't throw an error in the presence of missing values (when you don't handle them manually), but that combine/select/transform should. Of course it's much simpler and obvious what to do to handle missing values with where, so it's easier and less risky to implement. But in both cases 1) you lose some safety if you didn't expect missing values to be present and they happen to be there, and 2) it's still painful to work with combine/select/transform in the presence of missing values.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaData/DataFrames.jl/issues/2314#issuecomment-712415696, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPPPXNZOOAMIVCDP2VM3TLSLSMX3ANCNFSM4OTTLDUA .

matthieugomez avatar Oct 19 '20 20:10 matthieugomez

Well I'd say you're playing with words. :-)

The point is that we are not skipping missing in the sense how e.g. select would skip them. As for select skipmissing means REMOVE FROM THE COLLECTION or IGNORE THEM while in where it is TREAT THEM AS FALSE. And this is the point of my reasoning. This is a different operation that we perform (skipping missings in the sense of select in where would make no sense).

skipmissing=true

For the above reason I have not proposed skipmissing kwarg but some other name (of course I am not attached to the proposed one as it is not nice obviously).

My point is that the decision what to do in select and how to implement where are logically orthogonal although they have very similar end effect.

bkamins avatar Oct 19 '20 21:10 bkamins

Regarding where for GroupedDataFrame we have two options:

  1. it would produce rows in the order of the parent (i.e. it would internally rely on select; in this scenario subsetting groups in GroupedDataFrame is not allowed)
  2. it would produce rows in the order of groups (i.e. it would internally rely on combine; in this scenario subsetting groups in GroupedDataFrame is allowed)

Which one do we prefer?

bkamins avatar Oct 20 '20 19:10 bkamins

  1. As it's what I've already implemented in DataFramesMeta. But in general i think its good to not re-order things too much, and this is way closer to the mental model of select then filter.

pdeffebach avatar Oct 20 '20 20:10 pdeffebach