DataFrames.jl Improwe workflows with filtered DataFrame

This has been discussed in several places I create a separate issue for this to keep track of it as it is an important functionality I think. What we want is filter(predicate, df, view=true) to return a view (so then we can conveniently update this view for example).

Aug 05 '20 18:08 bkamins

Now given https://github.com/JuliaData/DataFrames.jl/issues/2211#issuecomment-669730638 and https://github.com/JuliaData/DataFrames.jl/issues/2211#issuecomment-669730638

my question is if we should not also add dropmissing keyword argument that would:

if nothing do what we have now
if true then missings would be excluded from the selection
if false then missings would be included in the selection

This is essentially as adding coalesce wrapper to the filtering function (so maybe we feel that it is not needed), but maybe you will judge that:

filter(:x => >(1), df, dropmissing=yes)

reads better than:

filter(:x => x -> coalesce(x > 1, false), df)

we do not save typing here, but maybe you will find it more readable?

CC @nalimilan @matthieugomez @pdeffebach

Aug 06 '20 07:08 bkamins

I think I would prefer a kwarg skipmissing, where true corresponds to dropmissing = true, and false corresponds to dropmissing = nothing. I am not sure it is worth introducing a new kwarg dropmissing just to get the behavior dropmissing = false.

More generally, I think that the decision (and also the default of the kwarg option) should be taken in conjonction with the discussion for transform/select etc... https://github.com/JuliaData/DataFrames.jl/issues/2314

Aug 06 '20 14:08 matthieugomez

A good point. But I would not alter the defaults (sorry for this position here but I do want to avoid breaking changes of this kind till 1.0).

Aug 06 '20 16:08 bkamins

The only issue is that skipmissing in select etc. will have a bit different behavior than skipmissing here.

Aug 06 '20 16:08 bkamins

Yes. Personally, I prefer an argument that always have the same name (skipmissing), even if it has slightly different meanings in different contexts, rather than a different keyword argument everytime. I am not a fan on how R has na.rm, na.exclude, na.omit etc.

Aug 06 '20 16:08 matthieugomez

(ok - and I would appreciate a comment in #2314 what you think should be the results on the cases I have listed there :smile:)

Aug 06 '20 16:08 bkamins

I was thinking about these things and I would close #2211, #2314 and #2258 in favour of this issue which I would aim to solve all the issues raised in one consistent design.

So my proposal would be the following:

in filter and dropmissing functions if they take DataFrame or SubDataFrame with index Index (not just any AbstractDataFrame, and not for filter! and dropmissing!) add view::Bool=flase keyword argument. Of course view also allows to create it
if view=true then SubDataFrame is created with Index
We would add support for ! functions for such SubDataFrame (now it is an error) (here is a big task to list all that we feel it makes sense to support) and several more special cases:
- setproperty!, setindex!, select!, transform!, sort!, filter!, dropmissing! (where the schema of the parent is allowed to be updated by the operation with the rules that rows excluded by view are left untouched and if new columns are created they are filled with missing); if such functions return a source data frame then SubDataFrame does not get unwrapped and calling parent to require explicit unwrapping is required (this is because otherwise it is problematic in my opinion to ensure consistent and intuitive for entry-level users behaviour between transform and transform!) - this is what I would call "sticky behaviour"
- in particular it is allowed to do groupby on SubDataFrame with Index and it will have the same rules for select! and transform! as described above

So the only thing that does not go as you want is "sticky behavior", which I think we should have because of two reasons:

you can then chain operations and you do not have to filter rows again
we get a consistent behavior between ! and no-! functions; which I believe will be easier to understand for users

The decision to be made is if we want automatic column promotion for such SubDataFrame, but I think we do not want it and an error should be thrown in cases where promotion would be needed (this is what we currently do in other places).

A side benefit is that it will be much easier to implement it than doing all new stuff for WhereDataFrame and users will not have to learn one more type, which we already have many.

Now how it addresses the issues I mention:

#2211 : it will just be what is requested there
#2314 : using dropmissing with view=true will just do this, and it would be sticky as requested by @nalimilan
#2258 : it will just be what is requested there except that unwrapping will require calling parent after the operation, and by default we are sticky (I think it is not a great problem)

What do you think?

Aug 08 '20 09:08 bkamins

Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true) everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values. Recommending that pattern as the standard way to skip missing values could even be dangerous, as people would be likely to do transform(dropmissing(df, view=true), :a => ... => :a_new) when they mean transform(dropmissing(df, :a, view=true), :a => ... => :a_new).

Aug 08 '20 20:08 nalimilan

Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true) everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values.

I agree with @nalimilan. I think I have been confusing two distinct issues in my comments, which I apologize for.

Operations like "create a variable that is the average income tax payed by people, but only for women". For this you would use a view with all of the operations you described above.
Operations on data with many missing values, where you don't want to use passmissing and skipmissing everywhere. Actually doing a view here would require a lot of repeating which variables are the problem. For this you would use the skipmissing = true keyword argument, potentially.

Aug 08 '20 22:08 pdeffebach

Ah - OK. So for #2258 and #2314 it would be a "partial fix" (via a more general mechanism, which has it as a special case, but I agree we can think of more convenient patterns).

So do you think what I propose is OK for resolving #2211 (and having a partial solution for the other issues does not hurt, but of course let us discuss more convenient options)?

Aug 08 '20 22:08 bkamins

If the goal is to be able to make it easier to replace certain rows of a column if a condition is satisfied, I am not sure it is worth it.

That is because I do not think that the pattern

parent(transform!(filter(:x => >(1), df, view = true), :x => 1)))

is simpler than

transform!(df, :x => x-> ifelse.(x .>= 1, 1, x))

Aug 09 '20 16:08 matthieugomez

Yes, but all that was asked for in #2211 can be handled by ifelse and this is the currently (i.e. before adding views) recommended pattern.

@pdeffebach - given the discussion we have what use cases of #2211 do you see?

(even if we do not add #2211 functionality, still we can add view kwarg to filter and dropmissing as this is useful in its own rights, as it is a memory efficient way to filter a data frame)

Aug 09 '20 17:08 bkamins

@pdeffebach - given the discussion we have what use cases of #2211 do you see?

I agree with you both. i don't think this is that important.

I do like the way stata reads gen x if y > z. But that convenience can be handled by DataFramesMeta potentially.

Therefore I agree that we should continue thinking about better ways to make skipmissing work easily inside transform.

Aug 09 '20 17:08 pdeffebach

OK - so do we close #2211, or keep it open to get back to it in the future? (for me it is easier not to have #2211 as it will complicate codebase because it requires special handling in many functions, but if you feel we can go back to it some day then let us keep it open)

we should continue thinking about better ways to make skipmissing work easily inside transform.

Agreed - and thank you for spending your time on this.

Aug 09 '20 17:08 bkamins

Let's close it if it's easiest. I think DataFramesMeta can handle a lot of it's points.

Aug 09 '20 17:08 pdeffebach

Now thinking of it I realized that the issue also is that ifelse is limited in what you can express with it, and filter can take a quite complex predicate. But maybe this does not justify having it.

Also ifelse is eager, so it evaluates left and right hand side therefore the following fails:

julia> df = DataFrame(x = -3:3)
7×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ -3    │
│ 2   │ -2    │
│ 3   │ -1    │
│ 4   │ 0     │
│ 5   │ 1     │
│ 6   │ 2     │
│ 7   │ 3     │

julia> transform(df, :x => x-> ifelse.(x .> 0, log.(x), missing))
ERROR: DomainError with -3.0:
log will only return a complex result if called with a complex argument. Try log(Complex(x)).

Aug 09 '20 17:08 bkamins

Yes that is one problem. I've had that before and been frustrated by it.

We really need a lazy version of this in Base to be honest...

Another issue is that that "generate a standardized income index, among women, for all women" is tough.

ifelse.(woman == 1, standardized_index(income, wealth), missing)

The income and wealth aren't subsetted by gender above, whereas a View of the data frame would fix that problem.

Then again, I also am not sure what

gen income_index = ... if woman == 1

does in stata. I don't know how the variables are subsetted.

Aug 09 '20 18:08 pdeffebach

Right, I did not think about that . The ifelse pattern can only be used in transform for functions that are elementwise, not functions that compute reductions.

Aug 09 '20 18:08 matthieugomez

But ultimately I agree that this kind of ifelse pattern could be handled by a macro or some sort of struct with lazy broadcasting. It's a tool that has general convenience beyond dataframes and thus doesn't have to live here.

Aug 10 '20 17:08 pdeffebach

Yes, but it should not be ifelse but something else. ifelse is eager on purpose, as it makes it fast, because then it avoids branching in generated native code.

Aug 10 '20 18:08 bkamins

Given the decision in #2314 do we need the functionality described in #2211 or not?

Or maybe it is enough if we have this functionality for setindex!, insertcols! and broadcasted assignment (this would be relatively easy) but keep disallowing it in transform! and select! (this will be more complex both for SubDataFrame and GroupedDataFrame{SubDataFrame} if SubDataFrame is based on Index)?

(i.e. do we need mutating views - note that I recently noticed that we already had this for rename! but it was not intentional - now it is documented)

Aug 22 '20 22:08 bkamins

DataFrames.jl DataFrames.jl copied to clipboard

Improwe workflows with filtered DataFrame

DataFrames.jl
DataFrames.jl copied to clipboard