DataFrames.jl
DataFrames.jl copied to clipboard
Improwe workflows with filtered DataFrame
This has been discussed in several places I create a separate issue for this to keep track of it as it is an important functionality I think. What we want is filter(predicate, df, view=true)
to return a view (so then we can conveniently update this view for example).
Now given https://github.com/JuliaData/DataFrames.jl/issues/2211#issuecomment-669730638 and https://github.com/JuliaData/DataFrames.jl/issues/2211#issuecomment-669730638
my question is if we should not also add dropmissing
keyword argument that would:
- if
nothing
do what we have now - if
true
then missings would be excluded from the selection - if
false
then missings would be included in the selection
This is essentially as adding coalesce
wrapper to the filtering function (so maybe we feel that it is not needed), but maybe you will judge that:
filter(:x => >(1), df, dropmissing=yes)
reads better than:
filter(:x => x -> coalesce(x > 1, false), df)
we do not save typing here, but maybe you will find it more readable?
CC @nalimilan @matthieugomez @pdeffebach
I think I would prefer a kwarg skipmissing
, where true corresponds to dropmissing = true, and false corresponds to dropmissing = nothing. I am not sure it is worth introducing a new kwarg dropmissing
just to get the behavior dropmissing = false.
More generally, I think that the decision (and also the default of the kwarg option) should be taken in conjonction with the discussion for transform/select etc... https://github.com/JuliaData/DataFrames.jl/issues/2314
A good point. But I would not alter the defaults (sorry for this position here but I do want to avoid breaking changes of this kind till 1.0).
The only issue is that skipmissing
in select
etc. will have a bit different behavior than skipmissing
here.
Yes. Personally, I prefer an argument that always have the same name (skipmissing), even if it has slightly different meanings in different contexts, rather than a different keyword argument everytime. I am not a fan on how R has na.rm, na.exclude, na.omit etc.
(ok - and I would appreciate a comment in #2314 what you think should be the results on the cases I have listed there :smile:)
I was thinking about these things and I would close #2211, #2314 and #2258 in favour of this issue which I would aim to solve all the issues raised in one consistent design.
So my proposal would be the following:
- in
filter
anddropmissing
functions if they takeDataFrame
orSubDataFrame
with indexIndex
(not just anyAbstractDataFrame
, and not forfilter!
anddropmissing!
) addview::Bool=flase
keyword argument. Of courseview
also allows to create it - if
view=true
thenSubDataFrame
is created withIndex
- We would add support for
!
functions for suchSubDataFrame
(now it is an error) (here is a big task to list all that we feel it makes sense to support) and several more special cases:-
setproperty!
,setindex!
,select!
,transform!
,sort!
,filter!
,dropmissing!
(where the schema of the parent is allowed to be updated by the operation with the rules that rows excluded byview
are left untouched and if new columns are created they are filled withmissing
); if such functions return a source data frame thenSubDataFrame
does not get unwrapped and callingparent
to require explicit unwrapping is required (this is because otherwise it is problematic in my opinion to ensure consistent and intuitive for entry-level users behaviour betweentransform
andtransform!
) - this is what I would call "sticky behaviour" - in particular it is allowed to do
groupby
onSubDataFrame
withIndex
and it will have the same rules forselect!
andtransform!
as described above
-
So the only thing that does not go as you want is "sticky behavior", which I think we should have because of two reasons:
- you can then chain operations and you do not have to filter rows again
- we get a consistent behavior between
!
and no-!
functions; which I believe will be easier to understand for users
The decision to be made is if we want automatic column promotion for such SubDataFrame
, but I think we do not want it and an error should be thrown in cases where promotion would be needed (this is what we currently do in other places).
A side benefit is that it will be much easier to implement it than doing all new stuff for WhereDataFrame
and users will not have to learn one more type, which we already have many.
Now how it addresses the issues I mention:
- #2211 : it will just be what is requested there
- #2314 : using
dropmissing
withview=true
will just do this, and it would be sticky as requested by @nalimilan - #2258 : it will just be what is requested there except that unwrapping will require calling
parent
after the operation, and by default we are sticky (I think it is not a great problem)
What do you think?
Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true)
everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values. Recommending that pattern as the standard way to skip missing values could even be dangerous, as people would be likely to do transform(dropmissing(df, view=true), :a => ... => :a_new)
when they mean transform(dropmissing(df, :a, view=true), :a => ... => :a_new)
.
Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating
dropmissing(df, view=true)
everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values.
I agree with @nalimilan. I think I have been confusing two distinct issues in my comments, which I apologize for.
- Operations like "create a variable that is the average income tax payed by people, but only for women". For this you would use a
view
with all of the operations you described above. - Operations on data with many missing values, where you don't want to use
passmissing
andskipmissing
everywhere. Actually doing aview
here would require a lot of repeating which variables are the problem. For this you would use theskipmissing = true
keyword argument, potentially.
Ah - OK. So for #2258 and #2314 it would be a "partial fix" (via a more general mechanism, which has it as a special case, but I agree we can think of more convenient patterns).
So do you think what I propose is OK for resolving #2211 (and having a partial solution for the other issues does not hurt, but of course let us discuss more convenient options)?
If the goal is to be able to make it easier to replace certain rows of a column if a condition is satisfied, I am not sure it is worth it.
That is because I do not think that the pattern
parent(transform!(filter(:x => >(1), df, view = true), :x => 1)))
is simpler than
transform!(df, :x => x-> ifelse.(x .>= 1, 1, x))
Yes, but all that was asked for in #2211 can be handled by ifelse
and this is the currently (i.e. before adding views) recommended pattern.
@pdeffebach - given the discussion we have what use cases of #2211 do you see?
(even if we do not add #2211 functionality, still we can add view
kwarg to filter
and dropmissing
as this is useful in its own rights, as it is a memory efficient way to filter a data frame)
@pdeffebach - given the discussion we have what use cases of #2211 do you see?
I agree with you both. i don't think this is that important.
I do like the way stata reads gen x if y > z
. But that convenience can be handled by DataFramesMeta potentially.
Therefore I agree that we should continue thinking about better ways to make skipmissing
work easily inside transform
.
OK - so do we close #2211, or keep it open to get back to it in the future? (for me it is easier not to have #2211 as it will complicate codebase because it requires special handling in many functions, but if you feel we can go back to it some day then let us keep it open)
we should continue thinking about better ways to make
skipmissing
work easily insidetransform
.
Agreed - and thank you for spending your time on this.
Let's close it if it's easiest. I think DataFramesMeta can handle a lot of it's points.
Now thinking of it I realized that the issue also is that ifelse
is limited in what you can express with it, and filter
can take a quite complex predicate. But maybe this does not justify having it.
Also ifelse
is eager, so it evaluates left and right hand side therefore the following fails:
julia> df = DataFrame(x = -3:3)
7×1 DataFrame
│ Row │ x │
│ │ Int64 │
├─────┼───────┤
│ 1 │ -3 │
│ 2 │ -2 │
│ 3 │ -1 │
│ 4 │ 0 │
│ 5 │ 1 │
│ 6 │ 2 │
│ 7 │ 3 │
julia> transform(df, :x => x-> ifelse.(x .> 0, log.(x), missing))
ERROR: DomainError with -3.0:
log will only return a complex result if called with a complex argument. Try log(Complex(x)).
Yes that is one problem. I've had that before and been frustrated by it.
We really need a lazy version of this in Base to be honest...
Another issue is that that "generate a standardized income index, among women, for all women" is tough.
ifelse.(woman == 1, standardized_index(income, wealth), missing)
The income
and wealth
aren't subsetted by gender above, whereas a View
of the data frame would fix that problem.
Then again, I also am not sure what
gen income_index = ... if woman == 1
does in stata. I don't know how the variables are subsetted.
Right, I did not think about that . The ifelse pattern can only be used in transform for functions that are elementwise, not functions that compute reductions.
But ultimately I agree that this kind of ifelse pattern could be handled by a macro or some sort of struct with lazy broadcasting. It's a tool that has general convenience beyond dataframes and thus doesn't have to live here.
Yes, but it should not be ifelse
but something else. ifelse
is eager on purpose, as it makes it fast, because then it avoids branching in generated native code.
Given the decision in #2314 do we need the functionality described in #2211 or not?
Or maybe it is enough if we have this functionality for setindex!
, insertcols!
and broadcasted assignment (this would be relatively easy) but keep disallowing it in transform!
and select!
(this will be more complex both for SubDataFrame
and GroupedDataFrame{SubDataFrame}
if SubDataFrame
is based on Index
)?
(i.e. do we need mutating view
s - note that I recently noticed that we already had this for rename!
but it was not intentional - now it is documented)