DataFrames.jl
DataFrames.jl copied to clipboard
Opposite of skipmissing=true
subset
takes skipmissing=
:
If skipmissing=false (the default) args are required to produce vectors containing only Bool values. If skipmissing=true, additionally missing is allowed and it is treated as false (i.e. rows for which one of the conditions returns missing are skipped).
This means that the convenient options are error or exclude but there is no include option. On the other hand matchmissing
has all three options: :equal/:notequal/:error
. Sometimes I want to include the missing values in the result of subset
. I can do it with coalesce
but that is verbose.
Thank you for the comment. Indeed this is a bit problematic. The context is that matchmissing
is a new kwarg, while skipmissing
is also available in Base Julia and it is a keyword in groupby
.
We could make skipmissing=false
keep missings, and make skipmissing=nothing
as a third option that errors that is the default (this is what we did with sort
in groupby
) -> it would be slightly breaking, but I think we could classify it as an acceptable change, since skipmissing=false
lexically means that one wants to keep missing (and this is what happens in groupby
so now we have an inconsistency between subset
and groupby
); I think also that making this change is not "risky".
@nalimilan, @pdeffebach - what do you think?
I'd find it a bit misleading to have skipmissing=false
mean "treat missing
as true
", as it's a more radical decision than simply letting missing
pass through. for that reason I don't think there's really an inconsistency with groupby
.
We could use other values or an additional argument, but if so we should find a pattern that works for several functions.
On the other hand
matchmissing
has all three options::equal
/:notequal
/:error
.
Indeed, because we felt that three options are valid in joins. Note that kwarg is named matchmissing
because of this.
Sometimes I want to include the
missing
values in the result ofsubset
. I can do it withcoalesce
but that is verbose.
The issue is that skipmissing
is a function in Base Julia that skips missing, thus the Bool
behavior of skipping missing if this is set to true
.
@nalimilan - we use skipmissing
in this context only in subset
. Other use of it is in groupby
, but there two-valued logic makes sense. So this means that this discussion is largely only about subset
.
What we could add is keepmissing::Bool
kwarg. And if keepmissing=true
and skipmissing=true
we would throw an error.
What do you think?
Yeah. The other option would be to add something like missing=:skip/:keep/:error
. The advantage would be that the same pattern could be used with groupby
(consistency is important, in R skipping missings is a total mess and that's annoying). But I'm not sure there are other functions where it would make sense in the ecosystem. missing=:skip/:keep/:error
would make sense for freqtable
in FreqTables, but for pairwise
in StatsBase we would need missing=:skippairwise/:skiplistwise/:keep/:error
, where most values are different and it's a bit verbose. Also symbols are annoying to work with in DataFramesMeta as they refer to columns when not escaped.
This is a valid point, but for subset
exactly the interaction with DataFramesMeta.jl is important as you note, where using Symbol
is not very convenient.
I think what we have so far is consistent, i.e. passing skipmissing=true
skips missing everywhere in the ecosystem.
Also I think if we add keepmissing
it would be made consistent also. E.g. we could add keepmissing
in groupby
also without a problem.
The only difference would be that keepmissing=true
would be the default in groupby
, while in subset
the default would be keepmissing=false
(for backward compatibility).
Then we would have a table:
- keep missing:
keepmissing && !skipmissing
- skip missing:
!keepmissing && skipmissing
- error if missing is present:
!keepmissing && !skipmissing
- wrong call
keepmissing && skipmissing
So, given the two options, I would prefer to add Bool
keepmissing
kwarg rather than Symbol
missing
kwarg (adding missing
kwarg would mean that skipmissing
kwarg should be deprecated, but this would be a minor issue).