DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Opposite of skipmissing=true

Open jariji opened this issue 1 year ago • 5 comments

subset takes skipmissing=:

If skipmissing=false (the default) args are required to produce vectors containing only Bool values. If skipmissing=true, additionally missing is allowed and it is treated as false (i.e. rows for which one of the conditions returns missing are skipped).

This means that the convenient options are error or exclude but there is no include option. On the other hand matchmissing has all three options: :equal/:notequal/:error. Sometimes I want to include the missing values in the result of subset. I can do it with coalesce but that is verbose.

jariji avatar Jul 15 '22 06:07 jariji

Thank you for the comment. Indeed this is a bit problematic. The context is that matchmissing is a new kwarg, while skipmissing is also available in Base Julia and it is a keyword in groupby.

We could make skipmissing=false keep missings, and make skipmissing=nothing as a third option that errors that is the default (this is what we did with sort in groupby) -> it would be slightly breaking, but I think we could classify it as an acceptable change, since skipmissing=false lexically means that one wants to keep missing (and this is what happens in groupby so now we have an inconsistency between subset and groupby); I think also that making this change is not "risky".

@nalimilan, @pdeffebach - what do you think?

bkamins avatar Jul 15 '22 06:07 bkamins

I'd find it a bit misleading to have skipmissing=false mean "treat missing as true", as it's a more radical decision than simply letting missing pass through. for that reason I don't think there's really an inconsistency with groupby.

We could use other values or an additional argument, but if so we should find a pattern that works for several functions.

nalimilan avatar Aug 26 '22 07:08 nalimilan

On the other hand matchmissing has all three options: :equal/:notequal/:error.

Indeed, because we felt that three options are valid in joins. Note that kwarg is named matchmissing because of this.

Sometimes I want to include the missing values in the result of subset. I can do it with coalesce but that is verbose.

The issue is that skipmissing is a function in Base Julia that skips missing, thus the Bool behavior of skipping missing if this is set to true.

@nalimilan - we use skipmissing in this context only in subset. Other use of it is in groupby, but there two-valued logic makes sense. So this means that this discussion is largely only about subset.

What we could add is keepmissing::Bool kwarg. And if keepmissing=true and skipmissing=true we would throw an error. What do you think?

bkamins avatar Dec 27 '22 21:12 bkamins

Yeah. The other option would be to add something like missing=:skip/:keep/:error. The advantage would be that the same pattern could be used with groupby (consistency is important, in R skipping missings is a total mess and that's annoying). But I'm not sure there are other functions where it would make sense in the ecosystem. missing=:skip/:keep/:error would make sense for freqtable in FreqTables, but for pairwise in StatsBase we would need missing=:skippairwise/:skiplistwise/:keep/:error, where most values are different and it's a bit verbose. Also symbols are annoying to work with in DataFramesMeta as they refer to columns when not escaped.

nalimilan avatar Dec 29 '22 12:12 nalimilan

This is a valid point, but for subset exactly the interaction with DataFramesMeta.jl is important as you note, where using Symbol is not very convenient.

I think what we have so far is consistent, i.e. passing skipmissing=true skips missing everywhere in the ecosystem. Also I think if we add keepmissing it would be made consistent also. E.g. we could add keepmissing in groupby also without a problem.

The only difference would be that keepmissing=true would be the default in groupby, while in subset the default would be keepmissing=false (for backward compatibility).

Then we would have a table:

  • keep missing: keepmissing && !skipmissing
  • skip missing: !keepmissing && skipmissing
  • error if missing is present: !keepmissing && !skipmissing
  • wrong call keepmissing && skipmissing

So, given the two options, I would prefer to add Bool keepmissing kwarg rather than Symbol missing kwarg (adding missing kwarg would mean that skipmissing kwarg should be deprecated, but this would be a minor issue).

bkamins avatar Dec 29 '22 13:12 bkamins