pandas2
Make NA/null a first-class citizen in groupby operations
xref #9
Maybe we can collect a list of pandas issues that have come up in and around this.
- https://github.com/pydata/pandas/issues/14170
I've found it's valuable to be able to consistently compute statistics including the NA values, especially with multiple group keys. I haven't kept track of how pandas handles these now in all cases, but it would be nice to come up with a strategy to make NA behave like any other group in a group by setting.
This problem also extends to other analytics, like `value_counts`:

```python
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: s = pd.Series([1, 2, np.nan, 1, 1, 2, np.nan])

In [4]: s.value_counts()
Out[4]:
1.0    3
2.0    2
dtype: int64
```
Here, NA should appear in the result with a count of 2. The same goes for `groupby(...).size()`.
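To make the dropping behavior concrete, here is a minimal sketch (assuming a groupby where NA appears among the group keys; the column names are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1.0, 1.0, np.nan, 2.0, np.nan],
                   "val": [10, 20, 30, 40, 50]})

# Rows whose group key is NA are silently dropped from the result:
sizes = df.groupby("key").size()
print(sizes)
# The NaN key (2 rows) does not appear; only the 1.0 and 2.0 groups do.
```

So two of the five rows simply vanish from the aggregation, which is easy to miss with multiple group keys.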
In the specific case of `value_counts`, there is the `dropna` keyword which does this:

```python
In [15]: s.value_counts(dropna=False)
Out[15]:
1.0    3
NaN    2
2.0    2
dtype: int64
```
But of course that does not address the bigger problem with groupby and others (and you could also argue whether `dropna=False` would be a better default ...).
It's linked in the top issue, but just for visibility: https://github.com/pydata/pandas/pull/12607 is a WIP PR that would add the `dropna` keyword argument to `groupby`.