pandas2
Make NA/null a first-class citizen in groupby operations
xref #9
Maybe we can collect a list of pandas issues that have come up in and around this.
- https://github.com/pydata/pandas/issues/14170
I've found it's valuable to be able to consistently compute statistics including the NA values, especially with multiple group keys. I haven't kept track of how pandas handles these now in all cases, but it would be nice to come up with a strategy to make NA behave like any other group in a group by setting.
This problem also extends to other analytics, like `value_counts`:

```python
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: s = pd.Series([1, 2, np.nan, 1, 1, 2, np.nan])

In [4]: s.value_counts()
Out[4]:
1.0    3
2.0    2
dtype: int64
```
Here, NA should appear in the result with a count of 2. The same goes for `groupby(...).size()`.
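To make the dropping behavior concrete, here is a minimal sketch (assuming a groupby where NA appears among the group keys; the column names are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1.0, 1.0, np.nan, 2.0, np.nan],
                   "val": [10, 20, 30, 40, 50]})

# Rows whose group key is NA are silently dropped from the result:
sizes = df.groupby("key").size()
print(sizes)
# The NaN key (2 rows) does not appear; only the 1.0 and 2.0 groups do.
```

So two of the five rows simply vanish from the aggregation, which is easy to miss with multiple group keys.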
In the specific case of `value_counts`, there is the `dropna` keyword which does this:

```python
In [15]: s.value_counts(dropna=False)
Out[15]:
1.0    3
NaN    2
2.0    2
dtype: int64
```
But of course that does not address the bigger problem with groupby and others (and you could also argue whether `dropna=False` would be a better default ...).
It's linked in the top issue, but just for visibility: https://github.com/pydata/pandas/pull/12607 is a WIP PR that would add the `dropna` keyword argument to `groupby`.