cudf
cudf copied to clipboard
[BUG] Cannot find maxima of a categorical series
Describe the bug
While debugging some strange dask-expr + cudf sorting behavior, I realized that I cannot call ser.min()
when ser
is a categorical Series
. This is a problem for the sort_values
logic used in dask.
Steps/Code to reproduce bug
import cudf
ser = cudf.Series(range(10), dtype="category")
ser.min()
# Same problem with `ser.cat.as_ordered().min()`
...
TypeError: Cannot interpret 'CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ordered=False, categories_dtype=int64)' as a data type
Expected behavior
I'd expect for min
/max
to work (assuming the dtype is "ordered").
Hmmm. Can you try:
diff --git a/python/cudf/cudf/core/column/categorical.py b/python/cudf/cudf/core/column/categorical.py
index e3e7303504..fc996e6b6a 100644
--- a/python/cudf/cudf/core/column/categorical.py
+++ b/python/cudf/cudf/core/column/categorical.py
@@ -515,6 +515,10 @@ class CategoricalColumn(column.ColumnBase):
dtype: cudf.core.dtypes.CategoricalDtype
_codes: Optional[NumericalColumn]
_children: Tuple[NumericalColumn]
+ _VALID_REDUCTIONS = {
+ "max",
+ "min",
+ }
_VALID_BINARY_OPERATIONS = {
"__eq__",
"__ne__",
@@ -699,6 +703,27 @@ class CategoricalColumn(column.ColumnBase):
),
)
+ def _reduce(
+ self,
+ op: str,
+ skipna: Optional[bool] = None,
+ min_count: int = 0,
+ *args,
+ **kwargs,
+ ) -> ScalarLike:
+ # Only valid reductions are min and max
+ if not self.ordered:
+ raise TypeError(
+ "Categorical is not ordered for operation min "
+ "you can use .as_ordered() to change the Categorical "
+ "to an ordered one."
+ )
+ return self._encode(
+ self.codes._reduce(
+ op=op, skipna=skipna, min_count=min_count, *args, **kwargs
+ )
+ )
+
def _binaryop(self, other: ColumnBinaryOperand, op: str) -> ColumnBase:
other = self._wrap_binop_normalization(other)
# TODO: This is currently just here to make mypy happy, but eventually
Aside, it is mind-boggling to me that unordered_cat.sort_values()
is allowed, but unordered_cat.min()
is not. How can you sort if you can't produce a minimum element?
@rjzamora any chance you tried out @wence- 's snippet above?
Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?
A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.
any chance you tried out @wence- 's snippet above?
Not yet - Thanks for suggestion @wence-! I will test this out soon.
A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.
I did some digging, it's a combination of matching R's factor API and implementation leaking into semantics. See discussions https://github.com/pandas-dev/pandas/pull/9611, https://github.com/pandas-dev/pandas/pull/9622, and https://github.com/pandas-dev/pandas/issues/12785
Effectively by having sort-based groupby be a promise, you're backed into a corner by having unordered categoricals as groupby keys. Of course, the right answer is to say "no can do", but ...
Added a test to the patch suggested by @wence- and submitted https://github.com/rapidsai/cudf/pull/15701