cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[BUG] Cannot find maxima of a categorical series

Open rjzamora opened this issue 9 months ago • 5 comments

Describe the bug While debugging some strange dask-expr + cudf sorting behavior, I realized that I cannot call ser.min() when ser is a categorical Series. This is a problem for the sort_values logic used in dask.

Steps/Code to reproduce bug

import cudf

ser = cudf.Series(range(10), dtype="category")
ser.min()
# Same problem with `ser.cat.as_ordered().min()`
...
TypeError: Cannot interpret 'CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ordered=False, categories_dtype=int64)' as a data type

Expected behavior I'd expect for min/max to work (assuming the dtype is "ordered").

rjzamora avatar May 02 '24 23:05 rjzamora

Hmmm. Can you try:

diff --git a/python/cudf/cudf/core/column/categorical.py b/python/cudf/cudf/core/column/categorical.py
index e3e7303504..fc996e6b6a 100644
--- a/python/cudf/cudf/core/column/categorical.py
+++ b/python/cudf/cudf/core/column/categorical.py
@@ -515,6 +515,10 @@ class CategoricalColumn(column.ColumnBase):
     dtype: cudf.core.dtypes.CategoricalDtype
     _codes: Optional[NumericalColumn]
     _children: Tuple[NumericalColumn]
+    _VALID_REDUCTIONS = {
+        "max",
+        "min",
+    }
     _VALID_BINARY_OPERATIONS = {
         "__eq__",
         "__ne__",
@@ -699,6 +703,27 @@ class CategoricalColumn(column.ColumnBase):
             ),
         )
 
+    def _reduce(
+        self,
+        op: str,
+        skipna: Optional[bool] = None,
+        min_count: int = 0,
+        *args,
+        **kwargs,
+    ) -> ScalarLike:
+        # Only valid reductions are min and max
+        if not self.ordered:
+            raise TypeError(
+                "Categorical is not ordered for operation min "
+                "you can use .as_ordered() to change the Categorical "
+                "to an ordered one."
+            )
+        return self._encode(
+            self.codes._reduce(
+                op=op, skipna=skipna, min_count=min_count, *args, **kwargs
+            )
+        )
+
     def _binaryop(self, other: ColumnBinaryOperand, op: str) -> ColumnBase:
         other = self._wrap_binop_normalization(other)
         # TODO: This is currently just here to make mypy happy, but eventually

Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?

wence- avatar May 03 '24 10:05 wence-

@rjzamora any chance you tried out @wence- 's snippet above?

Aside, it is mind-boggling to me that unordered_cat.sort_values() is allowed, but unordered_cat.min() is not. How can you sort if you can't produce a minimum element?

A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.

vyasr avatar May 07 '24 23:05 vyasr

any chance you tried out @wence- 's snippet above?

Not yet - Thanks for suggestion @wence-! I will test this out soon.

rjzamora avatar May 08 '24 02:05 rjzamora

A weak ordering and we don't like ties??? I don't know, I agree that this seems nonsensical on its face.

I did some digging, it's a combination of matching R's factor API and implementation leaking into semantics. See discussions https://github.com/pandas-dev/pandas/pull/9611, https://github.com/pandas-dev/pandas/pull/9622, and https://github.com/pandas-dev/pandas/issues/12785

Effectively by having sort-based groupby be a promise, you're backed into a corner by having unordered categoricals as groupby keys. Of course, the right answer is to say "no can do", but ...

wence- avatar May 08 '24 08:05 wence-

Added a test to the patch suggested by @wence- and submitted https://github.com/rapidsai/cudf/pull/15701

rjzamora avatar May 08 '24 14:05 rjzamora