dask-ml Categorizer does not preserve order of categories for Pandas != 1.2

For Pandas<1.2, the Categorizer does not always preserve the order of categories (due to a bug in pd.Series.astype, see e.g. https://github.com/pandas-dev/pandas/issues/30206).

Example:

import pandas as pd
from dask_ml.preprocessing import Categorizer

X1 = pd.DataFrame({"x": pd.Categorical(["a"], categories=["a", "b"])})
X2 = pd.DataFrame({"x": pd.Categorical(["a"], categories=["b", "a"])})

categorizer = Categorizer().fit(X1)

categorizer.transform(X1)["x"].dtype
# > CategoricalDtype(categories=['a', 'b'], ordered=False)

categorizer.transform(X2)["x"].dtype
# > CategoricalDtype(categories=['b', 'a'], ordered=False)

For Pandas>=1.2, the above code snippet produces the same result for X1 and X2 (as we would expect).

This behavior is caused by this call to pd.Series.astype:

https://github.com/dask/dask-ml/blob/0ea276da1d78db582f40e1c256dfca4f70e6cbc6/dask_ml/preprocessing/data.py#L568

Pandas-only example:

x1 = pd.Series(pd.Categorical(["a"], categories=["a", "b"]))
x2 = pd.Series(pd.Categorical(["a"], categories=["b", "a"]))
x2.astype(x1.dtype)
# > 0    a
# > dtype: category
# > Categories (2, object): ['b', 'a']

Again, I would expect that astype enforces the order (but that only happens for pandas>=1.2).

Question

Is it worth fixing this for Pandas<1.2? This can cause issues for downstream estimators where the order of categories matters (e.g. because they're used for one-hot encoding of some sort). I would argue that Pandas<1.2 is still pretty common.
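
To make the downstream concern concrete, here is a small pandas-only sketch (the variable names are illustrative) showing that the same values get different integer codes, and a different dummy-column order, depending on the order of the categories:

```python
import pandas as pd

values = ["a", "b", "a"]

# Same values, same set of categories, different category order.
c1 = pd.Categorical(values, categories=["a", "b"])
c2 = pd.Categorical(values, categories=["b", "a"])

# The integer codes index into the categories, so they depend on the order.
print(c1.codes.tolist())  # [0, 1, 0]
print(c2.codes.tolist())  # [1, 0, 1]

# One-hot columns follow the category order as well.
print(pd.get_dummies(pd.Series(c1)).columns.tolist())  # ['a', 'b']
print(pd.get_dummies(pd.Series(c2)).columns.tolist())  # ['b', 'a']
```

An estimator that consumes the codes or the column positions directly would therefore see different inputs for X1 and X2.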

I'd be happy to contribute a fix.

Another question is if one should ever rely on the order of categories in Pandas categorical types...

Environment:

Dask version: 2021.4.0
Python version: 3.8.8
Operating System: osx
Install method (conda, pip, source): source (1.8.1.dev19+g0ea276da)

May 02 '21 16:05 jtilly

Actually:

With Pandas=1.2.4:

import pandas as pd
x1 = pd.Series(pd.Categorical(["a"], categories=["a", "b"]))
x2 = pd.Series(pd.Categorical(["a"], categories=["b", "a"]))
x2.astype(x1.dtype)
# > 0    a
# > dtype: category
# > Categories (2, object): ['a', 'b']

On the current Pandas master (526468):

# > 0    a
# > dtype: category
# > Categories (2, object): ['b', 'a']

Courtesy of https://github.com/pandas-dev/pandas/commit/ef349ca2ba28a1314e1bbdddacaf46be89ed430b

For the Categorizer, it would be great to enforce the order in transform(), so that downstream estimators can use the cat codes for doing their work.

May 02 '21 17:05 jtilly

> Another question is if one should ever rely on the order of categories in Pandas categorical types...

Perhaps be explicit about it and cast to ordered categorical in transform?
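
A rough sketch of that idea in plain pandas, assuming the target dtype is built with ordered=True (the names x2 and target are illustrative, and this is not what Categorizer currently does). As far as I understand, an ordered and an unordered dtype with explicit categories do not compare equal, so the cast has to go through the recoding path:

```python
import pandas as pd

# Series whose categories are stored in the "wrong" order.
x2 = pd.Series(pd.Categorical(["a"], categories=["b", "a"]))

# Hypothetical target dtype: same categories, but explicitly ordered.
target = pd.CategoricalDtype(categories=["a", "b"], ordered=True)

result = x2.astype(target)

# Inspect the resulting category order and the ordered flag.
print(result.cat.categories.tolist())
print(result.cat.ordered)
```

Note that an ordered categorical also changes semantics (comparisons and min/max become meaningful), so this would be a behavior change for users and not only a fix.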

May 02 '21 19:05 lbittarello

> Another question is if one should ever rely on the order of categories in Pandas categorical types...

Only if the categorical is ordered.

What does the proposed fixed behavior look like? I wouldn't want to do anything differently than pandas.

May 03 '21 01:05 TomAugspurger

> What does the proposed fixed behavior look like?

I would just write

X[k] = X[k].astype(dtype)
if not X[k].cat.categories.equals(dtype.categories):
    X[k] = X[k].cat.reorder_categories(dtype.categories)

instead of

https://github.com/dask/dask-ml/blob/0ea276da1d78db582f40e1c256dfca4f70e6cbc6/dask_ml/preprocessing/data.py#L568
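
For illustration, the same idea as a standalone helper. The function name cast_with_category_order is hypothetical and not part of dask-ml; the sketch uses only pandas and assumes a pandas Series, not a dask one:

```python
import pandas as pd


def cast_with_category_order(series, dtype):
    """Cast to a categorical dtype and enforce the dtype's category order.

    Hypothetical helper, not part of dask-ml or pandas.
    """
    out = series.astype(dtype)
    # On affected pandas versions, astype can keep the old category order
    # when both dtypes contain the same set of categories.
    if not out.cat.categories.equals(dtype.categories):
        out = out.cat.reorder_categories(dtype.categories)
    return out


x2 = pd.Series(pd.Categorical(["a"], categories=["b", "a"]))
target = pd.CategoricalDtype(categories=["a", "b"])

fixed = cast_with_category_order(x2, target)
print(fixed.cat.categories.tolist())  # ['a', 'b']
print(fixed.cat.codes.tolist())       # [0]
```

On pandas versions where astype already enforces the order, the equality check passes and the reorder step is skipped, so the result should be the same as a plain astype.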

May 03 '21 10:05 jtilly
