dask-ml
Categorizer does not preserve order of categories for Pandas != 1.2
For Pandas<1.2, the Categorizer does not always preserve the order of categories, due to a bug in pd.Series.astype (see e.g. https://github.com/pandas-dev/pandas/issues/30206).
Example:
import pandas as pd
from dask_ml.preprocessing import Categorizer
X1 = pd.DataFrame({"x": pd.Categorical(["a"], categories=["a", "b"])})
X2 = pd.DataFrame({"x": pd.Categorical(["a"], categories=["b", "a"])})
categorizer = Categorizer().fit(X1)
categorizer.transform(X1)["x"].dtype
# > CategoricalDtype(categories=['a', 'b'], ordered=False)
categorizer.transform(X2)["x"].dtype
# > CategoricalDtype(categories=['b', 'a'], ordered=False)
For Pandas>=1.2, the above code snippet produces the same result for X1 and X2 (as we would expect).
This behavior is caused by this call to pd.Series.astype:
https://github.com/dask/dask-ml/blob/0ea276da1d78db582f40e1c256dfca4f70e6cbc6/dask_ml/preprocessing/data.py#L568
Pandas-only example:
x1 = pd.Series(pd.Categorical(["a"], categories=["a", "b"]))
x2 = pd.Series(pd.Categorical(["a"], categories=["b", "a"]))
x2.astype(x1.dtype)
# > 0 a
# > dtype: category
# > Categories (2, object): ['b', 'a']
Again, I would expect that astype enforces the order (but that only happens for pandas>=1.2).
Question
Is it worth fixing this for Pandas<1.2? This can cause issues for downstream estimators where the order of categories matters (e.g. because they're used for one-hot encoding of some sort). I would argue that Pandas<1.2 is still pretty common.
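To make the downstream concern concrete, here is a minimal pandas-only sketch (the variable names are made up for illustration, they are not from dask-ml). It shows that the integer codes depend on the category order, so two columns holding identical values can be encoded differently:

```python
import pandas as pd

# Same values, same set of categories, different category order.
x1 = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"]))
x2 = pd.Series(pd.Categorical(["a", "b"], categories=["b", "a"]))

# The codes are positions in the categories index, so they differ.
codes1 = x1.cat.codes.tolist()  # [0, 1]
codes2 = x2.cat.codes.tolist()  # [1, 0]

# The values themselves are still identical.
same_values = x1.tolist() == x2.tolist()  # True
```

Any estimator that consumes the codes directly would therefore see different inputs for X1 and X2 unless the order is enforced first.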
I'd be happy to contribute a fix.
Another question is if one should ever rely on the order of categories in Pandas categorical types...
Environment:
- Dask version: 2021.4.0
- Python version: 3.8.8
- Operating System: osx
- Install method (conda, pip, source): source (1.8.1.dev19+g0ea276da)
Actually:
With Pandas=1.2.4:
import pandas as pd
x1 = pd.Series(pd.Categorical(["a"], categories=["a", "b"]))
x2 = pd.Series(pd.Categorical(["a"], categories=["b", "a"]))
x2.astype(x1.dtype)
# > 0 a
# > dtype: category
# > Categories (2, object): ['a', 'b']
On the current Pandas master (526468):
# > 0 a
# > dtype: category
# > Categories (2, object): ['b', 'a']
Courtesy of https://github.com/pandas-dev/pandas/commit/ef349ca2ba28a1314e1bbdddacaf46be89ed430b
For the Categorizer, it would be great to enforce the order in transform(), so that downstream estimators can use the cat codes for doing their work.
Another question is if one should ever rely on the order of categories in Pandas categorical types...
Perhaps be explicit about it and cast to ordered categorical in transform?
Another question is if one should ever rely on the order of categories in Pandas categorical types...
Only if the categorical is ordered.
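For reference, a minimal sketch of what an ordered categorical looks like in plain pandas. This only illustrates the suggestion above and is not what Categorizer currently does; `ordered_dtype` is a name made up here:

```python
import pandas as pd

# An explicitly ordered dtype: "a" < "b" is part of the type itself.
ordered_dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)

# Casting plain strings to the ordered dtype.
x = pd.Series(["b", "a"]).astype(ordered_dtype)

is_ordered = x.cat.ordered        # True
codes = x.cat.codes.tolist()      # [1, 0], positions in ["a", "b"]
below_b = (x < "b").tolist()      # [False, True], comparisons follow the order
```

With ordered=True the category order carries meaning for comparisons, which is the case where relying on it is clearly intended.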
What does the proposed fixed behavior look like? I wouldn't want to do anything differently than pandas.
What does the proposed fixed behavior look like?
I would just write
X[k] = X[k].astype(dtype)
if not X[k].cat.categories.equals(dtype.categories):
    X[k] = X[k].cat.reorder_categories(dtype.categories)
instead of
https://github.com/dask/dask-ml/blob/0ea276da1d78db582f40e1c256dfca4f70e6cbc6/dask_ml/preprocessing/data.py#L568
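Pulled out into a standalone function, the idea could be sketched like this. `align_categories` is a hypothetical helper name, not an existing dask-ml or pandas function, and it assumes the column already holds the same set of categories as the target dtype:

```python
import pandas as pd


def align_categories(series: pd.Series, dtype: pd.CategoricalDtype) -> pd.Series:
    """Cast to `dtype` and make sure the category order matches it.

    Hypothetical helper for illustration only.
    """
    out = series.astype(dtype)
    # Index.equals is order-sensitive, so this detects a mismatched order
    # on pandas versions where astype keeps the original order.
    if not out.cat.categories.equals(dtype.categories):
        out = out.cat.reorder_categories(dtype.categories)
    return out


target = pd.CategoricalDtype(categories=["a", "b"])
x2 = pd.Series(pd.Categorical(["a", "b", "a"], categories=["b", "a"]))

aligned = align_categories(x2, target)
categories = list(aligned.cat.categories)  # ['a', 'b']
codes = aligned.cat.codes.tolist()         # [0, 1, 0]
```

On pandas versions where astype already reorders, the check is a no-op, so the behavior should be the same across versions.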