LabelEncoder raises errors with string and string[pyarrow] types
Describe the issue:
Fitting a LabelEncoder on columns with string and string[pyarrow] dtypes raises a TypeError. The error is raised inside Dask itself:
File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/utils.py", line 57, in meta_from_array
x = x(shape=(0,) * (ndim or 0), dtype=dtype)
TypeError: Cannot interpret 'string[pyarrow]' as a data type
...But I can't trigger this without the LabelEncoder.
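The failure seems to reduce to NumPy refusing to construct an array from a pandas extension dtype, which is what `meta_from_array` attempts via `x(shape=(0,) * (ndim or 0), dtype=dtype)`. A minimal sketch of that root cause (using `CategoricalDtype` here only so it reproduces without pyarrow installed; `string[pyarrow]` fails the same way):

```python
import numpy as np
import pandas as pd

# dask's meta_from_array tries to build an empty array with the column's
# dtype; NumPy cannot interpret pandas extension dtypes as numpy dtypes.
try:
    np.empty(shape=(0,), dtype=pd.CategoricalDtype())
except TypeError as err:
    # e.g. "Cannot interpret 'CategoricalDtype(...)' as a data type"
    print(err)
```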
Minimal Complete Verifiable Example:
import dask.dataframe as dd
from dask_ml.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({
"A": list("aaaabbbcccdddeeefffgggg")
})
ddf = dd.from_pandas(df, npartitions=3)
# works
lenc = LabelEncoder().fit(ddf["A"])
# TypeError: Cannot interpret 'string[pyarrow]' as a data type
lenc = LabelEncoder().fit(ddf["A"].astype("string[pyarrow]"))
Full Traceback:
The tracebacks for string and string[pyarrow] are very similar; the string[pyarrow] one is shown here:
Traceback (most recent call last):
  File "/Users/paul/work/sources/dask-engineering/example-pipelines/criteo-HPO/label_encoder_repro.py", line 12, in <module>
    lenc = LabelEncoder().fit(ddf["A"].astype("string[pyarrow]"))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 115, in fit
    y = self._check_array(y)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 111, in _check_array
    y = y.to_dask_array(lengths=True)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 1687, in to_dask_array
    arr = self.values
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 3431, in values
    return self.map_partitions(methods.values)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 874, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 6701, in map_partitions
    return new_dd_object(graph, name, meta, divisions)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 7835, in new_dd_object
    return da.Array(dsk, name=name, chunks=chunks, dtype=meta.dtype)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/core.py", line 1335, in __new__
    meta = meta_from_array(meta, dtype=dtype)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/utils.py", line 57, in meta_from_array
    x = x(shape=(0,) * (ndim or 0), dtype=dtype)
TypeError: Cannot interpret 'string[pyarrow]' as a data type
Environment:
- Dask version: 2022.12.0
- Python version: 3.10
- Operating System: M1 Mac
- Install method (conda, pip, source): conda
Thanks for reporting @phobson! This looks like a known issue over in Dask, where we're not able to hand off a Dask DataFrame that uses pandas extension dtypes to a Dask Array (xref https://github.com/dask/dask/issues/9401, https://github.com/dask/dask/issues/5001).
As a side note, I think scikit-learn is doing a better job of returning pandas outputs if pandas objects are used as inputs. I wonder if dask-ml could do something similar to avoid the need for a Dask DataFrame -> Dask Array conversion. cc @mmccarty for visibility
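Until the upstream Dask issues are resolved, one possible workaround (my assumption, not an official fix) is to cast the extension-dtype column back to object before fitting, e.g. `LabelEncoder().fit(ddf["A"].astype(object))`, since object-dtype strings convert to a plain NumPy array without tripping the meta machinery. The pandas-level equivalent:

```python
import pandas as pd

# A string extension-dtype Series, as in the report (the pyarrow-backed
# variant behaves the same way for this purpose).
s = pd.Series(list("aaabbbccc"), dtype="string")

# Casting back to object yields an ordinary NumPy object array, which
# the DataFrame -> Array conversion can handle.
arr = s.astype(object).to_numpy()
print(arr.dtype)  # object
```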