LabelEncoder raises errors with string and string[pyarrow] types
Describe the issue:
Fitting a LabelEncoder on columns with string and string[pyarrow] dtypes raises a TypeError. The error is raised inside Dask itself:
File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/utils.py", line 57, in meta_from_array
x = x(shape=(0,) * (ndim or 0), dtype=dtype)
TypeError: Cannot interpret 'string[pyarrow]' as a data type
...But I can't trigger this without the LabelEncoder.
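The failure seems to reduce to NumPy refusing to construct an array from a pandas extension dtype, which is what `meta_from_array` attempts via `x(shape=(0,) * (ndim or 0), dtype=dtype)`. A minimal sketch of that root cause (using `CategoricalDtype` here only so it reproduces without pyarrow installed; `string[pyarrow]` fails the same way):

```python
import numpy as np
import pandas as pd

# dask's meta_from_array tries to build an empty array with the column's
# dtype; NumPy cannot interpret pandas extension dtypes as numpy dtypes.
try:
    np.empty(shape=(0,), dtype=pd.CategoricalDtype())
except TypeError as err:
    # e.g. "Cannot interpret 'CategoricalDtype(...)' as a data type"
    print(err)
```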
Minimal Complete Verifiable Example:
import dask.dataframe as dd
from dask_ml.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({
"A": list("aaaabbbcccdddeeefffgggg")
})
ddf = dd.from_pandas(df, npartitions=3)
# works
lenc = LabelEncoder().fit(ddf["A"])
# TypeError: Cannot interpret 'string[pyarrow]' as a data type
lenc = LabelEncoder().fit(ddf["A"].astype("string[pyarrow]"))
Full Traceback:
The tracebacks for string and string[pyarrow] are very similar; the string[pyarrow] one is shown here:
Traceback (most recent call last):
  File "/Users/paul/work/sources/dask-engineering/example-pipelines/criteo-HPO/label_encoder_repro.py", line 12, in <module>
    lenc = LabelEncoder().fit(ddf["A"].astype("string[pyarrow]"))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 115, in fit
    y = self._check_array(y)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 111, in _check_array
    y = y.to_dask_array(lengths=True)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 1687, in to_dask_array
    arr = self.values
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 3431, in values
    return self.map_partitions(methods.values)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 874, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 6701, in map_partitions
    return new_dd_object(graph, name, meta, divisions)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/dataframe/core.py", line 7835, in new_dd_object
    return da.Array(dsk, name=name, chunks=chunks, dtype=meta.dtype)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/core.py", line 1335, in __new__
    meta = meta_from_array(meta, dtype=dtype)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/utils.py", line 57, in meta_from_array
    x = x(shape=(0,) * (ndim or 0), dtype=dtype)
TypeError: Cannot interpret 'string[pyarrow]' as a data type
Environment:
- Dask version: 2022.12.0
- Python version: 3.10
- Operating System: M1 Mac
- Install method (conda, pip, source): conda
Thanks for reporting @phobson! This looks like a known issue over in Dask, where we're not able to hand off a Dask DataFrame that uses pandas extension dtypes to a Dask Array (xref https://github.com/dask/dask/issues/9401, https://github.com/dask/dask/issues/5001).
As a side note, I think scikit-learn is doing a better job of returning pandas outputs if pandas objects are used as inputs. I wonder if dask-ml could do something similar to avoid the need for a Dask DataFrame -> Dask Array conversion. cc @mmccarty for visibility
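Until the upstream Dask issues are resolved, one possible workaround (my assumption, not an official fix) is to cast the extension-dtype column back to object before fitting, e.g. `LabelEncoder().fit(ddf["A"].astype(object))`, since object-dtype strings convert to a plain NumPy array without tripping the meta machinery. The pandas-level equivalent:

```python
import pandas as pd

# A string extension-dtype Series, as in the report (the pyarrow-backed
# variant behaves the same way for this purpose).
s = pd.Series(list("aaabbbccc"), dtype="string")

# Casting back to object yields an ordinary NumPy object array, which
# the DataFrame -> Array conversion can handle.
arr = s.astype(object).to_numpy()
print(arr.dtype)  # object
```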