LabelEncoder doesn't handle missing values in *dask* series of strings
Describe the issue:
Fitting dask-ml's LabelEncoder on a dask series of strings with missing values (as np.nan) raises a TypeError because "<" is undefined between floats and strings.
scikit-learn's encoder seems to handle this well for both pandas and dask series. dask-ml's encoder seems to handle it well for a pandas series, so only the dask-ml encoder on a dask series fails.
Minimal Complete Verifiable Example:
import dask.dataframe as dd
from dask_ml.preprocessing import LabelEncoder as dask_le
from sklearn.preprocessing import LabelEncoder as skl_le
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "A": list("aaaabbbcccdddeeefffgggg")
})
df.loc[[0, 2, 5, 10, 21], "A"] = np.nan
ddf = dd.from_pandas(df, npartitions=3)
# works
lenc = skl_le().fit(df["A"])
lenc = skl_le().fit(ddf["A"])
lenc = dask_le().fit(df["A"])
# fails
lenc = dask_le().fit(ddf["A"])
# but also works
lenc = dask_le().fit(ddf["A"].fillna(""))
Full Traceback:
➜ python label_encoder_repro.py
Traceback (most recent call last):
  File "/Users/paul/work/sources/dask-engineering/example-pipelines/criteo-HPO/label_encoder_repro.py", line 21, in <module>
    lenc = dask_le().fit(ddf["A"])
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 119, in fit
    self.classes_ = classes_.compute()
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/utils.py", line 71, in apply
    return func(*args, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/routines.py", line 1626, in _unique_internal
    u = np.unique(ar)
  File "<__array_function__ internals>", line 180, in unique
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 274, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts,
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 336, in _unique1d
    ar.sort()
TypeError: '
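The last frames show np.unique calling ar.sort() on the partition data. The same failure can likely be reproduced without dask, assuming each partition reaches np.unique as an object array that mixes str values with float np.nan. The sample array and the filtering step below are illustrative only and are not taken from dask-ml's code:

```python
import numpy as np

# Object array mixing strings with np.nan, similar to one partition of ddf["A"].
values = np.array(["a", np.nan, "b", "a"], dtype=object)

try:
    # np.unique sorts its input, which compares a float (nan) with a str.
    np.unique(values)
    sort_failed = False
except TypeError as exc:
    sort_failed = True
    print(f"TypeError: {exc}")

# Hypothetical workaround for illustration: keep only the string entries
# before computing the unique classes.
is_string = np.array([isinstance(v, str) for v in values])
classes = np.unique(values[is_string])
print(classes)
```

This is consistent with the fillna("") call in the example working: once every entry is a string, the sort no longer compares mixed types.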
Environment:
- Dask version: 2022.12.0
- Python version: 3.10
- Operating System: M1 Mac
- Install method (conda, pip, source): conda
@phobson Hello, can I work on this issue, "LabelEncoder doesn't handle missing values in dask series of strings #954"?
@DuanBoomer I'd be happy to review a PR. Thanks for volunteering. Note that I'll be largely away from my computer this week through the New Year. So if my response time is slow, I haven't forgotten about you.
@phobson The PR will be submitted by Sunday if that's okay with you. Today is Monday.