dask-ml
Bug in ColumnTransformer
I have a straightforward use case: ordinal-encode (label-encode) some columns, one-hot encode some columns, and pass through some columns of a pandas DataFrame, dropping the remainder.
Code:
from dask_ml.compose import ColumnTransformer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
df = pd.read_csv('path/to/csv')
ordinal_cols = [<list of ordinal columns>]
nominal_cols = [<list of nominal columns>]
passthrough_cols = [<list of passthrough columns>]
transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ordinal_cols),
    ("onehot_encoding", OneHotEncoder(), nominal_cols),
    ("select", "passthrough", passthrough_cols),
]
preprocessor = ColumnTransformer(transformers=transformers)
df_t = preprocessor.fit_transform(df)
This fails with the following traceback:
Traceback (most recent call last):
File ".../helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File ".../python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File ".../dask_testing.py", line 80, in <module>
df_t = preprocessor.fit_transform(df)
File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
File ".../lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
File ".../lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 750, in fit_transform
return self._hstack(list(Xs))
File ".../lib/python3.8/site-packages/dask_ml/compose/_column_transformer.py", line 198, in _hstack
return pd.concat(Xs, axis="columns")
File ".../lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 368, in concat
op = _Concatenator(
File ".../lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 458, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
On further debugging, the three steps in the transformer produce three different output types:
- OrdinalEncoder() gives a 2D NumPy array
- OneHotEncoder() gives a csr_matrix
- "passthrough" gives a DataFrame
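The three output types can be checked in isolation. The following is a minimal sketch using plain scikit-learn on a small made-up DataFrame; the column names and values are hypothetical stand-ins for the real CSV, and the encoders are used with their default settings as in the report above.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame(
    {
        "size": ["small", "large", "medium", "small"],
        "color": ["red", "blue", "green", "red"],
        "price": [1.0, 2.5, 3.0, 4.5],
    }
)

# Each step of the transformer list, run on its own.
ordinal_out = OrdinalEncoder().fit_transform(toy[["size"]])
onehot_out = OneHotEncoder().fit_transform(toy[["color"]])
passthrough_out = toy[["price"]]  # "passthrough" just selects the columns

for name, out in [
    ("ordinal", ordinal_out),
    ("onehot", onehot_out),
    ("passthrough", passthrough_out),
]:
    print(name, type(out).__name__, out.shape)
```

With default settings the ordinal step yields a dense array, the one-hot step a SciPy sparse matrix, and the passthrough step a pandas DataFrame, which is the mix that reaches `_hstack`.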
The point of failure in the dask-ml package is .../python3.8/site-packages/dask_ml/compose/_column_transformer.py,
line 198,
where it tries to concatenate the three different types into a single output DataFrame.
Code snippet:
elif self.preserve_dataframe and (pd.Series in types or pd.DataFrame in types):
    return pd.concat(Xs, axis="columns")
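The branch above is taken as soon as one of the pieces is a Series or DataFrame, even if the others are not. The sketch below reproduces the same TypeError with pandas alone and shows that the concatenation succeeds once every piece is wrapped in a DataFrame. The arrays and the column names `ord_0` and `ord_1` are made up for illustration, and this is not presented as the fix dask-ml should adopt.

```python
import numpy as np
import pandas as pd

# Hypothetical pieces: one ndarray (as from OrdinalEncoder) and one DataFrame
# (as from "passthrough").
array_part = np.arange(6, dtype=float).reshape(3, 2)
frame_part = pd.DataFrame({"price": [1.0, 2.5, 3.0]})

# Mixing the two in pd.concat raises the TypeError seen in the traceback.
try:
    pd.concat([array_part, frame_part], axis="columns")
    concat_error = None
except TypeError as exc:
    concat_error = str(exc)
print(concat_error)

# Wrapping the array in a DataFrame with a matching index lets concat succeed.
wrapped = pd.DataFrame(
    array_part, index=frame_part.index, columns=["ord_0", "ord_1"]
)
combined = pd.concat([wrapped, frame_part], axis="columns")
print(combined.shape)
```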
Anything else we need to know?: My data has shape (1000, 1076). I am label encoding 109 columns, one-hot encoding 1 column, and passing through the rest of the columns.
I do not want to use the remainder="passthrough" parameter; I want to list the passthrough columns in the transformers list.
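For comparison, here is a sketch of the same transformer list, with the explicit "passthrough" entry, run through scikit-learn's own ColumnTransformer on a made-up in-memory DataFrame. It assumes the data fits in memory as a pandas DataFrame (as in the report) and is a possible workaround only, not a fix for dask_ml.compose.ColumnTransformer. The column names are hypothetical, and sparse_threshold=0 is set here only to force a dense result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer  # scikit-learn's version, not dask-ml's
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame(
    {
        "size": ["small", "large", "medium", "small"],
        "color": ["red", "blue", "green", "red"],
        "price": [1.0, 2.5, 3.0, 4.5],
        "unused": [0, 1, 2, 3],  # not listed anywhere, so it is dropped
    }
)

transformers = [
    ("ordinal_encoding", OrdinalEncoder(), ["size"]),
    ("onehot_encoding", OneHotEncoder(), ["color"]),
    ("select", "passthrough", ["price"]),
]

# remainder defaults to "drop"; sparse_threshold=0 always returns a dense array.
preprocessor = ColumnTransformer(transformers=transformers, sparse_threshold=0)
toy_t = preprocessor.fit_transform(toy)
print(type(toy_t).__name__, toy_t.shape)
```

The result has one ordinal column, three one-hot columns, and one passthrough column; the `unused` column is dropped because it appears in no transformer.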
Environment:
- Dask version:
dask 2023.1.0
dask-glm 0.2.0
dask-ml 2022.5.27
- Python version: 3.8
- Operating System: MacOS
- Install method (conda, pip, source): pip