sklearn-pandas
LabelEncoder + Imputer + LabelBinarizer error
Hi,
I'm getting an error when using LabelEncoder + Imputer + LabelBinarizer in a mapper. The output of LabelEncoder is a vector of shape (n_samples,). Imputer calls the sklearn function check_array, which calls the numpy function atleast_2d and turns that vector into shape (1, n_samples), so LabelBinarizer crashes:
ValueError: Multioutput target data is not supported with label binarization
How can I fix this issue?
Many thanks!
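For anyone trying to picture the shape problem described above, here is a minimal numpy-only sketch. The array values are made up and only stand in for what LabelEncoder would return; the point is the difference between the row that atleast_2d produces and the column that downstream transformers expect.

```python
import numpy as np

# Hypothetical stand-in for LabelEncoder output: a 1-d vector of shape (n_samples,)
encoded = np.array([0, 2, 1, 2])

# np.atleast_2d prepends an axis, giving one row with n_samples columns
as_row = np.atleast_2d(encoded)

# Reshaping to a column keeps one sample per row, which is the 2-d layout
# that the later transformers would need
as_column = encoded.reshape(-1, 1)

print(encoded.shape, as_row.shape, as_column.shape)
```

The row form looks like a single sample with n_samples outputs, which is presumably why LabelBinarizer reports multioutput target data.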
My recommendation here is to create a subclass of LabelEncoder that transforms the output to a 2-d vector (n_samples, 1) in the proper conditions so all transformers are of the same type and compatible.
If you come up with that implementation please post it here in a PR as it is suitable to be included with sklearn-pandas.
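A rough sketch of what such a subclass could look like is below. The class name ColumnLabelEncoder is hypothetical and not part of sklearn or sklearn-pandas, and it always returns a column rather than doing so only "in the proper conditions", so treat it as a starting point rather than a finished implementation.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


class ColumnLabelEncoder(LabelEncoder):
    """Hypothetical LabelEncoder variant that returns an (n_samples, 1) column."""

    def fit(self, X, y=None):
        # Accept 1-d input or a single 2-d column and flatten it for LabelEncoder
        super().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        # Encode as usual, then reshape the 1-d result into a single column
        return super().transform(np.asarray(X).ravel()).reshape(-1, 1)

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        # Flatten the column back to 1-d before decoding
        return super().inverse_transform(np.asarray(X).ravel())


encoder = ColumnLabelEncoder()
column = encoder.fit_transform(["b", "a", "b"])
print(column.shape)
```

Accepting an optional y in fit and fit_transform is a design choice made here so the class can sit in a list of transformers that are called with (X, y); whether the mapper actually needs that is an assumption.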
@paubelsan @dukebody If this enhancement is still relevant and nobody is working on it right now, I could give it a try. I am not sure how to reproduce the issue, though: I get a different exception when applying the sequence [LabelEncoder(), Imputer(), LabelBinarizer()], namely:
ValueError: col: Expected 2D array, got 1D array instead
And it is raised not at the LabelBinarizer step, but while imputing values.
@devforfu I guess that @paubelsan might have different versions of numpy/pandas, but the issue looks the same to me: LabelEncoder() returns a 1-d vector, while other transformers expect 2-d vectors.
I kind of remember some conversations about creating a transformer ([CategoricalEncoder](http://contrib.scikit-learn.org/categorical-encoding/)?) in sklearn to do what LabelEncoder() does but generating 2-d vectors, for arbitrary 2-d data, including strings. I'd check this linked project and, if it doesn't fit what we want, then we can implement our own version.
Since this problem is likely still encountered by many, it may be good to note here that in the dev-0.20 version of sklearn, OneHotEncoder directly supports categorical inputs without using LabelEncoder. I think this mostly resolves the issues regarding encoding using sklearn-pandas.
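To illustrate the point about OneHotEncoder, here is a small sketch, assuming sklearn 0.20 or later. The colour values are invented for the example; the input is a 2-d column of strings passed straight to the encoder, with no LabelEncoder step in between.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column, already 2-d: one row per sample
X = np.array([["red"], ["green"], ["red"]], dtype=object)

# Strings are encoded directly; unknown categories seen at transform time
# are encoded as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a sparse matrix, so convert it for inspection
encoded = encoder.fit_transform(X).toarray()

print(encoder.categories_)
print(encoded)
```

Because the encoder works on 2-d input from the start, the 1-d versus 2-d mismatch discussed earlier in this thread does not come up.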
I'm not sure if anyone is still experiencing this problem in light of recent updates to sklearn, but if you have a list of categorical column names (categorical) for your DataFrame (df), you can do something like:
DataFrameMapper([(c, LabelBinarizer()) for c in categorical] + [(n, None) for n in df.columns if n not in categorical])
Hopefully this can be helpful to somebody.