
LabelEncoder + Imputer + LabelBinarizer error

Open paubelsan opened this issue 7 years ago • 5 comments

Hi,

I'm getting an error when using LabelEncoder + Imputer + LabelBinarizer in a mapper. LabelEncoder outputs a vector of shape (n_samples,). Imputer calls the sklearn function check_array, which calls the numpy function atleast_2d and turns that vector into shape (1, n_samples), so LabelBinarizer crashes:

ValueError: Multioutput target data is not supported with label binarization
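
The shape change described above can be reproduced without the mapper. This is only a minimal sketch: the sample labels are made up, and it mirrors the atleast_2d step by calling numpy directly rather than going through Imputer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up sample labels, for illustration only.
labels = ["cat", "dog", "cat", "bird"]

# LabelEncoder returns a 1-d array of shape (n_samples,).
encoded = LabelEncoder().fit_transform(labels)

# np.atleast_2d prepends an axis, so the samples end up along the columns.
as_row = np.atleast_2d(encoded)

# What the downstream transformers actually need: one column, n_samples rows.
as_column = encoded.reshape(-1, 1)

print(encoded.shape, as_row.shape, as_column.shape)
```

The (1, n_samples) array looks like a single sample with n_samples outputs, which is why the next step reports multioutput data.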

How can I fix this issue?

Many thanks!

paubelsan avatar May 15 '17 16:05 paubelsan

My recommendation here is to create a subclass of LabelEncoder that reshapes its output to a 2-d array of shape (n_samples, 1), so that all transformers in the chain produce the same kind of output and are compatible.

If you come up with that implementation, please submit it as a PR and link it here, as it would be suitable for inclusion in sklearn-pandas.
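
A rough sketch of what such a subclass could look like. The class name ColumnLabelEncoder is hypothetical and this is not code from sklearn-pandas; it only flattens the input and reshapes the output, and it does not handle missing values.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


class ColumnLabelEncoder(LabelEncoder):
    """Hypothetical LabelEncoder variant that returns shape (n_samples, 1)."""

    def fit(self, y):
        # Accept either (n_samples,) or (n_samples, 1) input.
        return super().fit(np.asarray(y).ravel())

    def transform(self, y):
        codes = super().transform(np.asarray(y).ravel())
        return codes.reshape(-1, 1)  # column vector for 2-d transformers

    def fit_transform(self, y):
        codes = super().fit_transform(np.asarray(y).ravel())
        return codes.reshape(-1, 1)

    def inverse_transform(self, y):
        # Flatten the column back to 1-d before decoding.
        return super().inverse_transform(np.asarray(y).ravel())


enc = ColumnLabelEncoder()
codes = enc.fit_transform(["b", "a", "b"])
restored = enc.inverse_transform(codes)
print(codes.shape, list(restored))
```

With a column-shaped output, a following transformer that expects 2-d input sees n_samples rows instead of a single row.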

dukebody avatar Jun 10 '17 16:06 dukebody

@paubelsan @dukebody If this enhancement is still relevant and nobody is working on it right now, I could give it a try. I'm not sure how to reproduce the issue, though: I get a different exception when applying the sequence [LabelEncoder(), Imputer(), LabelBinarizer()], namely:

ValueError: col: Expected 2D array, got 1D array instead

It is raised not at the LabelBinarizer step but while imputing values.

devforfu avatar Nov 12 '17 06:11 devforfu

@devforfu I guess that @paubelsan might have different versions of numpy/pandas, but the issue looks the same to me: LabelEncoder() returns a 1-d vector, while other transformers expect 2-d vectors.

I kind of remember some conversations about creating a transformer ([CategoricalEncoder](http://contrib.scikit-learn.org/categorical-encoding/)?) in sklearn to do what LabelEncoder() does but generating 2-d vectors, for arbitrary 2-d data, including strings. I'd check this linked project and, if it doesn't fit what we want, then we can implement our own version.

dukebody avatar Dec 23 '17 16:12 dukebody

Since this is a problem that many people are likely still running into, it may be good to note here that in the dev-0.20 version of sklearn, OneHotEncoder directly supports categorical inputs without using LabelEncoder. I think this resolves most of the encoding issues with sklearn-pandas.
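
A small sketch of that approach, assuming scikit-learn 0.20 or later. The colour values are made up, and the default sparse output is converted with toarray() only to make the result easy to inspect.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up string column, already 2-d with shape (n_samples, 1).
X = np.array([["red"], ["green"], ["red"]], dtype=object)

# No LabelEncoder step: OneHotEncoder accepts the strings directly.
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(X).toarray()

# Categories are sorted, so the columns correspond to ["green", "red"].
categories = enc.categories_[0].tolist()
print(categories)
print(onehot)
```

Because the input and output are both 2-d, the shape mismatch discussed in this thread does not come up.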

FlorisHoogenboom avatar Aug 06 '18 09:08 FlorisHoogenboom

I'm not sure if anyone is still experiencing this problem in light of recent updates to sklearn, but if you have a list of categorical column names (categorical below), you can do something like:

DataFrameMapper([(c, LabelBinarizer()) for c in categorical] + [(n, None) for n in df.columns if n not in categorical])

Hopefully this is helpful to somebody.

nabaskes avatar Oct 13 '18 20:10 nabaskes