sklearn-pandas

Add DataFrameMapper.get_feature_names (wrapper for transformed_features_)

Open molaxx opened this issue 8 years ago • 13 comments

As this function is more or less the de facto standard in sklearn (implemented in FeatureUnion, CountVectorizer, PolynomialFeatures, DictVectorizer), adding it would reduce friction when using DataFrameMapper.
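A minimal sketch of what such a wrapper could look like, assuming the mapper stores its output column names in `transformed_features_` as the issue title says. `ToyMapper` is a hypothetical stand-in written for illustration, not the real `DataFrameMapper`, and recording the names inside `transform` is an assumption of this sketch.

```python
class ToyMapper:
    """Hypothetical stand-in for a mapper; not the real DataFrameMapper."""

    def __init__(self, columns):
        # Input column names; this toy passes them through unchanged.
        self.columns = list(columns)

    def fit(self, rows):
        return self

    def transform(self, rows):
        # Assumption: output names are recorded when transform runs.
        self.transformed_features_ = list(self.columns)
        return [[row[c] for c in self.columns] for row in rows]

    def get_feature_names(self):
        # Thin wrapper exposing the recorded names under the method name
        # that FeatureUnion, CountVectorizer and others use.
        if not hasattr(self, "transformed_features_"):
            raise AttributeError("Call transform() before get_feature_names().")
        return list(self.transformed_features_)


mapper = ToyMapper(["age", "height"])
mapper.fit([{"age": 30, "height": 1.8}])
mapper.transform([{"age": 30, "height": 1.8}])
names = mapper.get_feature_names()  # ['age', 'height']
```

The wrapper returns a copy so that callers cannot mutate the mapper's internal list by accident.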

molaxx avatar Jul 04 '17 15:07 molaxx

I think it's a good idea.

The only consideration is that transformed_features_ is filled when DataFrameMapper.transform is called and it could change after every subsequent transform call.

In contrast, sklearn's get_feature_names() only needs the transformer to be fitted, and the feature names don't change after a transform call.
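A toy illustration of that concern, not the library's code: when names are derived from the data seen in each `transform` call, two calls on different data produce different names. `DataDrivenOneHot` and its naming scheme are hypothetical.

```python
class DataDrivenOneHot:
    """Toy encoder whose output names come from the data it transforms."""

    def __init__(self, column):
        self.column = column

    def transform(self, values):
        # Names are rebuilt from whatever labels appear in this call,
        # so they are not tied to what was seen during fitting.
        labels = sorted(set(values))
        self.transformed_features_ = [f"{self.column}_{label}" for label in labels]
        return [[int(v == label) for label in labels] for v in values]


enc = DataDrivenOneHot("color")
enc.transform(["red", "blue"])
names_first = list(enc.transformed_features_)   # ['color_blue', 'color_red']
enc.transform(["red", "blue", "green"])
names_second = list(enc.transformed_features_)  # ['color_blue', 'color_green', 'color_red']
```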

arnau126 avatar Jul 12 '17 10:07 arnau126

@molaxx thanks for the suggestion, I also think it's a good idea, with the caveat that @arnau126 mentions. Can you submit a PR with the implementation and some modifications to the README to indicate the availability of the method? Thanks.

dukebody avatar Jul 29 '17 08:07 dukebody

@arnau126 this behavior seems like a bug. Why would we want the features to change after each transform? Furthermore, after fitting and pickling a model, the feature set used for training is lost. I think this logic should move to DataFrameMapper.fit. What do you think?

molaxx avatar Aug 06 '17 16:08 molaxx

Currently it's not possible to move this logic to fit because get_names (the function used to build transformed_features_) needs the transformed columns.

And it needs them because:

- not all transformers have get_feature_names() or classes_.
- sometimes classes_ doesn't contain the feature names.

https://github.com/pandas-dev/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L241

arnau126 avatar Aug 10 '17 15:08 arnau126

Couldn't this be handled by transforming a single row after the fitting? It's a bit hacky, but not having feature names after a fit is a bit surprising. I like boring APIs :) Plus, I'll probably end up doing this manually so the feature names would be available right after unpickling the model.

molaxx avatar Aug 13 '17 21:08 molaxx

@molaxx I don't like the idea of transforming just one row to be able to get the feature names, it's too hacky. I understand it can be surprising that one needs to transform the data to be able to get the column names, but this is due to the complex nature of the custom transformers.

What we can do is try to get these from the last transformer for each column during fit, like FeatureUnion does (https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/pipeline.py#L684), and fail if they cannot be extracted, with a message indicating that one has to transform first to get inferred column names in that case.

Are you up for submitting a PR for such a feature?
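The fit-time approach proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `names_after_fit`, the `column_name` naming scheme, and the two stub transformers are all hypothetical.

```python
def names_after_fit(column, transformer):
    """Derive output names from a fitted transformer, or fail with a clear message."""
    if hasattr(transformer, "get_feature_names"):
        return [f"{column}_{name}" for name in transformer.get_feature_names()]
    if hasattr(transformer, "classes_"):
        return [f"{column}_{cls}" for cls in transformer.classes_]
    # Nothing to inspect: tell the user to fall back to transform-time names.
    raise AttributeError(
        f"Cannot infer feature names for column {column!r} after fit; "
        "call transform() first to get inferred column names."
    )


class FittedBinarizerStub:
    """Stub that mimics a fitted transformer exposing classes_."""
    classes_ = ["cat", "dog"]


class OpaqueStub:
    """Stub with neither get_feature_names nor classes_."""


pet_names = names_after_fit("pet", FittedBinarizerStub())  # ['pet_cat', 'pet_dog']

try:
    names_after_fit("age", OpaqueStub())
    failed = False
except AttributeError:
    failed = True
```

Note that this sketch does not cover the case raised earlier in the thread where `classes_` exists but does not hold the feature names.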

dukebody avatar Aug 20 '17 11:08 dukebody

OK, sounds good. I'll find time to work on it. Can you point me to a transformer that does not allow getting feature names post-fit so I can test my solution?


molaxx avatar Aug 20 '17 21:08 molaxx

I believe that any transformer that doesn't have a classes_ attribute or a get_feature_names method, which are the ones that _get_feature_names leverages, will work for the test. You can create a mock one in the tests.
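A mock along those lines might look like the sketch below. `NoNamesTransformer` is a hypothetical name; the class only follows the usual `fit`/`transform` convention and deliberately defines neither `classes_` nor `get_feature_names`.

```python
class NoNamesTransformer:
    """Mock transformer exposing neither classes_ nor get_feature_names."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Identity transform; only the missing attributes matter for the test.
        return X


mock = NoNamesTransformer().fit([[1], [2]])
has_names_api = hasattr(mock, "classes_") or hasattr(mock, "get_feature_names")
```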

dukebody avatar Sep 03 '17 14:09 dukebody

@molaxx OneHotEncoder is an example of a transformer without classes_ or get_feature_names(). See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

zouzias avatar Feb 05 '18 08:02 zouzias

> @arnau126 this behavior seems like a bug. Why would we want the features to change after each transform? Furthermore, after fitting and pickling a model, the feature set used for training is lost. I think this logic should move to DataFrameMapper.fit. What do you think?

Just popping in to say that I just spent ages trying to debug different numbers of columns in my training and test sets, because it turns out the test set had an extra label in a column that was being one-hot encoded. It would have been way easier to have some exception along the lines of "Number of columns from feature <feature> after transformation does not match last call" or something like that.
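A rough sketch of the kind of check described here, assuming the transformed output for one feature is available as a list of rows. `ColumnCountGuard` is a hypothetical helper, not something sklearn-pandas provides.

```python
class ColumnCountGuard:
    """Hypothetical helper: remembers output width per feature and flags changes."""

    def __init__(self):
        self._last_widths = {}

    def check(self, feature, transformed_rows):
        if not transformed_rows:
            # Nothing to measure; keep the previously recorded width.
            return self._last_widths.get(feature)
        width = len(transformed_rows[0])
        previous = self._last_widths.get(feature)
        if previous is not None and previous != width:
            raise ValueError(
                f"Number of columns from feature {feature!r} after transformation "
                f"({width}) does not match last call ({previous})"
            )
        self._last_widths[feature] = width
        return width


guard = ColumnCountGuard()
guard.check("color", [[1, 0], [0, 1]])      # training set: two one-hot columns

try:
    guard.check("color", [[1, 0, 0]])       # test set: an extra label appeared
    message = None
except ValueError as exc:
    message = str(exc)
```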

JohnPaton avatar Mar 28 '19 11:03 JohnPaton

What's the status on this?

Is there any known workaround?

iDmple avatar Oct 18 '19 15:10 iDmple

@JohnPaton, @iDmple can you provide a simple example that I can use to build and test the solution?

ragrawal avatar May 08 '21 09:05 ragrawal

Sorry, that was 2 years ago, I don't have it lying around now 😕

JohnPaton avatar Jun 02 '21 11:06 JohnPaton