sklearn-pandas
Add DataFrameMapper.get_feature_names (wrapper for transformed_features_)
As this function is more or less the de facto standard in sklearn (implemented in FeatureUnion, CountVectorizer, PolynomialFeatures, DictVectorizer), it would reduce friction when using DataFrameMapper.
I think it's a good idea.
The only consideration is that transformed_features_ is filled when DataFrameMapper.transform is called, and it can change after every subsequent transform call.
In contrast, sklearn's get_feature_names() only needs the transformer to be fitted, and the feature names don't change after a transform call.
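A minimal mock (hypothetical names, no sklearn-pandas needed) illustrating why this ordering surprises people: the attribute simply doesn't exist until the first transform.

```python
class ToyMapper:
    """Toy stand-in for the mapper: output feature names are only
    captured as a side effect of transform, never of fit."""

    def fit(self, columns):
        self.fitted_columns_ = list(columns)
        return self

    def transform(self, columns):
        # Rebuilt on every call, so it can differ between calls.
        self.transformed_features_ = [f"{c}_encoded" for c in columns]
        return columns


mapper = ToyMapper().fit(["city", "age"])
assert not hasattr(mapper, "transformed_features_")  # fit alone is not enough
mapper.transform(["city", "age"])
print(mapper.transformed_features_)  # ['city_encoded', 'age_encoded']
```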
@molaxx thanks for the suggestion, I also think it's a good idea, with the caveat that @arnau126 mentions. Can you submit a PR with the implementation and some modifications to the README to indicate the availability of the method? Thanks.
@arnau126 this behavior seems like a bug. Why would we want the features to change after each transform? Furthermore, after fitting and pickling a model, the feature set used for training is lost. I think this logic should move to DataFrameMapper.fit. What do you think?
Currently it's not possible to move this logic to fit because get_names (the function used to build transformed_features_) needs the transformed columns.
And it needs them because:
- not all transformers have get_feature_names() or classes_.
- sometimes classes_ doesn't contain the feature names.
https://github.com/pandas-dev/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L241
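A rough sketch of that inference order (hypothetical helper names, loosely mirroring the linked get_names logic): the transformed output's width is the only fallback when the transformer exposes no name metadata.

```python
def infer_feature_names(name, transformer, n_output_columns):
    """Best-effort feature-name inference, roughly in the order the
    linked get_names() tries: get_feature_names(), then classes_,
    then positional names derived from the transformed output."""
    if hasattr(transformer, "get_feature_names"):
        return list(transformer.get_feature_names())
    if hasattr(transformer, "classes_"):
        return [f"{name}_{cls}" for cls in transformer.classes_]
    # No metadata at all: only the transformed output's width tells
    # us how many columns came out, hence the need to transform first.
    if n_output_columns == 1:
        return [name]
    return [f"{name}_{i}" for i in range(n_output_columns)]


class NoMetadata:  # neither classes_ nor get_feature_names
    pass


print(infer_feature_names("city", NoMetadata(), 3))  # ['city_0', 'city_1', 'city_2']
```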
Couldn't this be handled by transforming a single row after the fitting? It's a bit hacky, but not having feature names after a fit is a bit surprising. I like boring APIs :) Plus, I'll probably end up doing this manually so the feature names would be available right after unpickling the model.
@molaxx I don't like the idea of transforming just one row to be able to get the feature names, it's too hacky. I understand it can be surprising that one needs to transform the data to be able to get the column names, but this is due to the complex nature of the custom transformers.
What we can do is try to get these from the last transformer for each column during fit, like FeatureUnion does, and fail if they cannot be extracted, with a message indicating that one has to transform first to get inferred column names in those cases.
Are you up for submitting a PR for such a feature?
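A hedged sketch of that fit-time attempt (hypothetical function, loosely following FeatureUnion's approach of asking each fitted transformer for its names):

```python
def feature_names_at_fit(name, transformer):
    """Try to extract output names right after fit, as FeatureUnion
    does; raise with a helpful message when the transformer exposes
    neither get_feature_names() nor classes_."""
    if hasattr(transformer, "get_feature_names"):
        return list(transformer.get_feature_names())
    if hasattr(transformer, "classes_"):
        return [f"{name}_{cls}" for cls in transformer.classes_]
    raise AttributeError(
        f"Transformer for column {name!r} provides no feature names; "
        "call transform() first to get inferred column names."
    )


class WithClasses:
    classes_ = ["a", "b"]


print(feature_names_at_fit("x", WithClasses()))  # ['x_a', 'x_b']
```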
Ok. Sounds good. I'll find time to work on it. Can you point me to a transformer that does not allow getting feature names post fit so I can test my solution?
I believe that any transformer that doesn't have a classes_ attribute or a get_feature_names method, which are the ones that _get_feature_names leverages, will work for the test. You can create a mock one in the tests.
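For instance, a minimal mock with neither attribute (hypothetical, just for tests):

```python
class OpaqueTransformer:
    """Mock transformer exposing neither classes_ nor
    get_feature_names, so name inference can only fall back on the
    width of the transformed output."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Expand each input value into two output columns.
        return [[v, v * 2] for v in X]


t = OpaqueTransformer().fit([1, 2, 3])
assert not hasattr(t, "classes_")
assert not hasattr(t, "get_feature_names")
print(t.transform([1, 2]))  # [[1, 2], [2, 4]]
```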
@molaxx OneHotEncoder is an example of a transformer without classes_ or get_feature_names(). See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Just popping in to say that I just spent ages trying to debug different numbers of columns in my training and test sets, because it turns out the test set had an extra label in a column that was being one-hot encoded. It would have been way easier to have some exception along the lines of "Number of columns from feature <feature> after transformation does not match last call" or something like that.
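A minimal reproduction of that train/test mismatch, using pd.get_dummies purely for illustration (the same shape mismatch appears whenever each set is one-hot encoded independently):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue"]})
test = pd.DataFrame({"color": ["red", "blue", "green"]})  # extra label

# Encoding each set independently yields different column counts,
# which only surfaces later as an opaque shape error in the model.
X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)
print(X_train.shape[1], X_test.shape[1])  # 2 vs 3
```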
What's the status on this?
Is there any known workaround?
@JohnPaton, @iDmple, can you provide a simple example that I can use to build and test the solution?
Sorry, that was 2 years ago, I don't have it lying around now 😕