sklearn-pandas
Add DataFrameMapper.get_feature_names (wrapper for transformed_features_)
As this function is more or less the de facto standard in sklearn (implemented in FeatureUnion, CountVectorizer, PolynomialFeatures, DictVectorizer), it would reduce friction when using DataFrameMapper.
I think it's a good idea.
The only consideration is that transformed_features_ is filled when DataFrameMapper.transform is called, and it can change after every subsequent transform call.
In contrast, sklearn's get_feature_names() only needs the transformer to be fitted, and the feature names don't change after a transform call.
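A minimal mock (hypothetical names, no sklearn-pandas needed) illustrating why this ordering surprises people: the attribute simply doesn't exist until the first transform.

```python
class ToyMapper:
    """Toy stand-in for the mapper: output feature names are only
    captured as a side effect of transform, never of fit."""

    def fit(self, columns):
        self.fitted_columns_ = list(columns)
        return self

    def transform(self, columns):
        # Rebuilt on every call, so it can differ between calls.
        self.transformed_features_ = [f"{c}_encoded" for c in columns]
        return columns


mapper = ToyMapper().fit(["city", "age"])
assert not hasattr(mapper, "transformed_features_")  # fit alone is not enough
mapper.transform(["city", "age"])
print(mapper.transformed_features_)  # ['city_encoded', 'age_encoded']
```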
@molaxx thanks for the suggestion, I also think it's a good idea, with the caveat that @arnau126 mentions. Can you submit a PR with the implementation and some modifications to the README to indicate the availability of the method? Thanks.
@arnau126 this behavior seems like a bug. Why would we want the features to change after each transform? Furthermore, after fitting and pickling a model, the feature set used for training is lost. I think this logic should move to DataFrameMapper.fit. What do you think?
Currently it's not possible to move this logic to fit because get_names (the function used to build transformed_features_) needs the transformed columns.
And it needs them because:
- not all transformers have get_feature_names() or classes_.
- sometimes classes_ doesn't contain the feature names.
https://github.com/pandas-dev/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L241
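A rough sketch of that inference order (hypothetical helper names, loosely mirroring the linked get_names logic): the transformed output's width is the only fallback when the transformer exposes no name metadata.

```python
def infer_feature_names(name, transformer, n_output_columns):
    """Best-effort feature-name inference, roughly in the order the
    linked get_names() tries: get_feature_names(), then classes_,
    then positional names derived from the transformed output."""
    if hasattr(transformer, "get_feature_names"):
        return list(transformer.get_feature_names())
    if hasattr(transformer, "classes_"):
        return [f"{name}_{cls}" for cls in transformer.classes_]
    # No metadata at all: only the transformed output's width tells
    # us how many columns came out, hence the need to transform first.
    if n_output_columns == 1:
        return [name]
    return [f"{name}_{i}" for i in range(n_output_columns)]


class NoMetadata:  # neither classes_ nor get_feature_names
    pass


print(infer_feature_names("city", NoMetadata(), 3))  # ['city_0', 'city_1', 'city_2']
```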
Couldn't this be handled by transforming a single row after the fitting? It's a bit hacky, but not having feature names after a fit is a bit surprising. I like boring APIs :) Plus, I'll probably end up doing this manually so the feature names would be available right after unpickling the model.
@molaxx I don't like the idea of transforming just one row to be able to get the feature names, it's too hacky. I understand it can be surprising that one needs to transform the data to be able to get the column names, but this is due to the complex nature of the custom transformers.
What we can do is try to get these from the last transformer for each column during fit, like FeatureUnion does, and fail if they cannot be extracted, with a message indicating that one has to transform first to get inferred column names in those cases.
Are you up for submitting a PR for such a feature?
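A hedged sketch of that fit-time attempt (hypothetical function, loosely following FeatureUnion's approach of asking each fitted transformer for its names):

```python
def feature_names_at_fit(name, transformer):
    """Try to extract output names right after fit, as FeatureUnion
    does; raise with a helpful message when the transformer exposes
    neither get_feature_names() nor classes_."""
    if hasattr(transformer, "get_feature_names"):
        return list(transformer.get_feature_names())
    if hasattr(transformer, "classes_"):
        return [f"{name}_{cls}" for cls in transformer.classes_]
    raise AttributeError(
        f"Transformer for column {name!r} provides no feature names; "
        "call transform() first to get inferred column names."
    )


class WithClasses:
    classes_ = ["a", "b"]


print(feature_names_at_fit("x", WithClasses()))  # ['x_a', 'x_b']
```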
Ok. Sounds good. I'll find time to work on it. Can you point me to a transformer that does not allow getting feature names post fit so I can test my solution?
I believe that any transformer that doesn't have a classes_ attribute or a get_feature_names method, which are the ones that _get_feature_names leverages, will work for the test. You can create a mock one in the tests.
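For instance, a minimal mock with neither attribute (hypothetical, just for tests):

```python
class OpaqueTransformer:
    """Mock transformer exposing neither classes_ nor
    get_feature_names, so name inference can only fall back on the
    width of the transformed output."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Expand each input value into two output columns.
        return [[v, v * 2] for v in X]


t = OpaqueTransformer().fit([1, 2, 3])
assert not hasattr(t, "classes_")
assert not hasattr(t, "get_feature_names")
print(t.transform([1, 2]))  # [[1, 2], [2, 4]]
```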
@molaxx OneHotEncoder is an example of a transformer without classes_ or get_feature_names(). See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Just popping in to say that I just spent ages trying to debug different numbers of columns in my training and test sets, because it turns out the test set had an extra label in a column that was being one-hot encoded. It would have been way easier to have some exception along the lines of "Number of columns from feature <feature> after transformation does not match last call" or something like that.
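A minimal reproduction of that train/test mismatch, using pd.get_dummies purely for illustration (the same shape mismatch appears whenever each set is one-hot encoded independently):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue"]})
test = pd.DataFrame({"color": ["red", "blue", "green"]})  # extra label

# Encoding each set independently yields different column counts,
# which only surfaces later as an opaque shape error in the model.
X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)
print(X_train.shape[1], X_test.shape[1])  # 2 vs 3
```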
What's the status on this?
Is there any known workaround?
@JohnPaton, @iDmple, can you provide a simple example that I can use to build and test the solution?
Sorry, that was 2 years ago, I don't have it lying around now 😕