
Expose parameters from transformers as parameters of the mapper

Open gwerbin opened this issue 7 years ago • 8 comments

Currently, it is hard to tune a "parametric" transformer inside a DataFrameMapper because the parameters of the underlying transformers aren't exposed. As a result, you can't adjust those transformers' parameters with GridSearchCV or RandomizedSearchCV.

Example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper

pipeline = Pipeline([
    ('vectorizer',
        DataFrameMapper([
            ('document_contents', CountVectorizer())
        ], df_out=False)),
    ('classifier', MultinomialNB())
])

pipeline.get_params()

These are the params I get:

{'memory': None,
 'steps': [('vectorizer', DataFrameMapper(default=False, df_out=False,
           features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None))],
           input_df=False, sparse=False)),
  ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
 'vectorizer': DataFrameMapper(default=False, df_out=False,
         features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None))],
         input_df=False, sparse=False),
 'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 'vectorizer__default': False,
 'vectorizer__df_out': False,
 'vectorizer__features': [('document_contents',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None))],
 'vectorizer__input_df': False,
 'vectorizer__sparse': False,
 'classifier__alpha': 1.0,
 'classifier__class_prior': None,
 'classifier__fit_prior': True}

Naively, I would expect something like this:

{'memory': None,
 'steps': [('vectorizer', DataFrameMapper(default=False, df_out=False,
           features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None))],
           input_df=False, sparse=False)),
  ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
 'vectorizer': DataFrameMapper(default=False, df_out=False,
         features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None))],
         input_df=False, sparse=False),
 'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 'vectorizer__document_contents__analyzer': 'word',
 'vectorizer__document_contents__binary': False,
 'vectorizer__document_contents__decode_error': 'strict',
 'vectorizer__document_contents__dtype': numpy.int64,
 'vectorizer__document_contents__encoding': 'utf-8',
 'vectorizer__document_contents__input': 'content',
 'vectorizer__document_contents__lowercase': True,
 'vectorizer__document_contents__max_df': 1.0,
 'vectorizer__document_contents__max_features': None,
 'vectorizer__document_contents__min_df': 1,
 'vectorizer__document_contents__ngram_range': (1, 1),
 'vectorizer__document_contents__preprocessor': None,
 'vectorizer__document_contents__stop_words': None,
 'vectorizer__document_contents__strip_accents': None,
 'vectorizer__document_contents__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vectorizer__document_contents__tokenizer': None,
 'vectorizer__document_contents__vocabulary': None,
 'vectorizer__default': False,
 'vectorizer__df_out': False,
 'vectorizer__input_df': False,
 'vectorizer__sparse': False,
 'classifier__alpha': 1.0,
 'classifier__class_prior': None,
 'classifier__fit_prior': True}

which would be very handy for, say, using GridSearchCV to compare word and character analyzers.

This seems like it shouldn't be too hard to implement. If there's interest I can start digging around the codebase to try to spend some time on it.
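
To illustrate the intended use case, here is a sketch of what a grid search comparing analyzers might look like. The double-underscore keys are hypothetical: they follow the naming scheme proposed above, which sklearn-pandas does not expose yet.

```python
# Hypothetical usage: these double-underscore keys follow the naming scheme
# proposed above; sklearn-pandas does not currently expose them.
param_grid = {
    'vectorizer__document_contents__analyzer': ['word', 'char'],
    'vectorizer__document_contents__ngram_range': [(1, 1), (1, 2)],
}

# Once supported, this grid could be passed straight to a grid search, e.g.:
# GridSearchCV(pipeline, param_grid, cv=5).fit(X, y)
```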

gwerbin avatar Jul 26 '18 12:07 gwerbin

@gwerbin That would be helpful for making transformers inside the mapper "grid-searchable". We just need to make sure that these "deep" parameters are correctly assigned to the nested objects via the set_params method.
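
As a rough illustration of the routing involved (not sklearn-pandas code; the Toy* classes are stand-ins for real transformers), a deep parameter such as document_contents__analyzer would be split on the double underscore and the remainder assigned to the nested object:

```python
# Illustrative sketch of sklearn-style "deep" parameter routing.
# ToyEstimator and ToyMapper are hypothetical stand-ins, not library classes.
class ToyEstimator:
    def __init__(self, analyzer='word'):
        self.analyzer = analyzer

class ToyMapper:
    def __init__(self, features):
        # features is a list of (column_name, transformer) pairs
        self.features = dict(features)

    def set_params(self, **params):
        for key, value in params.items():
            # Split 'column__param' into the target transformer and its attribute
            column, _, param = key.partition('__')
            setattr(self.features[column], param, value)
        return self

mapper = ToyMapper([('document_contents', ToyEstimator())])
mapper.set_params(document_contents__analyzer='char')
```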

devforfu avatar Jul 26 '18 12:07 devforfu

@devforfu One thing I just thought of is how to handle mappers like this:

pipeline = Pipeline([
    ('vectorizer',
        DataFrameMapper([
            ('document_contents', [TextCleaner(), CountVectorizer()])
        ], df_out=False)),
    ('classifier', MultinomialNB())
])

(I made up the TextCleaner class, just for illustration)

What would the step names be in this case? Maybe something like:

'vectorizer__document_contents__0__text_cleaning_method': 'default',
'vectorizer__document_contents__1__analyzer': 'word',
'vectorizer__document_contents__1__binary': True,

gwerbin avatar Jul 27 '18 14:07 gwerbin

@gwerbin I guess it could be the class name as well. As I recall, make_pipeline() from scikit-learn names pipeline steps using the lower-cased names of the classes the pipeline is built from. So in this case it could be something similar:

'vectorizer__document_contents__textcleaner__text_cleaning_method': 'default',
'vectorizer__document_contents__countvectorizer__analyzer': 'word',
'vectorizer__document_contents__countvectorizer__binary': True,
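
A minimal sketch of that naming scheme, mirroring what make_pipeline does (lower-cased class name, with a numeric suffix when the same class appears more than once in the list). The TextCleaner and CountVectorizer classes here are empty placeholders, not the real transformers:

```python
from collections import Counter

def name_steps(transformers):
    """Derive ('step_name', instance) pairs from a list of transformers,
    in the spirit of sklearn's make_pipeline naming."""
    names = [type(t).__name__.lower() for t in transformers]
    counts = Counter(names)
    seen = Counter()
    result = []
    for name, t in zip(names, transformers):
        if counts[name] > 1:
            # Disambiguate repeated classes with a numeric suffix
            seen[name] += 1
            result.append((f'{name}-{seen[name]}', t))
        else:
            result.append((name, t))
    return result

# Placeholder classes standing in for the transformers discussed above
class TextCleaner: pass
class CountVectorizer: pass

steps = name_steps([TextCleaner(), CountVectorizer()])
# → [('textcleaner', ...), ('countvectorizer', ...)]
```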

devforfu avatar Aug 01 '18 04:08 devforfu

@gwerbin Thanks for your contribution! It would certainly be a very interesting feature, since currently it is impossible to tune the internal parameters of the DataFrameMapper within a pipeline during any optimization.

Would you be willing to implement such a feature? Ideally its interface should be as close to sklearn's as possible, so that it stays compatible with sklearn's grid and randomized searches.

dukebody avatar Aug 05 '18 16:08 dukebody

@dukebody I'm willing, but can't make any guarantees on a timeline. I've been pretty busy lately and don't want to commit to anything I can't deliver.

I would also need time to familiarize myself with how parameters are passed in the current code.

If anyone else wants to pick this up, I won't be offended.

gwerbin avatar Aug 06 '18 17:08 gwerbin

@gwerbin If nobody else has started working on this, I could take a shot at a basic solution. Of course, we can join efforts as soon as you become more available.

devforfu avatar Aug 07 '18 03:08 devforfu

Ok, I've started work on the proposed feature in my fork. There are a couple of new tests as well.

Some of the code needed to implement get_params and set_params could probably be borrowed from scikit-learn instead of writing a custom solution, but I've decided to start with something straightforward. I also think we can use TransformerPipeline for cases where only one transformer is defined per column, because it matches the ('step_name', instance) format expected by the pipeline.

I'm testing the getters/setters now; next I'll check whether the methods are compatible with GridSearchCV. As soon as a basic version is ready, I'll open a PR for review and improvements.
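
For the read side, a rough sketch of a get_params that flattens nested transformer parameters under 'column__param' keys (ToyVectorizer is a hypothetical stand-in; the real implementation would lean on sklearn's BaseEstimator machinery):

```python
# Illustrative sketch only: flatten per-column transformer parameters
# into sklearn-style 'column__param' keys. ToyVectorizer is a stand-in.
class ToyVectorizer:
    def __init__(self, analyzer='word', binary=False):
        self.analyzer = analyzer
        self.binary = binary

    def get_params(self):
        return {'analyzer': self.analyzer, 'binary': self.binary}

def mapper_get_params(features):
    params = {}
    for column, transformer in features:
        for name, value in transformer.get_params().items():
            # Prefix each parameter with its column name
            params[f'{column}__{name}'] = value
    return params

params = mapper_get_params([('document_contents', ToyVectorizer())])
```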

devforfu avatar Aug 15 '18 08:08 devforfu

Marking this as "good first issue" to review the PR you created, @devforfu.

dukebody avatar Oct 17 '18 17:10 dukebody