CV on Params in DataFrameMapper Transforms?

Open andrewm4894 opened this issue 7 years ago • 2 comments

Apologies for posting this as an issue, but I feel like it could be a useful use case.

I'm just wondering if something like what I'm trying to do is, or should be, possible.

If I set up a pipeline like:

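from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# make a count -> tf-idf pipeline for each text column
name_to_tfidf = Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])
ticket_to_tfidf = Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])

# map each DataFrame column to its transformer
full_mapper = DataFrameMapper([
    ('Name', name_to_tfidf),
    ('Ticket', ticket_to_tfidf),
    ('Sex', LabelBinarizer())
])

# build full pipeline
# (max_iter replaces n_iter, which newer scikit-learn releases no longer accept)
full_pipeline = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(max_iter=15, warm_start=True))
])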

Is there a way to pass a list of options to cross-validate over for individual transforms in the DataFrameMapper, like here:

# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {'clf__alpha': [1e-2,1e-3,1e-4],
               'clf__loss':['modified_huber','hinge'],
               'clf__penalty':['l2','l1'],
               # now set the params for the datamapper part of the pipeline
               'mapper__features':[[
                   ('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),
                   ('Ticket',deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer = ['char', 'char_wb']))
               ]]
              }

Ideally I'd like to CV on which params are best for the name_to_tfidf and ticket_to_tfidf pipelines inside the DataFrameMapper.

But passing a list of options to set_params() like this gives me this error when I go to fit:

ValueError: ['char', 'char_wb'] is not a valid tokenization scheme/analyzer

andrewm4894, Sep 06 '17 10:09
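
For context on the error above: set_params() does not expand a list into alternatives. It stores whatever value it is given, so the analyzer becomes the list itself and the vectorizer rejects it at fit time. A minimal sketch using only scikit-learn (the sample string is made up for illustration, and the exact error text may differ between scikit-learn versions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# set_params stores the value verbatim; the list is NOT treated as a set of options
vect = CountVectorizer().set_params(analyzer=['char', 'char_wb'])
stored = vect.get_params()['analyzer']
print(stored)  # the list itself is now the analyzer setting

# fitting fails because a list is not a valid analyzer
try:
    vect.fit(['Braund, Mr. Owen Harris'])  # illustrative sample text
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(type(err).__name__, err)
```

Only the param grid handed to GridSearchCV treats a list as "try each of these values"; a list passed to set_params() is just a value.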

I think what you want is GridSearchCV: just create a GridSearchCV and pass it to the pipeline as a "normal" estimator, then you will get what you want.

scotthuang1989, Sep 08 '17 07:09

My bad - I left that part out. I am doing this:

# set up grid search
gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)

And then:

# do the fit
gs_clf.fit(df,df['Survived'])

So I am able to do the CV on the clf params, but I'd also like to do CV on some params within the transforms in the DataFrameMapper. I'm just not sure how to go about this.

Here is a full example notebook.

Basically I was passing ['char', 'char_wb'] in this line, for example:

('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),

I was hoping that GridSearchCV would then also consider those two values in the grid.

andrewm4894, Sep 08 '17 08:09
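
One possible workaround, sketched under assumptions rather than confirmed by the thread: since the grid above already sets 'mapper__features' to a list containing one candidate, the same key could hold several candidates, each a complete feature list whose pipelines were built with a single analyzer value. The helper make_tfidf and the name feature_candidates are hypothetical, introduced only for this sketch. It also assumes the installed sklearn-pandas version lets GridSearchCV clone a DataFrameMapper and set its features parameter.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


def make_tfidf(prefix, analyzer):
    """Hypothetical helper: fresh count -> tf-idf pipeline with ONE analyzer value."""
    return Pipeline([
        (prefix + '_vect', CountVectorizer(analyzer=analyzer)),
        (prefix + '_tfidf', TfidfTransformer()),
    ])


analyzers = ['char', 'char_wb']

# one complete feature list per (Name analyzer, Ticket analyzer) combination
feature_candidates = [
    [
        ('Name', make_tfidf('name', name_analyzer)),
        ('Ticket', make_tfidf('ticket', ticket_analyzer)),
        ('Sex', LabelBinarizer()),
    ]
    for name_analyzer, ticket_analyzer in product(analyzers, analyzers)
]

# the grid now lists four alternative mappers instead of one
full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    'mapper__features': feature_candidates,
}
```

The resulting full_params would be passed to GridSearchCV(full_pipeline, full_params, n_jobs=-1) as before. The trade-off is that best_params_ reports the winning feature list as a whole, so the chosen analyzers have to be read back from the pipelines inside it.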