sklearn-pandas
CV on Params in DataFrameMapper Transforms?
Apologies for posting this as an issue, but I feel it could be a useful use case.
I'm wondering whether what I'm trying to do is, or should be, possible.
Suppose I set up a pipeline like this:
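from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import SGDClassifier
from sklearn_pandas import DataFrameMapper

# make pipelines for individual variables
name_to_tfidf = Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])
ticket_to_tfidf = Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])
full_mapper = DataFrameMapper([
    ('Name', name_to_tfidf),
    ('Ticket', ticket_to_tfidf),
    ('Sex', LabelBinarizer())
])
# build full pipeline
full_pipeline = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])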
Is there a way to pass a list of options to cross-validate over for individual transforms in the DataFrameMapper, like this:
# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # now set the params for the DataFrameMapper part of the pipeline
    'mapper__features': [[
        ('Name', deepcopy(name_to_tfidf).set_params(name_vect__analyzer=['char', 'char_wb'])),
        ('Ticket', deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer=['char', 'char_wb']))
    ]]
}
Ideally I'd like to cross-validate over which params are best for the name_to_tfidf and ticket_to_tfidf pipelines inside the DataFrameMapper.
But passing a list of options to set_params() like this gives me the following error when I go to fit:
ValueError: ['char', 'char_wb'] is not a valid tokenization scheme/analyzer
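For context, a minimal sketch of why this error appears, using only scikit-learn (the sample string is made up for illustration). set_params() stores whatever value it is given, so the whole list becomes the analyzer and the vectorizer rejects it at fit time. Lists are only expanded into single values when they sit in the grid itself, which ParameterGrid (the helper GridSearchCV uses to enumerate candidates) shows:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid

# set_params() does not expand lists: the list itself becomes the analyzer value.
vect = CountVectorizer().set_params(analyzer=['char', 'char_wb'])

try:
    vect.fit(["Braund, Mr. Owen Harris"])  # illustrative sample text
    error_message = None
except ValueError as exc:
    # The vectorizer only accepts a single analyzer (a string or a callable).
    error_message = str(exc)

# A grid, by contrast, is expanded into one candidate per listed value.
candidates = list(ParameterGrid({'name_vect__analyzer': ['char', 'char_wb']}))
for candidate in candidates:
    print(candidate)
```

Each candidate here carries a single string for the analyzer, which is the form the vectorizer expects.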
I think what you want is GridSearchCV. Just create a GridSearchCV, pass it to the pipeline as a "normal" estimator, and you will get what you want.
My bad - I left that part out. I am doing this:
# set up grid search
gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)
And then:
# do the fit
gs_clf.fit(df,df['Survived'])
So I am able to cross-validate over the clf params, but I'd also like to cross-validate over some params within the transforms in the DataFrameMapper. I'm just not sure how to go about this.
Here is a full example notebook.
Basically, I was passing ['char', 'char_wb'] to set_params(), for example on this line:
('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),
I was hoping that GridSearchCV would then also treat those two values as options in the grid.
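One possible workaround, sketched here under assumptions rather than as a confirmed sklearn-pandas feature: since the grid above already addresses the mapper through 'mapper__features', each analyzer combination can be written out as its own complete feature list, with a single analyzer value per vectorizer. The helper make_tfidf and the name feature_candidates are hypothetical names introduced for this sketch. It assumes DataFrameMapper accepts features through set_params(), as the 'mapper__features' key in the question implies.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


def make_tfidf(prefix, analyzer):
    """Hypothetical helper: a fresh count -> tf-idf pipeline with one analyzer."""
    return Pipeline([
        (prefix + '_vect', CountVectorizer(analyzer=analyzer)),
        (prefix + '_tfidf', TfidfTransformer()),
    ])


analyzers = ['char', 'char_wb']

# One complete feature list per (Name analyzer, Ticket analyzer) combination.
# 'Sex' is repeated in every list so that column stays in the mapper output.
feature_candidates = [
    [
        ('Name', make_tfidf('name', name_analyzer)),
        ('Ticket', make_tfidf('ticket', ticket_analyzer)),
        ('Sex', LabelBinarizer()),
    ]
    for name_analyzer, ticket_analyzer in product(analyzers, analyzers)
]

full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # Assumption: the mapper's feature list can be swapped via set_params().
    'mapper__features': feature_candidates,
}
```

Passing this full_params to GridSearchCV(full_pipeline, full_params, n_jobs=-1) would then search 3 x 2 x 2 x 4 = 48 combinations, with the four mapper variants treated as ordinary grid values.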