pipelinehelper
grid.best_estimator_.get_params() vague selected_model output
In my example code, GaussianNB() was selected as the best estimator, but the selected_model output from grid.best_estimator_.get_params() does not seem to reflect this, even though I instantiated it as GaussianNB in the PipelineHelper. The selected_model entries do, however, show the parameters of GaussianNB(), such as priors and var_smoothing. The available_models entries in the grid.get_params().keys() output look fine, though.
I suspect this has something to do with the fact that I left the default parameters of GaussianNB() as they are and did not put anything for it in the grid search.
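For context, the classifier step is set up roughly like this (simplified sketch: the preprocessing ColumnTransformer is left out and the tree-model grids here are only placeholders):

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from pipelinehelper import PipelineHelper

pipe = Pipeline([
    # ('preprosessor', ...),  # ColumnTransformer with the imputers / one-hot encoder, omitted here
    ('clf', PipelineHelper([
        ('GaussianNB', GaussianNB()),  # left at its default parameters
        ('RandomForestClassifier', RandomForestClassifier()),
        ('ExtraTreesClassifier', ExtraTreesClassifier()),
    ])),
])

params = {
    # no entries for GaussianNB, so it is searched with default parameters only
    'clf__selected_model': pipe.named_steps['clf'].generate({
        'RandomForestClassifier__n_estimators': [100, 200],  # placeholder values
        'ExtraTreesClassifier__n_estimators': [100, 200],    # placeholder values
    }),
}

grid = GridSearchCV(pipe, params, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)  # X, y: the training data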
Here are the keys from the grid.best_estimator_.get_params() output:
['clf',
'clf__available_models',
'clf__available_models__ExtraTreesClassifier',
'clf__available_models__ExtraTreesClassifier__bootstrap',
'clf__available_models__ExtraTreesClassifier__ccp_alpha',
'clf__available_models__ExtraTreesClassifier__class_weight',
'clf__available_models__ExtraTreesClassifier__criterion',
'clf__available_models__ExtraTreesClassifier__max_depth',
'clf__available_models__ExtraTreesClassifier__max_features',
'clf__available_models__ExtraTreesClassifier__max_leaf_nodes',
'clf__available_models__ExtraTreesClassifier__max_samples',
'clf__available_models__ExtraTreesClassifier__min_impurity_decrease',
'clf__available_models__ExtraTreesClassifier__min_impurity_split',
'clf__available_models__ExtraTreesClassifier__min_samples_leaf',
'clf__available_models__ExtraTreesClassifier__min_samples_split',
'clf__available_models__ExtraTreesClassifier__min_weight_fraction_leaf',
'clf__available_models__ExtraTreesClassifier__n_estimators',
'clf__available_models__ExtraTreesClassifier__n_jobs',
'clf__available_models__ExtraTreesClassifier__oob_score',
'clf__available_models__ExtraTreesClassifier__random_state',
'clf__available_models__ExtraTreesClassifier__verbose',
'clf__available_models__ExtraTreesClassifier__warm_start',
'clf__available_models__GaussianNB',
'clf__available_models__GaussianNB__priors',
'clf__available_models__GaussianNB__var_smoothing',
'clf__available_models__RandomForestClassifier',
'clf__available_models__RandomForestClassifier__bootstrap',
'clf__available_models__RandomForestClassifier__ccp_alpha',
'clf__available_models__RandomForestClassifier__class_weight',
'clf__available_models__RandomForestClassifier__criterion',
'clf__available_models__RandomForestClassifier__max_depth',
'clf__available_models__RandomForestClassifier__max_features',
'clf__available_models__RandomForestClassifier__max_leaf_nodes',
'clf__available_models__RandomForestClassifier__max_samples',
'clf__available_models__RandomForestClassifier__min_impurity_decrease',
'clf__available_models__RandomForestClassifier__min_impurity_split',
'clf__available_models__RandomForestClassifier__min_samples_leaf',
'clf__available_models__RandomForestClassifier__min_samples_split',
'clf__available_models__RandomForestClassifier__min_weight_fraction_leaf',
'clf__available_models__RandomForestClassifier__n_estimators',
'clf__available_models__RandomForestClassifier__n_jobs',
'clf__available_models__RandomForestClassifier__oob_score',
'clf__available_models__RandomForestClassifier__random_state',
'clf__available_models__RandomForestClassifier__verbose',
'clf__available_models__RandomForestClassifier__warm_start',
'clf__optional',
'clf__selected_model',
'clf__selected_model__priors',
'clf__selected_model__var_smoothing',
'memory',
'preprosessor',
'preprosessor__C_Fimp',
'preprosessor__C_Fimp__cat_imputer',
'preprosessor__C_Fimp__cat_imputer__add_indicator',
'preprosessor__C_Fimp__cat_imputer__copy',
'preprosessor__C_Fimp__cat_imputer__fill_value',
'preprosessor__C_Fimp__cat_imputer__missing_values',
'preprosessor__C_Fimp__cat_imputer__strategy',
'preprosessor__C_Fimp__cat_imputer__verbose',
'preprosessor__C_Fimp__memory',
'preprosessor__C_Fimp__onehot',
'preprosessor__C_Fimp__onehot__categories',
'preprosessor__C_Fimp__onehot__drop',
'preprosessor__C_Fimp__onehot__dtype',
'preprosessor__C_Fimp__onehot__handle_unknown',
'preprosessor__C_Fimp__onehot__sparse',
'preprosessor__C_Fimp__steps',
'preprosessor__C_Fimp__verbose',
'preprosessor__N_Fimp',
'preprosessor__N_Fimp__memory',
'preprosessor__N_Fimp__num_imputer',
'preprosessor__N_Fimp__num_imputer__add_indicator',
'preprosessor__N_Fimp__num_imputer__copy',
'preprosessor__N_Fimp__num_imputer__fill_value',
'preprosessor__N_Fimp__num_imputer__missing_values',
'preprosessor__N_Fimp__num_imputer__strategy',
'preprosessor__N_Fimp__num_imputer__verbose',
'preprosessor__N_Fimp__steps',
'preprosessor__N_Fimp__verbose',
'preprosessor__n_jobs',
'preprosessor__remainder',
'preprosessor__sparse_threshold',
'preprosessor__transformer_weights',
'preprosessor__transformers',
'preprosessor__verbose',
'steps',
'verbose']
I'm afraid I don't understand your question:
In my example code GaussianNB() was selected as the best estimator however it seems like the selected_model output from grid.best_estimator_.get_params() does not reflect this
In the above output, the lines
'clf__selected_model',
'clf__selected_model__priors',
'clf__selected_model__var_smoothing',
suggest that the GaussianNB model was selected as the best estimator, as you describe. What am I missing?
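For what it's worth, you can also read this off programmatically: the value stored behind that key is the selected estimator instance itself. A minimal sketch (assuming the PipelineHelper step is named 'clf', as in your output):

selected = grid.best_estimator_.get_params()['clf__selected_model']
print(type(selected).__name__)  # e.g. GaussianNB
print(selected.get_params())    # its hyperparameters, e.g. priors and var_smoothing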
Shouldn't it be written as 'clf__selected_model__GaussianNB__priors' instead of 'clf__selected_model__priors'? It is not easy to tell that the selected model was GaussianNB just by looking at the parameter names that follow clf__selected_model. It is not very explicit, given that I specifically defined ("GaussianNB", GaussianNB()) in my PipelineHelper in my example code.
This becomes especially problematic if you have RandomForestClassifier and ExtraTreesClassifier in your PipelineHelper, both of which share almost identical parameters, and you have to figure out which one was chosen as selected_model when calling grid.best_estimator_.get_params().
Ah OK, I now see what you mean. I agree that this would be helpful, but I'll have to think about the internal changes that this fix would imply.
If this is not a trivial matter, then that is fine. A user can always look at grid.best_params_ to see which parameters (and which model) were chosen. I just thought it would be nice to have it reflected in the grid.best_estimator_.get_params() output as well.
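For example (a rough sketch; the exact shape of the value stored under clf__selected_model depends on how PipelineHelper encodes its candidates):

print(grid.best_params_['clf__selected_model'])
# prints roughly something like ('GaussianNB', {}); the name defined in the
# PipelineHelper appears here, so the chosen model is identifiable at a glance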
I like to play with something like this, especially when one is using two scoring functions:
import pandas as pd
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=0, n_jobs=-1)
grid.fit(X, y)

# Full CV results as a DataFrame, keyed by the parameter combination
df_grid_search = pd.DataFrame(grid.cv_results_)
df_grid_search = df_grid_search.set_index('params')[
    ['mean_fit_time', 'mean_score_time', 'mean_test_score', 'std_test_score', 'rank_test_score']]
# Ten best parameter combinations, best first
df_grid_search.sort_values(by='rank_test_score').head(10)
Or, with a bit more code noise:
grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=0, n_jobs=-1)
grid.fit(X, y)

df_grid_search = pd.DataFrame(grid.cv_results_)
# Flatten each params dict into a readable string so it works nicely as an index
df_grid_search['params'] = [str(list(x.values())).replace('(', '').replace(')', '')
                            for x in df_grid_search['params']]
# Keep the timing columns plus every mean_test_*/rank_test_* column (works with multiple scorers)
df_grid_search = df_grid_search.set_index('params')[
    ['mean_fit_time', 'mean_score_time']
    + [c for c in df_grid_search.columns if ('rank_test' in c) or ('mean_test' in c)]]
df_grid_search.sort_values(by=[c for c in df_grid_search.columns if 'rank_test' in c]).head(10)
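Either way, the resulting table has one row per parameter combination, sorted by test-score rank, and since each row is indexed by its parameter combination (which contains the selected model), it should also be easy to see which PipelineHelper candidate a row corresponds to.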