mlxtend
Adds fit_params support for stacking classifiers
Description
This PR aims to add fit parameter support for StackingClassifier, StackingCVClassifier, and EnsembleVoteClassifier.
Related issues or pull requests
Fixes #177, fixes #178, fixes #179
Pull Request requirements
- [ ] Added appropriate unit test functions in the ./mlxtend/*/tests directories
- [ ] Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
- [ ] Checked the test coverage by running nosetests ./mlxtend --with-coverage
- [ ] Checked for style issues by running flake8 ./mlxtend
- [ ] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
- [ ] Modify documentation in the appropriate location under mlxtend/docs/sources/ (@rasbt will take care of that)
- [ ] Checked that the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend
Hello @jrbourbeau! Thanks for updating the PR.
Cheers ! There are no PEP8 issues in this Pull Request. :beers:
Comment last updated on October 20, 2017 at 03:42 Hours UTC
@rasbt so far I've only added fit_param support for StackingClassifier. Any comments on the implementation? The following code snippet with cross_val_score should work now
import numpy as np
from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
# Generate some data
X, y = make_blobs(random_state=2)
# Build a StackingClassifier
classifiers = [RandomForestClassifier(random_state=2),
               SGDClassifier(random_state=2)]
meta_classifier = RandomForestClassifier(random_state=2)
sclf = StackingClassifier(classifiers=classifiers, meta_classifier=meta_classifier)
# Define some fit_params
fit_params = {'randomforestclassifier__sample_weight': np.arange(X.shape[0]),
              'sgdclassifier__intercept_init': np.unique(y),
              'meta-randomforestclassifier__sample_weight': np.full(X.shape[0], 7)}
# Pass fit params to StackingClassifier and cross_val_score
sclf.fit(X, y, **fit_params)
print('predictions = {}'.format(sclf.predict(X)))
print('scores = {}'.format(cross_val_score(sclf, X, y, fit_params=fit_params)))
Coverage decreased (-0.06%) to 89.597% when pulling 6fd998d72370c79d60c03028ec99c66bea0cccb0 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df61f552f362e7db69281dcc4acdffe06cb4 on rasbt:master.
Coverage increased (+0.05%) to 89.705% when pulling 27206b381d3b50bc5182f17503ad4bff71a696b5 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df61f552f362e7db69281dcc4acdffe06cb4 on rasbt:master.
Wow, this is awesome -- it even includes the support for both the level-2 classifiers as well as the meta-classifier. Again, thanks so much for the PR. I am happy to help with extending this to the other ensemble/stacking classifiers (and regressors) :).
Great! I can work on adding fit parameters to other estimators. Here are the ones I had in mind:
- [x] StackingClassifier
- [x] StackingCVClassifier
- [ ] EnsembleVoteClassifier
- [x] StackingRegressor
- [ ] StackingCVRegressor
Any others you can think of?
I think that includes all of the ensemble methods I can currently think of as well :). I was/am a bit busy due to a paper deadline at the end of the week, but I am happy to take care of a few of them as well so that you don't have to work on it all alone -- and of course, there's really no hurry :)
Just a very minor suggestion, could you change the docstring for the fit_params in fit() to the following:
fit_params : dict of string -> object, optional
    Parameters to pass to the fit methods of the `classifiers` and
    `meta_classifier`.
(The string -> object is how scikit-learn lists it; it's maybe good to use the same convention for consistency)
For sure, I definitely think the fit_params docstring here should match the corresponding scikit-learn docstring. Where did you find the fit_params : dict of string -> object, optional docstring in scikit-learn? I was using fit_params : dict, optional from the cross_val_score documentation http://scikit-learn.org/dev/modules/generated/sklearn.model_selection.cross_val_score.html
Yeah, I think that also "fit_params : dict of string -> object, optional" would only be minimally more helpful compared to "dict" (found it in the GridSearchCV docs; http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
Do you think adding something like Example 3 - Stacked Regression with sample_weight to the StackingRegressor user guide page would be more useful for users? (Similar examples could be added for stacked classifiers as well.) Not sure if that's too specific an example or not.
Coverage increased (+0.09%) to 89.75% when pulling 05942e9f9351a3c4f2f91543e69210c752c2522b on jrbourbeau:add_fit_params_for_cross_val_score into 3424df61f552f362e7db69281dcc4acdffe06cb4 on rasbt:master.
Example 3 - Stacked Regression with sample_weight
Yeah, I think this would be very useful! Maybe, "Example 3 - Stacked Regression with sample_weight using fit_params"? The reason is that it may be a bit more clear how to refer to the meta vs first-level classifiers regarding the string names.
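To make the string-name convention concrete, here is a minimal sketch of how keys of the form name__param could be grouped per estimator. It is only an illustration under assumptions: split_fit_params is a hypothetical helper, not a function from mlxtend, and the estimator names are copied from the snippet earlier in this thread, where the meta-classifier's name carries a meta- prefix.

```python
def split_fit_params(fit_params, estimator_names):
    """Hypothetical helper: group 'name__param' keys by estimator name."""
    routed = {name: {} for name in estimator_names}
    for key, value in fit_params.items():
        # Split on the first double underscore only, so parameter
        # names that contain '__' themselves are kept intact.
        name, sep, param = key.partition('__')
        if not sep or name not in routed:
            raise ValueError('Unknown fit parameter key: %r' % key)
        routed[name][param] = value
    return routed


# Names as used in the StackingClassifier snippet above
names = ['randomforestclassifier',
         'sgdclassifier',
         'meta-randomforestclassifier']
fit_params = {
    'randomforestclassifier__sample_weight': [1, 2, 3],
    'sgdclassifier__intercept_init': [0, 1, 2],
    'meta-randomforestclassifier__sample_weight': [7, 7, 7],
}
routed = split_fit_params(fit_params, names)
```

Each first-level classifier and the meta-classifier then receives only its own keyword arguments, e.g. routed['sgdclassifier'] holds just intercept_init.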
@rasbt sorry for the delay, I've been out of town at a conference.
I ran into an issue when trying to add fit_param support for StackingCVClassifier. The issue arises when each CV fold is being trained on.
https://github.com/rasbt/mlxtend/blob/5735e00afcf83e014b90a6726de8cf91354406eb/mlxtend/classifier/stacking_cv_classification.py#L168-L175
Some classifier fit parameters are arrays that have a value for each sample being trained on (e.g. the sample_weight parameter for RandomForestClassifier). This case is pretty straight-forward because the sample_weight array can be indexed using the same train_index array as the training data.
However, it's less clear how to deal with fit parameters that aren't of shape (n_samples,). For example, SGDClassifier has coef_init and intercept_init fit parameters that are of shape (n_classes, n_features) and (n_classes,), respectively. I guess one could check if a fit parameter is an array of shape (n_samples,), and if so index it with the train_index array. But this seems a little hack-ish to me :confused: I can already think of a couple of edge cases that would lead to problems.
Any suggestions on how to get around this issue?
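For illustration, the shape-based check described above could look roughly like the following sketch. The helper slice_fit_params is hypothetical (it is not part of this PR), and the arrays are made-up toy values. The last call shows one way the heuristic can misfire: when the number of samples happens to equal the number of classes, an intercept_init array is indistinguishable from a per-sample array and gets sliced too.

```python
import numpy as np


def slice_fit_params(fit_params, train_index, n_samples):
    """Hypothetical helper: index arrays of shape (n_samples,) per fold."""
    sliced = {}
    for name, value in fit_params.items():
        # Heuristic: treat any 1-D array of length n_samples as per-sample
        is_per_sample = (isinstance(value, np.ndarray)
                         and value.ndim == 1
                         and value.shape[0] == n_samples)
        sliced[name] = value[train_index] if is_per_sample else value
    return sliced


n_samples = 6
train_index = np.array([0, 1, 2, 3])
fit_params = {
    'sample_weight': np.arange(n_samples, dtype=float),  # shape (n_samples,)
    'intercept_init': np.zeros(3),                       # shape (n_classes,)
    'verbose': False,                                    # not an array
}
sliced = slice_fit_params(fit_params, train_index, n_samples)

# Edge case: 3 samples and 3 classes, so intercept_init is sliced by mistake
edge = slice_fit_params({'intercept_init': np.zeros(3)},
                        np.array([0, 2]), n_samples=3)
```

In the regular case only sample_weight is reduced to the training fold; in the edge case intercept_init silently loses an entry, which is the kind of failure that makes the heuristic fragile.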
No need to apologize, hope you enjoyed the conference and the trip!
how to deal with fit parameters
Hm, that's a good question ...
n_classes: I guess we don't have to worry about n_classes as that shouldn't change.
n_features: We don't have to worry about that in the 1st level classifiers as the k-folds should have the same feature dimension as the training set. However, it's a bit tricky regarding the 2nd-level (meta) classifier. Maybe we should add a check in the fit method for that, for example,
# One meta-feature per first-level classifier, plus the original
# features if `use_features_in_secondary` is enabled
n_features_in_2nd = len(self.classifiers)
if self.use_features_in_secondary:
    n_features_in_2nd += X.shape[1]
if ('coef_init' in meta_fit_params and
        meta_fit_params['coef_init'].shape[1] != n_features_in_2nd):
    raise AttributeError(
        "Number of features in the `coef_init` array of the"
        " meta-classifier's `fit_params` must be equal to the"
        " number of features this classifier expects based on the"
        " number of first-level classifiers and whether"
        " `use_features_in_secondary` is set to `True` or `False`."
        " Expected: %d. Got: %d."
        % (n_features_in_2nd, meta_fit_params['coef_init'].shape[1]))
On the other hand, there are probably multiple parameters whose shape depends on n_features somehow. And my guess is it would be quite tricky to maintain this.
n_samples: this is probably the trickiest one since we are modifying the training set in a sense, and like you said, we kind of need a way to determine when to use the subindices to pass the correct weights.
I agree with you that an automated checking based on the shape is quite unstable and might break in certain edge cases. Hm, I currently don't have a good idea for how to handle that.
For now, maybe we should just handle sample_weight explicitly by passing the respective training set indices, be very clear about that in the fit docstring, and mention that any other parameter whose shape depends on n_samples (or on n_features for the meta-classifier) might result in unexpected behavior?
~Btw, I just saw that the VotingClassifier in scikit-learn (which we ported from mlxtend, aka the EnsembleVoteClassifier some time ago) currently also doesn't support sample weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/voting_classifier.py#L176~
(Edit: Please ignore the passage above, I misread the code)
So, I was thinking that before we come up with some hacky work-arounds for those parameters that rely on n_samples or n_features, we could maybe take a similar approach and just allow support for fit params that are more general, e.g., the ones mentioned in
https://github.com/rasbt/mlxtend/issues/177
fit_params = {'xgbclassifier__eval_metric': 'mlogloss',
              'xgbclassifier__eval_set': [(X_test, y_test)],
              'xgbclassifier__early_stopping_rounds': 100,
              'xgbclassifier__verbose': False}
So, I think the best way would be to add similar exceptions for
sample_weight, coef, coef_init
to get at least some working versions that generally support fit_params without some unexpected behavior in edge cases :).
That sounds like a good plan to me! So just to be clear, right now we'll only support fit_params that are the same for each training fold (e.g. like the params in #177)? So something like
for num, (train_index, test_index) in enumerate(skf):
    if self.verbose > 0:
        print("Training and fitting fold %d of %d..." %
              ((num + 1), final_cv.get_n_splits()))
    try:  # (excerpt; the matching except clause is not shown)
        # The same fit params are passed unchanged to every fold
        model.fit(X[train_index], y[train_index], **clf_fit_params)
will work because we won't have to worry about having different slices of fit_params for different CV folds.
And obviously this will need to be clarified in the documentation :smiley:
Oh yeah, that's probably the best way to handle this for now :)
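As a rough, self-contained illustration of the approach agreed on above (the same fit params forwarded unchanged to every fold), consider the sketch below. RecordingEstimator is a hypothetical stand-in that only records what its fit method receives, and the folds are built with plain NumPy rather than with the cross-validation object used in the actual implementation.

```python
import numpy as np


class RecordingEstimator:
    """Hypothetical stand-in for a classifier; records fit() arguments."""

    def __init__(self):
        self.calls = []

    def fit(self, X, y, **fit_params):
        self.calls.append((X.shape[0], dict(fit_params)))
        return self


X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 1, 2] * 4)

# Fold-independent parameters only: nothing here has shape (n_samples,)
clf_fit_params = {'eval_metric': 'mlogloss', 'verbose': False}

model = RecordingEstimator()
indices = np.arange(X.shape[0])
for test_index in np.array_split(indices, 3):
    train_index = np.setdiff1d(indices, test_index)
    # X and y are sliced per fold; the fit params are passed as-is
    model.fit(X[train_index], y[train_index], **clf_fit_params)
```

Because none of the parameters depend on the number of training samples, no per-fold slicing of fit params is needed, which is why this restricted version avoids the edge cases discussed earlier.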
Coverage increased (+1.3%) to 90.918% when pulling 162c159872cf9a421aee1be5bec7ffa8c7190b04 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df61f552f362e7db69281dcc4acdffe06cb4 on rasbt:master.
Thanks for the PR! Hm, it's weird that the unit tests on Windows for Py 2.7 fail. My first thought was that it could be due to banker's rounding in Python 3. I.e.,
Python 2.7:
>>> round(2.5)
3.0
Python 3:
>>> round(2.5)
2
But then, it wouldn't explain why the Py27 tests pass on Travis. Any idea about what might be going on?
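For reference, the rounding difference mentioned above can be reproduced and sidestepped in Python 3 with the standard decimal module. This is only a sketch of the rounding behavior itself; round_half_up is a hypothetical helper, and nothing here is claimed to be the cause of the failing tests.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, ndigits=0):
    """Hypothetical helper: round halves away from zero, like Python 2."""
    exponent = Decimal(1).scaleb(-ndigits)  # e.g. Decimal('0.01') for 2
    # Going through str() uses the short repr of the float, e.g. '0.945'
    rounded = Decimal(str(value)).quantize(exponent, rounding=ROUND_HALF_UP)
    return float(rounded)


# Python 3's built-in round() rounds exact halves to the nearest even value
builtin_results = (round(2.5), round(3.5))

# The helper rounds exact halves up instead
half_up_results = (round_half_up(2.5), round_half_up(0.945, 2))
```

A comparison such as scores_mean == 0.94 is sensitive to exactly this kind of difference, so a tolerance-based comparison may be more robust than an exact one.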
Yeah, it is really weird that the tests fail for 2.7 on AppVeyor, but not on Travis; not sure what's going on there. I have seen kind of random test failures happen before on some CI services. I'll try adding a small commit to see if re-running the tests fails again.
Coverage increased (+1.3%) to 90.918% when pulling 7ac78874a6b6c17d7cc0083d203309821756ac70 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df61f552f362e7db69281dcc4acdffe06cb4 on rasbt:master.
Sorry about the trouble with AppVeyor! I see something like
assert scores_mean == 0.94, scores_mean
AssertionError: 0.95
I am not sure why it suddenly occurs, it seems like some rounding error somewhere. I can't see anything in your new code additions (like integer division) that may cause this as those are "old" unit tests that fail. This is really weird!
Arg, I just found an issue with Travis CI. I.e., the unit tests were not properly executed in python 2.7. It's fixed in the master branch now.
Maybe try to sync your master branch of your fork (e.g., like described here: http://rasbt.github.io/mlxtend/contributing/#syncing-an-existing-fork)
and then you could rebase your fork on top of the master branch. I think the following should do it:
git checkout your_branch
git rebase upstream/master
Alternatively, instead of rebasing, you could execute the following four commands that should also fix the problem:
git checkout origin/master ci/.travis_install.sh
git checkout origin/master ci/.travis_test.sh
git checkout origin/master .travis.yml
git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
(the last one was due to a bug in Python 2.7 as the travis py27 tests didn't run recently, and Appveyor does not run plotting functions).
This will probably not solve the unit test issue, but at least we would know if there's something odd about Python 2.7 on Windows or in general (once more, Python 2.7 turns out to be an annoyance, time to replace it completely by Python 3.6 :))
Awesome, good to know! I'll update and see if we at least get consistent test failures.
once more, Python 2.7 turns out to be an annoyance, time to replace it completely by Python 3.6 :)
:+1:
Coverage increased (+0.1%) to 90.918% when pulling 2372300a40137f488b3f0e2849e85c51de52075f on jrbourbeau:add_fit_params_for_cross_val_score into 922f44f0189877e769131fca117550c51e2ee545 on rasbt:master.
Awesome, good to know! I'll update and see if we at least get consistent test failures.
Thanks! I am curious to see how that will turn out.
I believe the rebase worked fine, my commits for this PR are now on top of your commits related to Python 2.7 on Travis. But it looks like the tests are still passing on Travis, but not on AppVeyor :confused:
Oh hm, that's weird. It seems like the files are still the old ones and Travis is still not running those Py27 tests. Sorry about the inconvenience, but maybe this could be resolved via the following (after you synced your master branch with the upstream one)
git checkout origin/master ci/.travis_install.sh
git checkout origin/master ci/.travis_test.sh
git checkout origin/master .travis.yml
git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
Thanks for helping out with this!
Just to be clear, I've done
# Switch to master branch from feature branch
$ git checkout master
# Update my local master branch to be up-to-date with upstream
$ git fetch upstream
$ git merge upstream/master
# Update origin to also be synced up with upstream
$ git push origin master
which has synced up my master branch with upstream (your mlxtend repo). A quick inspection of git log on my local master branch matches the upstream log (https://github.com/rasbt/mlxtend/commits/master) and my origin log (https://github.com/jrbourbeau/mlxtend/commits/master), so I think everything should be up-to-date.
I then switched to my feature branch and did the checkouts you've provided
# Switch back to feature branch
$ git checkout add_fit_params_for_cross_val_score
# Checkout files from my updated origin/master
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py
This didn't seem to update any of my local files (at least git status didn't seem to indicate any changes have been made). So, if I'm not mistaken, then everything (origin, my local master, upstream, and my local feature branch) should all be up to date with one another.