
Incorrect documentation for `warm_start` behavior on BaseForest-derived classes

Open noahgolmant opened this issue 3 years ago • 6 comments

Describe the issue linked to the documentation

The RandomForestRegressor documentation states:

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

This is also true for all classes that derive from BaseForest, such as RandomForestClassifier and RandomTreesEmbedding.

However, the source code does not reflect this behavior. When n_more_estimators == 0, it does not fit a new forest and instead just recomputes the OOB score if applicable.
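This is easy to observe with a small sketch (the `make_classification` toy dataset here is just for illustration): calling `fit` a second time without increasing `n_estimators` leaves the ensemble untouched, rather than fitting a whole new forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

est = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
est.fit(X, y)
first_trees = list(est.estimators_)

# n_estimators is unchanged, so n_more_estimators == 0 on this call:
# the existing ensemble is reused (a UserWarning is emitted), not refit.
est.fit(X, y)
assert est.estimators_ == first_trees  # same tree objects, no new fit
```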

Suggest a potential alternative/fix

There are two potential fixes:

  1. Reword the documentation to state:

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, reuse the existing ensemble.

  2. Modify the actual behavior of `fit` so that it trains a whole new forest when n_more_estimators == 0, to reflect the existing documentation.

noahgolmant avatar Jul 01 '21 16:07 noahgolmant

Thanks for submitting an issue, @noahgolmant.

The current documentation is correct:

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

but it's lacking a few things. In particular, and I think this is where the confusion comes from, it does not specify that one needs to re-set the n_estimators parameter manually before calling fit a second time:

est = RandomForestClassifier(n_estimators=100)
est.set_params(n_estimators=200, warm_start=True)  # set warm_start and new nr of trees
est.fit(X_train, y_train) # fit additional 100 trees to est
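For illustration, this pattern can be checked end-to-end on a toy dataset (`make_classification` is just a stand-in for the user's `X_train`, `y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=200, random_state=0)

est = RandomForestClassifier(n_estimators=100, random_state=0)
est.fit(X_train, y_train)
assert len(est.estimators_) == 100

# Increase n_estimators before the second fit: only the 100 new trees
# are trained, and they are appended to the existing ensemble.
est.set_params(n_estimators=200, warm_start=True)
est.fit(X_train, y_train)
assert len(est.estimators_) == 200
```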

Regarding the documentation, I think we just need to link to https://scikit-learn.org/stable/modules/ensemble.html#fitting-additional-weak-learners.

Regarding the OOB computation: let's just remove it if n_more_estimators == 0 and only do it when n_more_estimators > 0. The input data should be the same anyway (using warm-start with different data makes no sense), so computing the OOB again won't change its value, but it's unnecessary.

NicolasHug avatar Jul 02 '21 08:07 NicolasHug

Can I work on this issue?

yashasvimisra2798 avatar Jul 17 '21 01:07 yashasvimisra2798

@yashasvimisra2798 sure, thanks!

NicolasHug avatar Jul 17 '21 08:07 NicolasHug

Hi @cmarmo , I am a beginner and I would like to help

ryuusama09 avatar Sep 13 '22 15:09 ryuusama09

@ryuusama09, welcome! If you are willing to work on this issue, please have a careful look at @NicolasHug's comment. Also, the contributor guide has all the information you need to start contributing.

cmarmo avatar Sep 13 '22 18:09 cmarmo

I'm ready to work on this issue if no one has started yet. @NicolasHug, could you please tell me exactly what changes you are expecting?

Pro-Coder04 avatar Sep 21 '22 10:09 Pro-Coder04

take

choo8 avatar Oct 04 '22 21:10 choo8

Hi, if no one is currently working on it, I'd like to contribute. Could you please assign it to me?

aniray2908 avatar Oct 07 '22 16:10 aniray2908

Hi @cmarmo, I've submitted a pull request (https://github.com/scikit-learn/scikit-learn/pull/24579) based on @NicolasHug's comments in https://github.com/scikit-learn/scikit-learn/issues/20435#issuecomment-872835169. Could you help me take a look at it? Thanks!

choo8 avatar Oct 07 '22 16:10 choo8

I am reopening this one as one of the tasks is still unaddressed:

Regarding the OOB computation: let's just remove it if n_more_estimators == 0 and only do it when n_more_estimators > 0. The input data should be the same anyway (using warm-start with different data makes no sense), so computing the OOB again won't change its value, but it's unnecessary.

Perhaps @choo8 would like to finalize it?

cmarmo avatar Nov 14 '22 00:11 cmarmo

We should repurpose the issue because this is not a documentation issue but more of a maintenance one.

glemaitre avatar Nov 14 '22 09:11 glemaitre

There is also #24764, which would be worth considering as part of this. The question there is what number of iterations an estimator should report when warm start is used. From the discussion there, it seems that increasing consistency between estimators would be useful.

betatim avatar Nov 14 '22 09:11 betatim

@cmarmo I would like to help with this issue. How should this PR be done given #24764?

choo8 avatar Nov 14 '22 14:11 choo8

For #24764, we need to make sure that we have the intended behaviour. I have not had time to look deeper into the issue.

glemaitre avatar Nov 14 '22 18:11 glemaitre

@choo8 , I believe your changes https://github.com/scikit-learn/scikit-learn/pull/24579/files/b2c28778c459d0c58fac1027c618c048543e0764 (with additional tests as specified in https://github.com/scikit-learn/scikit-learn/pull/24579#discussion_r1012729686) will solve this issue. The link with #24764 is important for maintainers to keep in mind general issues with warm start.

cmarmo avatar Nov 14 '22 20:11 cmarmo

Thank you for the comments, @cmarmo and @glemaitre, I will work on a PR incorporating what both of you have suggested.

choo8 avatar Nov 15 '22 00:11 choo8

Is this issue still open? I see that a fix has been merged. I believe the status of this issue should be updated, just to avoid unnecessary confusion.

amay1212 avatar May 01 '23 12:05 amay1212

@amay1212 This is still an open issue as stated in different messages. #26318 is apparently trying to solve the issue.

We still need this issue to be open.

glemaitre avatar May 04 '23 09:05 glemaitre

@glemaitre Thanks for the update!

amay1212 avatar May 04 '23 10:05 amay1212