scikit-learn
Incorrect documentation for `warm_start` behavior on BaseForest-derived classes
Describe the issue linked to the documentation
The `RandomForestRegressor` documentation states:

> When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

This is also true for all classes that derive from `BaseForest`, such as `RandomForestClassifier` and `RandomTreesEmbedding`.
However, the source code does not reflect this behavior. When `n_more_estimators == 0`, it does not fit a new forest; it just recomputes the OOB score, if applicable.
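To make the reported behavior concrete, here is a simplified pure-Python sketch of the warm-start branch. This is not the actual scikit-learn source; the function name, the string stand-ins for trees, and the overall structure are illustrative assumptions based on the behavior described above.

```python
# Simplified sketch of warm-start logic in forest fitting.
# NOT the actual scikit-learn implementation; names are illustrative.

def fit_forest(forest, n_estimators, warm_start):
    """forest: a list of already-fitted estimators (strings stand in for trees)."""
    if not warm_start:
        forest.clear()  # cold start: discard any previously fitted trees
    n_more_estimators = n_estimators - len(forest)
    if n_more_estimators < 0:
        raise ValueError(
            "n_estimators=%d must be >= len(forest)=%d when warm_start=True"
            % (n_estimators, len(forest))
        )
    if n_more_estimators == 0:
        # Nothing new is fitted: the existing ensemble is reused as-is.
        # (In the reported behavior, the OOB score is still recomputed here.)
        return forest
    # Fit only the additional trees, keeping the previously fitted ones.
    forest.extend(f"tree_{i}" for i in range(len(forest), n_estimators))
    return forest
```

Calling this twice with the same `n_estimators` and `warm_start=True` leaves the ensemble untouched, which is the "reuse the existing ensemble" behavior rather than the documented "fit a whole new forest".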
Suggest a potential alternative/fix
There are two potential fixes:
- Reword the documentation to state:

  > When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, reuse the existing ensemble.

- Modify the actual behavior of this method to fit a new forest when `n_more_estimators == 0`, so that it matches the existing documentation.
Thanks for submitting an issue, @noahgolmant.

The current documentation is correct:

> When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

but it is lacking a few things. In particular, and I think this is where the confusion comes from, it does not specify that one needs to re-set the `n_estimators` parameter manually before calling `fit` a second time:
```python
est = RandomForestClassifier(n_estimators=100)
est.fit(X_train, y_train)                          # fit the first 100 trees
est.set_params(n_estimators=200, warm_start=True)  # set warm_start and new nr of trees
est.fit(X_train, y_train)                          # fit 100 additional trees to est
```
Regarding the documentation, I think we just need to link to https://scikit-learn.org/stable/modules/ensemble.html#fitting-additional-weak-learners.
Regarding the OOB computation: let's just remove it if `n_more_estimators == 0` and only do it when `n_more_estimators > 0`. The input data should be the same anyway (using warm start with different data makes no sense), so computing the OOB again won't change its value; it is just unnecessary work.
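The proposed change amounts to guarding the OOB recomputation on whether trees were actually added. A hypothetical simplified sketch of that guard (the real logic lives inside `BaseForest.fit` and differs in detail):

```python
# Hypothetical simplified sketch of the proposed guard;
# not the actual BaseForest.fit code.

def should_recompute_oob(oob_score, n_more_estimators):
    """Return True only when the ensemble actually changed.

    Current behavior: OOB is recomputed whenever oob_score=True,
    even if n_more_estimators == 0 (same data -> same score, so redundant).
    Proposed behavior: skip the recomputation when no trees were added.
    """
    return oob_score and n_more_estimators > 0
```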
Can I work on this issue?
@yashasvimisra2798 sure, thanks!
Hi @cmarmo , I am a beginner and I would like to help
@ryuusama09, welcome! If you are willing to work on this issue, please have a careful look at @NicolasHug's comment. Also, the contributor guide has all the information you need to start contributing.
I'm ready to work on this issue if no one has started yet. @NicolasHug, could you please tell me exactly what changes you are expecting?
take
Hi, if no one is currently working on it, I'd like to contribute. Could you please assign it to me?
Hi @cmarmo, I've submitted a pull request (https://github.com/scikit-learn/scikit-learn/pull/24579) based on @NicolasHug's comments in https://github.com/scikit-learn/scikit-learn/issues/20435#issuecomment-872835169. Could you help me take a look at it? Thanks!
I am reopening this one as one of the tasks is still unaddressed:
> Regarding the OOB computation: let's just remove it if `n_more_estimators == 0` and only do it when `n_more_estimators > 0`. The input data should be the same anyway (using warm-start with different data makes no sense), so computing the OOB again won't change its value, but it's unnecessary.
Perhaps @choo8 would like to finalize it?
We should repurpose the issue because this is not a documentation issue but more of a maintenance one.
There is also #24764, which would be worth considering as part of this. There, the question is what number of iterations an estimator should report when warm start is used. From the discussion there, it seems that increasing consistency between estimators would be useful.
@cmarmo I would like to help with this issue. How should this PR be done given #24764?
For #24764, we need to make sure that we have the intended behaviour. I have not had time to look deeper into that issue.
@choo8, I believe your changes https://github.com/scikit-learn/scikit-learn/pull/24579/files/b2c28778c459d0c58fac1027c618c048543e0764 (with the additional tests specified in https://github.com/scikit-learn/scikit-learn/pull/24579#discussion_r1012729686) will solve this issue. The link to #24764 matters so that maintainers keep the more general warm-start issues in mind.
Thank you for the comments, @cmarmo and @glemaitre, I will work on a PR incorporating what both of you have suggested.
Is this issue still open? I see that the related PR has been merged. I believe the status of this issue should be updated, just to avoid unnecessary confusion.
@amay1212 This is still an open issue as stated in different messages. #26318 is apparently trying to solve the issue.
We still need this issue to be open.
@glemaitre Thanks for the update!