scikit-learn
FEA Add missing-value support for ExtraTreeClassifier and ExtraTreeRegressor
Reference Issues/PRs
Towards: #27931
Follow-up to: https://github.com/scikit-learn/scikit-learn/pull/26391 and https://github.com/scikit-learn/scikit-learn/pull/23595
What does this implement/fix? Explain your changes.
- Adds missing-value support to `RandomSplitter` in `_splitter.pyx`
- Enables missing-value support for `ExtraTreeClassifier` and `ExtraTreeRegressor`, which use the "random" splitter kwarg
- Adds unit tests for `ExtraTreeClassifier` and `ExtraTreeRegressor`
Any other comments?
Compared to BestSplitter, there is an extra design choice (and expected cost) when splitting on missing values, as we can either:
- choose a random threshold and then randomly send all missing values to the left/right child, OR
- send all missing values to the left/right child and all non-missing values to the other child
The direction in the second option can be chosen randomly, or both directions can actually be evaluated. The tradeoff imo comes from the assumption about the missingness mechanism:
- if values are missing completely at random (MCAR), option 1 is ideal because one can simply ignore the missing values or impute them
- if values are missing at random (MAR), option 2 is nice because the missingness of the data can itself be informative
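For illustration, here is a minimal NumPy sketch of the two options. The helper names are hypothetical; the actual implementation lives in the Cython `RandomSplitter` and differs in detail:

```python
import numpy as np

def split_random_threshold(x, rng):
    # Option 1 (sketch): draw a random threshold from the observed range,
    # then send ALL missing values to one randomly chosen child.
    # Hypothetical helper, not the actual scikit-learn implementation.
    missing = np.isnan(x)
    finite = x[~missing]
    threshold = rng.uniform(finite.min(), finite.max())
    goes_left = np.zeros_like(missing)          # boolean mask, all False
    goes_left[~missing] = finite <= threshold
    goes_left[missing] = rng.random() < 0.5     # NaNs all go to one side
    return goes_left                            # True -> left child

def split_on_missingness(x, send_missing_left=True):
    # Option 2 (sketch): all missing values go to one child,
    # all non-missing values go to the other.
    missing = np.isnan(x)
    return missing if send_missing_left else ~missing

x = np.array([0.5, np.nan, 1.5, 2.0, np.nan])
rng = np.random.default_rng(0)
print(split_random_threshold(x, rng))
print(split_on_missingness(x))  # [False  True False False  True]
```

Either way, all missing samples end up in the same child for a given split; the options differ only in how that child is chosen.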
However, I think the difference at the single-tree level is not very important. E.g., https://github.com/scikit-learn/scikit-learn/pull/28268 demonstrates that ExtraTrees combined as a forest remain resilient and predictive in the presence of missing values.
Benchmarks
There is some complexity involved in checking whether there are missing values. However, this check only occurs at the Python level, as shown by the following benchmark; in the Cython code there is no regression.
Benchmarks with and without Python Check
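Conceptually, the one-time Python-level check amounts to something like the following sketch (the helper name is hypothetical; scikit-learn's actual validation logic differs):

```python
import numpy as np

def contains_missing(X):
    # One-time Python-level scan for NaNs before dispatching to the
    # Cython splitter (hypothetical helper, not the real function name).
    X = np.asarray(X, dtype=np.float64)
    return bool(np.isnan(X).any())

print(contains_missing([[1.0, 2.0], [3.0, 4.0]]))     # False
print(contains_missing([[1.0, np.nan], [3.0, 4.0]]))  # True
```

Since this scan runs once per `fit` rather than once per split, its cost is amortized across the whole tree (and across the whole forest in the ensemble case).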
I also ran this benchmark for ExtraTrees, which demonstrates that the check is negligible at the forest level, since it only occurs once. See https://github.com/scikit-learn/scikit-learn/pull/28268, which contains the short code change to enable it for ExtraTrees.
Benchmarks on ExtraTrees
@thomasjpfan just a gentle ping to see if this is something you might be interested in and have time to review now?
Hi @thomasjpfan, just checking in to see if you still have the bandwidth to review this PR. It consolidates support for missing values in decision trees throughout the package.
Just wanted to follow up on this thread so it doesn't get lost, as I believe it's an important feature.
- missing-value support in ExtraTree (the ExtraTrees PR shows good performance when building a forest)
- while investigating a failing unit test on IsolationForest, I noticed we should also add this support to IsolationForest! Which is cool: IsolationForest previously could not handle missing values at all. As with any dataset containing missing values, whether this new feature helps on a given dataset is up to the user.
The bulk of the complexity was actually handled by Thomas in his previous PRs enabling missing-value support in regular decision trees, so I anticipate this is not a huge reviewing hurdle.
cc: @glemaitre @thomasjpfan, who have reviewed tree code in the past
OK, let me put it in my TODO. I think it would be nice to have support for all tree-based models.
Maybe @thomasjpfan @adrinjalali have a bit of time to look at this PR.
Thanks @OmarManzoor for the review!
Perhaps @glemaitre and @thomasjpfan can take one last look then?
I think this has changed a bit since @glemaitre approved, and I believe I've addressed most of the core issues raised.
Thanks everyone for reviewing!
Sorry for the delay, I was a bit underwater. LGTM, and I see that auto-merge was activated. Thanks @adam2392