scikit-learn
FEA Add missing-value support for ExtraTreeClassifier and ExtraTreeRegressor
Reference Issues/PRs
Towards: #27931
Follow-up to: https://github.com/scikit-learn/scikit-learn/pull/26391 and https://github.com/scikit-learn/scikit-learn/pull/23595
What does this implement/fix? Explain your changes.
- Adds missing-value support to `RandomSplitter` in `_splitter.pyx`
- Enables missing-value support for `ExtraTreeClassifier` and `ExtraTreeRegressor`, which use the "random" splitter kwarg
- Adds unit tests for `ExtraTreeClassifier` and `ExtraTreeRegressor`
Any other comments?
Compared to BestSplitter, there is an extra design choice (and expected cost) when splitting on missing values, as we can either:
- choose a random threshold and then randomly send all missing values to the left/right child, OR
- send all missing values to the left/right child and all non-missing values to the other child
The direction in the second option can be chosen randomly, or both directions can actually be evaluated. The tradeoff imo comes from the assumption about the missingness mechanism:
- if values are missing completely at random (MCAR), option 1 is ideal because one can simply ignore the missing values or impute them
- if values are missing at random (MAR), option 2 is nice because the missingness of the data can itself be informative
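For illustration, here is a minimal NumPy sketch of the two options. The helper names are hypothetical; the actual implementation lives in the Cython `RandomSplitter` and differs in detail:

```python
import numpy as np

def split_random_threshold(x, rng):
    # Option 1 (sketch): draw a random threshold from the observed range,
    # then send ALL missing values to one randomly chosen child.
    # Hypothetical helper, not the actual scikit-learn implementation.
    missing = np.isnan(x)
    finite = x[~missing]
    threshold = rng.uniform(finite.min(), finite.max())
    goes_left = np.zeros_like(missing)          # boolean mask, all False
    goes_left[~missing] = finite <= threshold
    goes_left[missing] = rng.random() < 0.5     # NaNs all go to one side
    return goes_left                            # True -> left child

def split_on_missingness(x, send_missing_left=True):
    # Option 2 (sketch): all missing values go to one child,
    # all non-missing values go to the other.
    missing = np.isnan(x)
    return missing if send_missing_left else ~missing

x = np.array([0.5, np.nan, 1.5, 2.0, np.nan])
rng = np.random.default_rng(0)
print(split_random_threshold(x, rng))
print(split_on_missingness(x))  # [False  True False False  True]
```

Either way, all missing samples end up in the same child for a given split; the options differ only in how that child is chosen.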
However, I think the difference at the single-tree level is not very important. E.g., https://github.com/scikit-learn/scikit-learn/pull/28268 demonstrates that ExtraTrees combined as a forest remain resilient and predictive in the presence of missing values.
Benchmarks
There is some complexity involved in checking whether there are missing values. However, this check only occurs at the Python level, as shown by the following benchmark; in the Cython code there is no regression.
Benchmarks with and without Python Check
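Conceptually, the one-time Python-level check amounts to something like the following sketch (the helper name is hypothetical; scikit-learn's actual validation logic differs):

```python
import numpy as np

def contains_missing(X):
    # One-time Python-level scan for NaNs before dispatching to the
    # Cython splitter (hypothetical helper, not the real function name).
    X = np.asarray(X, dtype=np.float64)
    return bool(np.isnan(X).any())

print(contains_missing([[1.0, 2.0], [3.0, 4.0]]))     # False
print(contains_missing([[1.0, np.nan], [3.0, 4.0]]))  # True
```

Since this scan runs once per `fit` rather than once per split, its cost is amortized across the whole tree (and across the whole forest in the ensemble case).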
I also ran this benchmark for ExtraTrees, which demonstrates that the check is negligible at the forest level, since it only occurs once. See https://github.com/scikit-learn/scikit-learn/pull/28268, which contains the short code change to enable it for ExtraTrees.
Benchmarks on ExtraTrees
@thomasjpfan just a gentle ping to see if this is something you might be interested in and have time to review now?
Hi @thomasjpfan, just checking in to see if you still have the bandwidth to review this PR. It consolidates support for missing values in decision trees throughout the package.
Just wanted to follow up on this thread so it doesn't get lost, as I believe it's an important feature.
- missing-value support in ExtraTree (the ExtraTrees PR shows good performance when building a forest)
- while investigating a failing unit test on IsolationForest, I noticed we should also add this support to IsolationForest! Which is cool: IsolationForest previously could not handle missing values at all. As with any dataset containing missing values, whether this new feature helps on a given dataset is up to the user.
The bulk of the complexity was actually handled by Thomas in his previous PRs enabling missing-value support in regular decision trees, so I anticipate this is not a huge reviewing hurdle.
cc: @glemaitre @thomasjpfan, who have reviewed tree code in the past
OK, let me put it in my TODO. I think it would be nice to have support for all tree-based models.
Maybe @thomasjpfan @adrinjalali have a bit of time to look at this PR.
Thanks @OmarManzoor for the review!
Perhaps @glemaitre and @thomasjpfan can take one last look then?
I think this has changed a bit since @glemaitre approved, and I believe I've addressed most of the core issues raised.
Thanks everyone for reviewing!
Sorry for the delay, I was a bit underwater. LGTM, and I see that auto-merge was activated. Thanks @adam2392