scikit-learn
ENH Monotonic Constraints for Tree-based models
Continuation of PR #7266 addressing issue #6656.
TODO:
- [x] ~implement logic for new trees in #12807~ (already done by #15582)
- [x] discuss API (not sure about introducing 2 new kwargs `increasing` and `decreasing`) => let's go with the `monotonic_cst` array
- [x] add tests
- [x] improve tests as the current ones don't fail despite erroneous implementation
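For context, the `monotonic_cst` array discussed above encodes one constraint per feature, mirroring the hist-GBDT convention from #15582: +1 for an increasing constraint, -1 for decreasing, 0 for unconstrained. A minimal sketch of that encoding (variable names are illustrative, not the PR's API):

```python
import numpy as np

# One entry per feature: +1 increasing, -1 decreasing, 0 unconstrained.
# This mirrors the HistGradientBoosting `monotonic_cst` convention.
monotonic_cst = np.array([1, 0, -1], dtype=np.int8)

increasing = np.flatnonzero(monotonic_cst == 1)   # features forced to increase
decreasing = np.flatnonzero(monotonic_cst == -1)  # features forced to decrease
print(list(increasing), list(decreasing))  # [0] [2]
```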
Very interested in this, please ping me when you have something!
I added a few tests to check that the implementation actually enforced the monotonicity constraints, but they are failing... Unless my tests are wrong, this suggests that the current implementation is not correct.
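A test of this kind can be sketched with plain NumPy: evaluate the model on a grid sorted along the constrained feature and check predictions never move against the constraint (`predict`, `feature` and `sign` are illustrative names here, not the PR's API):

```python
import numpy as np

def assert_monotonic(predict, X, feature, sign):
    """Check that `predict` is monotonic in `X[:, feature]`.

    sign=+1 asserts non-decreasing predictions, sign=-1 non-increasing.
    Assumes the other features (if any) are held fixed across rows.
    """
    X = np.asarray(X, dtype=np.float64)
    order = np.argsort(X[:, feature], kind="stable")
    y = np.asarray([predict(row) for row in X[order]])
    diffs = sign * np.diff(y)
    assert np.all(diffs >= 0), "monotonicity constraint violated"

# A toy model that increases with feature 0 passes the check:
assert_monotonic(lambda row: 2.0 * row[0], [[0.0], [1.0], [0.5]], feature=0, sign=1)
```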
Two issues:
- monotonicity for classification is with respect to class 0 => switch to class 1?
- a small fraction of rows (~5% for 1000 rows, which is hidden in the current tests...) fail to satisfy the monotonicity constraints -- I suspect this could be due to unfortunate `float64` to `float32` conversions during the `predict` call
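On the second point, the rounding effect is easy to reproduce: a gap that exists in `float64` can vanish after conversion to `float32`, which is enough to flip a strict monotonicity check on near-ties.

```python
import numpy as np

a = np.float64(1.0)
b = np.float64(1.0 + 1e-9)  # strictly greater than `a` in float64
assert b > a

# After conversion, the gap falls below float32 resolution (~1.2e-7
# near 1.0), so the strict ordering disappears.
a32, b32 = np.float32(a), np.float32(b)
print(b32 > a32)  # False: both round to exactly 1.0
```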
@samronsin LMK when you need reviews.
I've implemented monotonic constraints for the hist-GBDTs in https://github.com/scikit-learn/scikit-learn/pull/15582
It's still under review so your comments are more than welcome there. I think you'll also find some tests worth using. They helped me a great deal in debugging my code
monotonicity for classification is with respect to class 0 => switch to class 1?
Yes. Maybe this could be done by just reversing the constraint? Not sure.
a small fraction of rows (~5% for 1000 rows, which is hidden in the current tests...) fail to satisfy the monotonicity constraints -- I suspect this could be due to unfortunate float64 to float32 conversions during the predict call
You don't seem to be constraining the values of the children based on the average value of a pair of siblings. You'll need to introduce lower and upper bounds to the nodes values for monotonic constraints to be properly enforced (see https://github.com/scikit-learn/scikit-learn/pull/15582/files#diff-9bd2ee07fb6817a0aee54f959d32de8aR139).
Thanks a lot for your comments and the pointers @NicolasHug ! You are totally right, the current constraints in this PR are not sufficient to guarantee monotonicity. I don't see any way around adding lower and upper bounds to the Splitter in tree._splitter just like you did for the Splitter in ensemble._hist_gradient_boosting.splitting.
There is still a question regarding the choice of the values of these bounds, but I will most likely stick to using the parent's value for each new "inner" bound.
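As a rough illustration of the bound propagation under discussion (following #15582, which uses the midpoint of the two children's values as the new inner bound; the function and names below are illustrative, not the PR's actual Cython API):

```python
def child_bounds(lower, upper, left_value, right_value, sign):
    """Bounds for the children of a node split on a constrained feature.

    sign=+1: increasing constraint, so the left subtree is capped above
    by the midpoint and the right subtree is floored by it; sign=-1
    mirrors this; sign=0 leaves the feature unconstrained.
    """
    mid = (left_value + right_value) / 2.0
    if sign == 1:
        return (lower, mid), (mid, upper)
    if sign == -1:
        return (mid, upper), (lower, mid)
    return (lower, upper), (lower, upper)

left_b, right_b = child_bounds(float("-inf"), float("inf"), 1.0, 3.0, sign=1)
print(left_b, right_b)  # (-inf, 2.0) (2.0, inf)
```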
I also think the monotonic_cst parameter is better than having two lists of indices for decreasing and increasing constraints, so I will switch to this in order to keep consistency between the two APIs.
@NicolasHug I have a working version if you want to have a look.
I will try to run profiling on trees as they are right now in this PR and in master to assess if the performance is perceptibly hurt by the systematic call to check_monotonicity.
If this is the case, I can add a boolean skip_monotonicity in the splitters and tree builders to bypass these calls.
So tests look good with DecisionTreeClassifier and DecisionTreeRegressor but fail half of the time with ExtraTreeClassifier and ExtraTreeRegressor.
I might have missed a bit of the implementation!
I added back the check on the bounds even for unconstrained features, because it is actually needed to make sure other (unconstrained) features don't corrupt the tree, as explained in this comment.
There is still an issue with all the TreeClassifiers, but the regression trees pass all the tests.
This PR also needs an entry in the changelog in doc/whats_new/v1.0.rst.
You might want to use the instructions of #20301 to solve the conflicts caused by code style reformatting.
Since 1.0 has been released, the changelog should be updated in doc/whats_new/v1.1.rst instead.
@samronsin @jjerphan may I ask about the next steps here? If I correctly followed the issues, this would be the basis to have random forests with monotonic constraints. https://github.com/scikit-learn/scikit-learn/issues/18982
@samronsin @jjerphan may I ask about the next steps here? If I correctly followed the issues, this would be the basis to have random forests with monotonic constraints. #18982
That's a first step, yes.
Actually @mayer79 this PR already implements monotonicity constraints for trees ensembles (including Random Forests), so this would be it, no next step.
I'll address the last comments (thanks @jjerphan for the review !) to make it ready to merge asap.
Actually @mayer79 this PR already implements monotonicity constraints for trees ensembles (including Random Forests), so this would be it, no next step.
In this case, can this PR title be updated to mention ensemble models?
I'll address the last comments (thanks @jjerphan for the review !) to make it ready to merge asap.
Thanks! You might want to ping another reviewer explicitly then.
Actually @mayer79 this PR already implements monotonicity constraints for trees ensembles (including Random Forests), so this would be it, no next step.
I'll address the last comments (thanks @jjerphan for the review !) to make it ready to merge asap.
So cool - this is what I wanted to hear. Thanks a lot for the big effort!
@mayer79 Do you want to review?
Finally ran a quick benchmark. The results are not great: the impact on fit time, especially for sparse input, is quite noticeable... What do you think @ogrisel @jjerphan ?
ensemble.GradientBoostingClassifierBenchmark.peakmem_fit
================ ======= =======
branch main dev
---------------- ------- -------
representation
---------------- ------- -------
dense 85.9M 87.2M
sparse 128M 129M
================ ======= =======
ensemble.GradientBoostingClassifierBenchmark.peakmem_predict
================ ======= =======
branch main dev
---------------- ------- -------
representation
---------------- ------- -------
dense 80.4M 81.7M
sparse 90.5M 91.8M
================ ======= =======
ensemble.GradientBoostingClassifierBenchmark.time_fit
================ ============ ============
branch main dev
---------------- ------------ ------------
representation
---------------- ------------ ------------
dense 2.49±0.03s 2.55±0.03s
sparse 6.42±0.1s 7.40±0.1s
================ ============ ============
ensemble.GradientBoostingClassifierBenchmark.time_predict
================ ========== ==========
branch main dev
---------------- ---------- ----------
representation
---------------- ---------- ----------
dense 55.8±3ms 56.0±2ms
sparse 38.7±2ms 38.3±1ms
================ ========== ==========
ensemble.RandomForestClassifierBenchmark.peakmem_fit
================ ====== ======
branch main dev
---------------- ------ ------
representation
================ ====== ======
dense 168M 169M
sparse 385M 386M
================ ====== ======
ensemble.RandomForestClassifierBenchmark.peakmem_predict
================ ====== ======
branch main dev
---------------- ------ ------
representation
================ ====== ======
dense 168M 169M
sparse 385M 386M
================ ====== ======
ensemble.RandomForestClassifierBenchmark.time_fit
================ ============ ============
branch main dev
---------------- ------------ ------------
representation
================ ============ ============
dense 5.96±0.08s 7.34±0.04s
sparse 11.0±0.1s 16.0±0.2s
================ ============ ============
ensemble.RandomForestClassifierBenchmark.time_predict
================ ============ ============
branch main dev
---------------- ------------ ------------
representation
================ ============ ============
dense 207±7ms 209±4ms
sparse 1.43±0.04s 1.47±0.03s
================ ============ ============
Finally ran a quick benchmark. The results are not great: the impact on fit time, especially for sparse input, is quite noticeable... What do you think @ogrisel @jjerphan ?
Do you confirm that those are numbers with monotonic_cst left to the default value in the PR (no active constraints)?
The performance impact on fitting random forests when no constraints are active on any features is too significant to be acceptable.
I think the cause of this overhead is that we check https://github.com/scikit-learn/scikit-learn/pull/13649/files#r417995287 .
There should be a bypass to disable this check when monotonic_cst is 0 for all the features.
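The bypass could be as simple as computing a flag once at fit time and guarding every per-split check with it (a sketch; `with_monotonic_cst` is an illustrative name):

```python
import numpy as np

# Computed once at fit time; the per-split code then skips all bound
# bookkeeping and monotonicity checks when the flag is unset.
monotonic_cst = np.zeros(10, dtype=np.int8)  # default: no constraints
with_monotonic_cst = bool(np.any(monotonic_cst != 0))
print(with_monotonic_cst)  # False -> fast path, checks skipped entirely
```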
Do you confirm that those are numbers with monotonic_cst left to the default value in the PR (no active constraints)?
Yes, no active constraint on the benchmarks.
There should be a bypass to disable this check when monotonic_cst is 0 for all the features.
Yes, I agree, thanks for the suggestion!
03ee5f4 seems to have fixed performance regressions:
asv benchmarks
→ asv continuous -b RandomForest -b GradientBoostingClassifier main monotonic-trees
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 03ee5f40 <monotonic-trees> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl.
· Running 36 total benchmarks (2 commits * 1 environments * 18 benchmarks)
[ 0.00%] · For scikit-learn commit 03ee5f40 <monotonic-trees> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 2.78%] ··· Setting up ensemble:64 ok
[ 2.78%] ··· ensemble.GradientBoostingClassifierBenchmark.peakmem_fit ok
[ 2.78%] ··· ================ =======
representation
---------------- -------
dense 77.4M
sparse 119M
================ =======
[ 5.56%] ··· ensemble.GradientBoostingClassifierBenchmark.peakmem_predict ok
[ 5.56%] ··· ================ =======
representation
---------------- -------
dense 75.1M
sparse 84.6M
================ =======
[ 8.33%] ··· ensemble.GradientBoostingClassifierBenchmark.time_fit ok
[ 8.33%] ··· ================ ============
representation
---------------- ------------
dense 2.58±0.03s
sparse 7.09±0.1s
================ ============
[ 11.11%] ··· ensemble.GradientBoostingClassifierBenchmark.time_predict ok
[ 11.11%] ··· ================ ============
representation
---------------- ------------
dense 49.9±4ms
sparse 30.3±0.3ms
================ ============
[ 13.89%] ··· ensemble.GradientBoostingClassifierBenchmark.track_test_score ok
[ 13.89%] ··· ================ =====================
representation
---------------- ---------------------
dense 0.5456669046657192
sparse 0.10409974329281042
================ =====================
[ 16.67%] ··· ensemble.GradientBoostingClassifierBenchmark.track_train_score ok
[ 16.67%] ··· ================ =====================
representation
---------------- ---------------------
dense 0.631552440458505
sparse 0.15180008167538628
================ =====================
[ 19.44%] ··· Setting up ensemble:103 ok
[ 19.44%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit 87.1M
[ 22.22%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict 76.2M
[ 25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit 4.34±0.6s
[ 27.78%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict 136±9ms
[ 30.56%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score 0.7230709112942986
[ 33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score 0.9812160155622751
[ 36.11%] ··· Setting up ensemble:24 ok
[ 36.11%] ··· ensemble.RandomForestClassifierBenchmark.peakmem_fit ok
[ 36.11%] ··· ================ ======
-- n_jobs
---------------- ------
representation 1
================ ======
dense 163M
sparse 378M
================ ======
[ 38.89%] ··· ensemble.RandomForestClassifierBenchmark.peakmem_predict ok
[ 38.89%] ··· ================ ======
-- n_jobs
---------------- ------
representation 1
================ ======
dense 164M
sparse 378M
================ ======
[ 41.67%] ··· ensemble.RandomForestClassifierBenchmark.time_fit ok
[ 41.67%] ··· ================ ===========
-- n_jobs
---------------- -----------
representation 1
================ ===========
dense 6.50±0.3s
sparse 12.9±0.5s
================ ===========
[ 44.44%] ··· ensemble.RandomForestClassifierBenchmark.time_predict ok
[ 44.44%] ··· ================ ============
-- n_jobs
---------------- ------------
representation 1
================ ============
dense 262±8ms
sparse 1.38±0.01s
================ ============
[ 47.22%] ··· ensemble.RandomForestClassifierBenchmark.track_test_score ok
[ 47.22%] ··· ================ ====================
-- n_jobs
---------------- --------------------
representation 1
================ ====================
dense 0.7526871247608341
sparse 0.8656423941766682
================ ====================
[ 50.00%] ··· ensemble.RandomForestClassifierBenchmark.track_train_score ok
[ 50.00%] ··· ================ ====================
-- n_jobs
---------------- --------------------
representation 1
================ ====================
dense 0.997026844012917
sparse 0.9996123288718864
================ ====================
[ 50.00%] · For scikit-learn commit 80ebe21e <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl..
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 52.78%] ··· Setting up ensemble:64 ok
[ 52.78%] ··· ensemble.GradientBoostingClassifierBenchmark.peakmem_fit ok
[ 52.78%] ··· ================ =======
representation
---------------- -------
dense 77.6M
sparse 118M
================ =======
[ 55.56%] ··· ensemble.GradientBoostingClassifierBenchmark.peakmem_predict ok
[ 55.56%] ··· ================ =======
representation
---------------- -------
dense 75.5M
sparse 85M
================ =======
[ 58.33%] ··· ensemble.GradientBoostingClassifierBenchmark.time_fit ok
[ 58.33%] ··· ================ ===========
representation
---------------- -----------
dense 2.50±0.1s
sparse 7.65±0.1s
================ ===========
[ 61.11%] ··· ensemble.GradientBoostingClassifierBenchmark.time_predict ok
[ 61.11%] ··· ================ ============
representation
---------------- ------------
dense 51.3±0.9ms
sparse 31.0±1ms
================ ============
[ 63.89%] ··· ensemble.GradientBoostingClassifierBenchmark.track_test_score ok
[ 63.89%] ··· ================ =====================
representation
---------------- ---------------------
dense 0.5456669046657192
sparse 0.10409974329281042
================ =====================
[ 66.67%] ··· ensemble.GradientBoostingClassifierBenchmark.track_train_score ok
[ 66.67%] ··· ================ =====================
representation
---------------- ---------------------
dense 0.631552440458505
sparse 0.15180008167538628
================ =====================
[ 69.44%] ··· Setting up ensemble:103 ok
[ 69.44%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit 87.5M
[ 72.22%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict 76.3M
[ 75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit 3.15±0.3s
[ 77.78%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict 127±20ms
[ 80.56%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score 0.7230709112942986
[ 83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score 0.9812160155622751
[ 86.11%] ··· Setting up ensemble:24 ok
[ 86.11%] ··· ensemble.RandomForestClassifierBenchmark.peakmem_fit ok
[ 86.11%] ··· ================ ======
-- n_jobs
---------------- ------
representation 1
================ ======
dense 163M
sparse 378M
================ ======
[ 88.89%] ··· ensemble.RandomForestClassifierBenchmark.peakmem_predict ok
[ 88.89%] ··· ================ ======
-- n_jobs
---------------- ------
representation 1
================ ======
dense 164M
sparse 378M
================ ======
[ 91.67%] ··· ensemble.RandomForestClassifierBenchmark.time_fit ok
[ 91.67%] ··· ================ ===========
-- n_jobs
---------------- -----------
representation 1
================ ===========
dense 6.20±0.1s
sparse 11.2±0.1s
================ ===========
[ 94.44%] ··· ensemble.RandomForestClassifierBenchmark.time_predict ok
[ 94.44%] ··· ================ ============
-- n_jobs
---------------- ------------
representation 1
================ ============
dense 270±5ms
sparse 1.38±0.04s
================ ============
[ 97.22%] ··· ensemble.RandomForestClassifierBenchmark.track_test_score ok
[ 97.22%] ··· ================ ====================
-- n_jobs
---------------- --------------------
representation 1
================ ====================
dense 0.7526871247608341
sparse 0.8656423941766682
================ ====================
[100.00%] ··· ensemble.RandomForestClassifierBenchmark.track_train_score ok
[100.00%] ··· ================ ====================
-- n_jobs
---------------- --------------------
representation 1
================ ====================
dense 0.997026844012917
sparse 0.9996123288718864
================ ====================
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
Removing from the milestone since it's stalled for now
@samronsin: thanks for pursuing your work. :handshake:
Do you need help with the recent refactoring changes made on the heaps?
Thanks for your suggestions @jjerphan !
I managed to have most of it working by simply moving the check_monotonicity inline method from Splitter to Criterion. This seems required because the types of sum_left, sum_right and sum_total are now different in RegressionCriterion and ClassificationCriterion.
One notable exception is ExtraTree{Classifier,Regressor} with sparse inputs (as seen in test_tree.py)... Not sure how hard this is to fix, and whether having another branch would help or not.
It should be OK now 😅!
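For reference, a node-level monotonicity check along the lines discussed in this thread might look like the following pure-Python sketch (the actual code lives in Cython on the Criterion; the name and signature here are illustrative):

```python
def check_monotonicity(sign, lower, upper, value_left, value_right):
    """Return True if a candidate split respects the constraint.

    Both children's values must stay within the node's [lower, upper]
    bounds, and the children must be ordered according to `sign`
    (+1 for an increasing feature, -1 for a decreasing one).
    """
    in_bounds = (lower <= value_left <= upper) and (lower <= value_right <= upper)
    ordered = sign * (value_right - value_left) >= 0
    return in_bounds and ordered

print(check_monotonicity(1, 0.0, 10.0, 2.0, 5.0))  # True
print(check_monotonicity(1, 0.0, 10.0, 5.0, 2.0))  # False: wrong order
```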
One notable exception is ExtraTree{Classifier,Regressor} with sparse inputs (as seen in test_tree.py)...
This makes me think that we currently don't test any of the sparse splitters in test_monotonic_tree.py so this should probably be considered as well.
There was a low-level Windows fatal exception: access violation during the execution of the fit call at line 949 in test_min_impurity_decrease on one Windows worker, and a similar problem at line 330 in test_nd_tree_nodes_values:
Current thread 0x00001410 (most recent call first):
File "C:\Miniconda\envs\testvenv\lib\site-packages\sklearn\tree\_classes.py", line 495 in fit
File "C:\Miniconda\envs\testvenv\lib\site-packages\sklearn\tree\_classes.py", line 1015 in fit
File "C:\Miniconda\envs\testvenv\lib\site-packages\sklearn\tree\tests\test_tree.py", line 949 in test_min_impurity_decrease
Current thread 0x00001b50 (most recent call first):
File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\sklearn\tree\_classes.py", line 495 in fit
File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\sklearn\tree\_classes.py", line 1395 in fit
File "C:\hostedtoolcache\windows\Python\3.8.10\x86\lib\site-packages\sklearn\tree\tests\test_monotonic_tree.py", line 330 in test_nd_tree_nodes_values
[...]
The Linux and macOS workers completed successfully, so it seems to be a Windows-specific problem.
Line 495 of sklearn/tree/_classes.py is:
builder.build(self.tree_, X, y, sample_weight)
but then the Python-level crash report unfortunately does not include the native backtrace. I guess we will need a Windows VM to investigate what's going on, but even with that I am not familiar enough with Windows compilers and debugging tools to know how to get such a post-mortem backtrace.
There was a low-level Windows fatal exception: access violation during the execution of the fit call at line 949 in test_min_impurity_decrease on one Windows worker, and a similar problem at line 330 in test_nd_tree_nodes_values: [...] The Linux and macOS workers completed successfully, so it seems to be a Windows-specific problem.
Huh, I'm unable to reproduce on my Windows system. Maybe let's merge with main and let the CI run again?
System Info
System:
python: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
executable: R:\ProgramFiles\anaconda3\envs\scikit-dev\python.exe
machine: Windows-10-10.0.19043-SP0
Python dependencies:
sklearn: 1.2.dev0
pip: 21.2.4
setuptools: 58.0.4
numpy: 1.20.3
scipy: 1.7.3
Cython: 0.29.26
pandas: 1.3.5
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: R:\ProgramFiles\anaconda3\envs\scikit-dev\Lib\site-packages\numpy\.libs\libopenblas.EL2C6PLE4ZYW3ECEVIV3OXXGRN2NRFM2.gfortran-win_amd64.dll
version: 0.3.18
threading_layer: pthreads
architecture: Zen
num_threads: 12
user_api: openmp
internal_api: openmp
prefix: vcomp
filepath: C:\Windows\System32\vcomp140.dll
version: None
num_threads: 12