scikit-tree
scikit-tree copied to clipboard
Sparse Implementation of `build_coleman_forest` plus a fix for bottleneck import
Reference Issues/PRs
Fixes #310
What does this implement/fix? Explain your changes.
This adds a sparse implementation of build_coleman_forest to treeple.stats. The sparse implementation relies on scipy.stats.sparse and so only works in the case of binary classification and regression. In my tests using cProfile/memray, the sparse implementation is a little over 50% faster and uses about %7 less memory, see pdf for more information. I am also attaching the profiling data
if you'd like to take a look.
This PR also includres a change to the bottleneck implementation to for the dense implementation of build_coleman_forest. Specifically there are two important changes:
- The check for the existence of the bottleneck package is now done using
importlib.util.find_spec("bottleneck"). The old check usingsysseemed to work fine and passed tests, but actually didn't seem work unless bottleneck had been imported before, which when the tests are run is true, but may not be the case in other situations - the aliased and lambda functions
nanmean_fandanynan_fare now defined usingdefso that the warning message can moved into their respective calls rather than at import.
Any other comments?
The scipy.stats sparse implementation should be swapped out for pydata.sparse when that implementation is performant enough, as it is more general.
Codecov Report
Attention: Patch coverage is 89.79592% with 10 lines in your changes missing coverage. Please review.
Project coverage is 80.27%. Comparing base (
1d970b0) to head (d733508). Report is 3 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| treeple/stats/forest.py | 81.39% | 5 Missing and 3 partials :warning: |
| treeple/stats/utils.py | 96.36% | 0 Missing and 2 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #317 +/- ##
==========================================
+ Coverage 80.00% 80.27% +0.26%
==========================================
Files 24 24
Lines 2221 2296 +75
Branches 411 422 +11
==========================================
+ Hits 1777 1843 +66
- Misses 312 317 +5
- Partials 132 136 +4
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@adam2392 I have updated the PR, let me know if I missed anything or if you want to see some more changes.
@adam2392 I added some documentation to address your comments. Let me know what you think.
I will let @sampan501 @SUKI-O @YuxinB @PSSF23 take a look as well.
@sampan501 @adam2392 Hey I added a test, but it looks like I some other tests are failing. I submitted an issue, #328. I'll start looking into it, but it might go faster if you have an intuition on what the issue is.
This is gonna be an issue due to parameter validation in scikit-learn. I will push a fix when I have time. It seems the rest of the code works tho.
I would prioritize other issues until I have the time to push up the fix
@adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501
I'm happy to merge if @adam2392 is cool with it
@adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501
I will merge once the fix is added and the CI is confirmed to be happy. No rush on this one.
@ryanhausen if you update wrt main, the CIs should be fixed
Thanks @ryanhausen !