
Improve ProgLearn oblique tree speed

Open parthgvora opened this issue 4 years ago • 5 comments

Cythonize the bottlenecks in the ProgLearn oblique tree splitter to speed up training times. The main bottleneck is finding the next best split.

parthgvora avatar Jan 21 '21 21:01 parthgvora

please assign me to this

parthgvora avatar Jan 21 '21 21:01 parthgvora

I took a look at the sklearn Cython tree code. I think a potential course of action is to copy the sklearn.tree.DecisionTreeClassifier code and its entire Cython backend into the proglearn repo, then modify the Cython backend to add oblique splits and update the sklearn.tree.DecisionTreeClassifier class as needed.

This would allow for straightforward, compatible integration with sklearn, and it reuses their optimized code to the fullest extent.

Some thoughts about modifying their Cython code: a lot of functionality, like sparse data and feature importances, can be ignored to start. I think an MVP could simply sum together a fixed number of sampled features (equivalent to a dot product with all weights equal to 1). Once we verify this works, we can iterate: add weights to make it a true dot product instead of a plain sum, control how many features to sample with a constructor parameter, etc.
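As a rough illustration of that MVP (a sketch, assuming NumPy arrays; the function name `oblique_projection` and its signature are hypothetical, not part of the repo):

```python
import numpy as np

def oblique_projection(X, n_sampled, rng, weights=None):
    """Project each row of X onto a random combination of features.

    MVP: sum a fixed number of sampled features, which is a dot
    product with all-ones weights. Passing `weights` generalizes
    this to an arbitrary sparse linear combination.
    """
    n_features = X.shape[1]
    idx = rng.choice(n_features, size=n_sampled, replace=False)
    if weights is None:
        weights = np.ones(n_sampled)  # MVP: unweighted sum
    return X[:, idx] @ weights, idx

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
proj, idx = oblique_projection(X, n_sampled=3, rng=rng)
# With all-ones weights, the projection is just the row-wise sum
# of the sampled columns.
assert np.allclose(proj, X[:, idx].sum(axis=1))
```

The tree then searches for the best threshold over `proj` exactly as it would over a single axis-aligned feature, so the rest of the split machinery stays unchanged.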

rflperry avatar Jan 21 '21 23:01 rflperry

@rflperry I have some objections to the copying idea, as copying the classifier code basically means copying over the entire sklearn.tree folder. I found it very hard to locate the code we really need, and the easiest path to sklearn integration would be working directly on a fork.

That being said, @parthgvora, if you need it, I think I can provide some help on how sklearn currently structures its tree classifiers. And good luck with Cython! It's pretty confusing in my opinion...

PSSF23 avatar Jan 22 '21 03:01 PSSF23

I personally am not averse to copying over the entire folder. In some ways it seems easier than writing Cython functions for parts of the forest fully from scratch. That Cython code does have a lot of functionality we don't need to worry about modifying for now, but it can simply be deleted and then added back or modified later (e.g. the sparse X case, feature importance, etc.). But that's just my take; it's up to @parthgvora.

rflperry avatar Jan 22 '21 14:01 rflperry

To be honest, I think Hao is right here. Having tried this, there's a ton of stuff in the sklearn.tree folder we'd need to sort out to do what we need. Also, dealing with the GIL is a real pain in Cython. So I'm averse to copying the folder directly.

That said, we should keep their code in mind when writing our own Cython to make integration easier. We can also take inspiration from the way they structured their code and do smaller copy-pastes as necessary.

So, I think we should Cythonize piece by piece, starting with the splitter. We're probably going to need our own splitter for oblique splits anyway, and it's the slowest part. I'm going to write it out in Cython as best I can, using their code for guidance at times. Then, if we want, we can keep adding sklearn functionality with small copy-pastes.
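For reference, the inner loop a Cython splitter would have to accelerate is the exhaustive best-threshold search over one (projected) feature. A minimal pure-Python sketch using Gini impurity, written from scratch here rather than taken from either codebase:

```python
import numpy as np

def best_split(proj, y):
    """Find the threshold on `proj` minimizing weighted Gini impurity.

    Sorts once, then sweeps candidate thresholds while updating
    left/right class counts incrementally. This O(n log n) sweep,
    run once per candidate split, is the bottleneck a Cython
    splitter would replace.
    """
    order = np.argsort(proj)
    proj, y = proj[order], y[order]
    classes = np.unique(y)
    n = len(y)
    left = np.zeros(len(classes))
    right = np.array([(y == c).sum() for c in classes], dtype=float)
    best_gini, best_thresh = np.inf, None
    for i in range(n - 1):
        c = np.searchsorted(classes, y[i])
        left[c] += 1   # sample i moves from the right side...
        right[c] -= 1  # ...to the left side
        if proj[i] == proj[i + 1]:
            continue  # no valid threshold between equal values
        nl, nr = i + 1, n - i - 1
        gini = (nl * (1.0 - ((left / nl) ** 2).sum())
                + nr * (1.0 - ((right / nr) ** 2).sum())) / n
        if gini < best_gini:
            best_gini = gini
            best_thresh = (proj[i] + proj[i + 1]) / 2.0
    return best_thresh, best_gini

# Perfectly separable toy data: the best threshold sits between
# the two classes and yields zero impurity.
thresh, gini = best_split(np.array([0.0, 1.0, 2.0, 3.0]),
                          np.array([0, 0, 1, 1]))
assert thresh == 1.5 and gini == 0.0
```

In Cython this becomes typed memoryview loops with the GIL released, which is where the speedup comes from.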

Also, implementing the splitter first should speed up the code enough to let someone implement and test MORF in Python while we keep working on Cythonizing.

parthgvora avatar Jan 22 '21 16:01 parthgvora