scikit-tree
sparse / slow
@adam2392 When @jdey4 runs MORF, it takes a long time. When we build the projection matrix in oblique trees, is it stored in a sparse matrix format? If the matrix is mathematically sparse but not stored in a sparse format, then we could save a lot of RAM and time by switching to one.
I asked @jdey4 if he could post a GH issue, so I'm unsure how he's running things. It is true that MORF is not very well tested and benchmarked currently.
https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/tree/manifold/_morf_splitter.pyx#L273-L307 shows that the projection matrix is handled in a sparse format: for each projection, a vector of feature indices and a vector of weights. Only non-zero weights are stored.
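The index/weight scheme described above can be sketched in plain NumPy (illustrative code, not scikit-tree's internals — `apply_projection` is a hypothetical helper name):

```python
import numpy as np

# One candidate projection = (feature indices, weights); only nonzeros are
# stored, mirroring the scheme in the linked _morf_splitter.pyx.
def apply_projection(X, feat_idx, weights):
    """Project samples onto one sparse oblique direction."""
    return X[:, feat_idx] @ weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 1000))
feat_idx = np.array([10, 250, 999])   # nonzero feature indices
weights = np.array([1.0, -1.0, 1.0])  # their weights
proj = apply_projection(X, feat_idx, weights)
print(proj.shape)  # (5,) — one projected value per sample
```

Applying the projection this way touches only the stored columns, so cost scales with the number of nonzeros rather than with `n_features`.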
Possibly something for Edward's team et al. to consider? @jovo
It would be nice to have some measure of performance that we can run from n_samples 100 to >> 100.
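Such a measurement could be sketched as follows (an illustrative harness only; scikit-learn's `RandomForestClassifier` stands in for the oblique/MORF estimators here, which expose the same `fit` API, and the sizes and data are synthetic):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in estimator

rng = np.random.default_rng(0)
results = {}
for n_samples in (100, 1_000, 10_000):
    # Synthetic data: 50 features, label determined by the first two.
    X = rng.standard_normal((n_samples, 50))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    t0 = time.perf_counter()
    clf.fit(X, y)
    results[n_samples] = time.perf_counter() - t0
    print(f"n_samples={n_samples:>6}: fit in {results[n_samples]:.3f}s")
```

Swapping in `ObliqueRandomForestClassifier` (and a MORF splitter) on the same grid would give a like-for-like scaling curve.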
Hi @adam2392 I used the following code snippet to train SPORF:

```python
from sklearn.model_selection import train_test_split
from sktree import ObliqueRandomForestClassifier

x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y)
clf_sporf = ObliqueRandomForestClassifier(n_estimators=100, max_features=20)
clf_sporf.fit(x_train, y_train)
```
X has a shape of (2368, 3498706), and the above code runs fine, taking about 50 mins to train. But if I increase `max_features` to 100, it exhausts my RAM (64 GB, Apple M1 Max).
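A back-of-the-envelope calculation suggests why the jump to `max_features=100` hurts: if a candidate projection matrix of shape `(max_features, n_features)` were ever materialized densely, that single array would already be huge (illustrative arithmetic only, not a claim about the actual allocation path in scikit-tree):

```python
n_features = 3_498_706   # X.shape[1] from the dataset above
max_features = 100
bytes_per_float64 = 8

dense_bytes = max_features * n_features * bytes_per_float64
print(f"{dense_bytes / 1e9:.1f} GB per dense (max_features, n_features) matrix")
# ≈ 2.8 GB for a single split's candidate matrix, before any tree state
```

With only a handful of nonzero weights per projection, the sparse index/weight representation needs kilobytes instead.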
Ah I see. That's interesting. I wouldn't expect that to happen. How many trees are you training simultaneously?
For now I am using 100 trees, but I would love to use 1000 trees.
Sorry, I am asking how many jobs you are training in parallel. I.e., if you're training 100 trees in parallel, I am less surprised that you're running out of RAM.
Ah, I was using the default parameters, for which n_jobs=None.
Ah I see... that is then training 1 tree at a time. Can you tell me:
- How deep is one tree?
- If you run `clf.estimators_[0].tree_.get_projection_matrix()`, what does an example projection matrix look like (maybe plot a heatmap?), and what is its shape?
I tried MORF on brain MRI data with X.shape=(2206, 3498706). The server that I used has 754 GB of memory. I used only 1 worker and the code broke. When I try to fit MORF with 100 features, it works.
@adam2392 here is my code snippet. It works for max_patch_dims=(3,3,3).
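For context on what `max_patch_dims=(3,3,3)` implies: a MORF-style 3-D patch selects a contiguous block of voxels whose flattened indices become the nonzero entries of one projection. A minimal sketch (the volume shape and patch origin here are hypothetical; 152³ ≈ 3.5M voxels, roughly the feature count above):

```python
import numpy as np

vol_shape = (152, 152, 152)  # hypothetical volume; 152**3 ≈ 3.5M voxels
patch_dims = (3, 3, 3)
corner = (10, 20, 30)        # arbitrary patch origin

# Enumerate the voxel coordinates inside the patch, then flatten them
# into column indices of the (n_samples, n_features) data matrix.
idx = np.stack(np.meshgrid(
    *[np.arange(c, c + p) for c, p in zip(corner, patch_dims)],
    indexing="ij"), axis=-1).reshape(-1, 3)
flat_idx = np.ravel_multi_index(idx.T, vol_shape)
print(len(flat_idx))  # 27 voxels per (3, 3, 3) patch
```

So each (3,3,3) projection has only 27 nonzeros; larger patch dims grow the nonzero count cubically, which may explain the memory sensitivity.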