scikit-tree
sparse / slow
@adam2392 When @jdey4 runs MORF, it takes a long time. When we build the projection matrix in oblique trees, is it stored in a sparse matrix format? If the matrix is mathematically sparse but not stored in a sparse format, then we could save a lot of RAM and time by switching to one.
I asked @jdey4 if he could post a GH issue, so I'm unsure how he's running things. It is true that MORF is not very well tested and benchmarked currently.
https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/tree/manifold/_morf_splitter.pyx#L273-L307 shows that the projection matrix is handled in a sparse format: for each projection, a vector of feature indices and a vector of weights. Only non-zero weights are stored.
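The index/weight scheme described above can be sketched in plain NumPy (illustrative code, not scikit-tree's internals — `apply_projection` is a hypothetical helper name):

```python
import numpy as np

# One candidate projection = (feature indices, weights); only nonzeros are
# stored, mirroring the scheme in the linked _morf_splitter.pyx.
def apply_projection(X, feat_idx, weights):
    """Project samples onto one sparse oblique direction."""
    return X[:, feat_idx] @ weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 1000))
feat_idx = np.array([10, 250, 999])   # nonzero feature indices
weights = np.array([1.0, -1.0, 1.0])  # their weights
proj = apply_projection(X, feat_idx, weights)
print(proj.shape)  # (5,) — one projected value per sample
```

Applying the projection this way touches only the stored columns, so cost scales with the number of nonzeros rather than with `n_features`.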
Possibly something for Edward's team et al. to consider? @jovo
It would be nice to have some measure of performance that we can run from n_samples 100 to >> 100.
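Such a measurement could be sketched as follows (an illustrative harness only; scikit-learn's `RandomForestClassifier` stands in for the oblique/MORF estimators here, which expose the same `fit` API, and the sizes and data are synthetic):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in estimator

rng = np.random.default_rng(0)
results = {}
for n_samples in (100, 1_000, 10_000):
    # Synthetic data: 50 features, label determined by the first two.
    X = rng.standard_normal((n_samples, 50))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    t0 = time.perf_counter()
    clf.fit(X, y)
    results[n_samples] = time.perf_counter() - t0
    print(f"n_samples={n_samples:>6}: fit in {results[n_samples]:.3f}s")
```

Swapping in `ObliqueRandomForestClassifier` (and a MORF splitter) on the same grid would give a like-for-like scaling curve.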
Hi @adam2392 I used the following code snippet to train SPORF:

```python
from sklearn.model_selection import train_test_split
from sktree import ObliqueRandomForestClassifier

x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y)
clf_sporf = ObliqueRandomForestClassifier(n_estimators=100, max_features=20)
clf_sporf.fit(x_train, y_train)
```
X has a shape of (2368, 3498706), and the above code runs fine, taking about 50 mins to train. But if I increase `max_features` to 100, it exhausts my RAM (64 GB, Apple M1 Max).
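A back-of-the-envelope calculation suggests why the jump to `max_features=100` hurts: if a candidate projection matrix of shape `(max_features, n_features)` were ever materialized densely, that single array would already be huge (illustrative arithmetic only, not a claim about the actual allocation path in scikit-tree):

```python
n_features = 3_498_706   # X.shape[1] from the dataset above
max_features = 100
bytes_per_float64 = 8

dense_bytes = max_features * n_features * bytes_per_float64
print(f"{dense_bytes / 1e9:.1f} GB per dense (max_features, n_features) matrix")
# ≈ 2.8 GB for a single split's candidate matrix, before any tree state
```

With only a handful of nonzero weights per projection, the sparse index/weight representation needs kilobytes instead.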
Ah I see. That's interesting. I wouldn't expect that to happen. How many trees are you training simultaneously?
For now I am using 100 trees, but I would love to use 1000 trees.
Sorry, I am asking how many jobs you are training in parallel. I.e., if you're training 100 trees in parallel, I am less surprised that you're running out of RAM.
Ah, I was using the default parameters, for which n_jobs=None.
Ah I see... that is then training 1 tree at a time. Can you tell me:
- How deep is one tree?
- If you run `clf.estimators_[0].tree_.get_projection_matrix()`, what does an example projection matrix look like (maybe plot a heatmap?), and what is its shape?
I tried MORF on brain MRI data with X.shape=(2206, 3498706). The server that I used has 754 GB of memory. I used only 1 worker and the code broke. When I try to fit MORF with 100 features, it works.
@adam2392 here is my code snippet. It works for max_patch_dims=(3,3,3).
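For context on what `max_patch_dims=(3,3,3)` implies: a MORF-style 3-D patch selects a contiguous block of voxels whose flattened indices become the nonzero entries of one projection. A minimal sketch (the volume shape and patch origin here are hypothetical; 152³ ≈ 3.5M voxels, roughly the feature count above):

```python
import numpy as np

vol_shape = (152, 152, 152)  # hypothetical volume; 152**3 ≈ 3.5M voxels
patch_dims = (3, 3, 3)
corner = (10, 20, 30)        # arbitrary patch origin

# Enumerate the voxel coordinates inside the patch, then flatten them
# into column indices of the (n_samples, n_features) data matrix.
idx = np.stack(np.meshgrid(
    *[np.arange(c, c + p) for c, p in zip(corner, patch_dims)],
    indexing="ij"), axis=-1).reshape(-1, 3)
flat_idx = np.ravel_multi_index(idx.T, vol_shape)
print(len(flat_idx))  # 27 voxels per (3, 3, 3) patch
```

So each (3,3,3) projection has only 27 nonzeros; larger patch dims grow the nonzero count cubically, which may explain the memory sensitivity.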