sparse / slow

Open jovo opened this issue 1 year ago • 10 comments

@adam2392 When @jdey4 runs MORF, it is slow. When we build the projection matrix in oblique trees, is it stored as a sparse matrix? If not, can we make it one? If the matrix is mostly zeros but stored in a dense format, I believe switching to a sparse format would save a lot of RAM and time.

jovo avatar Feb 21 '24 20:02 jovo

I asked @jdey4 if he could post a GH issue, so I'm unsure how he's running things. It is true that MORF is not very well tested and benchmarked currently.

https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/tree/manifold/_morf_splitter.pyx#L273-L307 shows that the projection matrix is already handled in a sparse format: a vector of feature indices and a vector of weights. Only non-zero weights are stored.
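To make the "indices plus weights" idea concrete, here is a minimal pure-Python sketch. It is not the Cython code from the linked file; the function name `apply_sparse_projection` and the toy data are made up for illustration. It only shows how a projection can be applied when nothing but the non-zero weights and their feature indices are stored.

```python
from typing import List, Sequence


def apply_sparse_projection(
    x: Sequence[float],
    proj_indices: List[List[int]],
    proj_weights: List[List[float]],
) -> List[float]:
    """Project one sample using per-projection index and weight vectors."""
    projected = []
    for indices, weights in zip(proj_indices, proj_weights):
        # Only the listed features contribute; every other weight is
        # implicitly zero and is never stored or touched.
        projected.append(sum(w * x[i] for i, w in zip(indices, weights)))
    return projected


# Hypothetical toy data: 2 projections over a 6-feature sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
proj_indices = [[0, 3], [1, 2, 5]]
proj_weights = [[1.0, -1.0], [1.0, 1.0, 1.0]]
projected = apply_sparse_projection(x, proj_indices, proj_weights)
print(projected)
```

With this layout the storage cost scales with the number of non-zero weights, not with the total number of features.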

adam2392 avatar Feb 21 '24 21:02 adam2392

Possibly something for Edward's team et al. to consider? @jovo

It would be nice to have some measure of performance that we can run as n_samples grows from 100 to >> 100.
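One possible shape for such a measurement, as a rough sketch using only the standard library. `benchmark_fit` and `placeholder_fit` are hypothetical names; the placeholder would be swapped for a function that builds data with `n_samples` rows and fits the estimator of interest. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory allocated directly inside compiled code may not show up in the peak figure.

```python
import time
import tracemalloc
from typing import Callable, Dict, Iterable, List


def benchmark_fit(
    fit: Callable[[int], object], sample_sizes: Iterable[int]
) -> List[Dict[str, float]]:
    """Time `fit(n_samples)` and record peak traced memory for each size."""
    results = []
    for n in sample_sizes:
        tracemalloc.start()
        t0 = time.perf_counter()
        fit(n)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append({"n_samples": n, "seconds": elapsed, "peak_bytes": peak})
    return results


def placeholder_fit(n_samples: int) -> list:
    # Stand-in workload (an assumption, not a real estimator): replace this
    # with code that generates data of this size and calls .fit().
    return [[float(i + j) for j in range(10)] for i in range(n_samples)]


results = benchmark_fit(placeholder_fit, [100, 1_000, 10_000])
for row in results:
    print(row)
```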

adam2392 avatar Feb 22 '24 20:02 adam2392

Hi @adam2392, I used the following code snippet to train SPORF:

    x_train, x_test, y_train, y_test = train_test_split(
        X, y, train_size=0.6, random_state=0, stratify=y)
    clf_sporf = ObliqueRandomForestClassifier(n_estimators=100, max_features=20)

X has shape (2368, 3498706). The code above runs fine and takes about 50 minutes to train. But if I increase max_features to 100, it runs out of RAM (64 GB, Apple M1 Max).
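For scale, a back-of-the-envelope estimate of how much memory a dense array of this shape needs. This is plain arithmetic under the assumption of dense storage; which dtype the library uses internally, and whether it makes extra copies, is not stated in this thread.

```python
# Shape reported above.
n_samples, n_features = 2368, 3498706
n_elements = n_samples * n_features


def gigabytes(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


# Dense storage cost at 4 bytes (float32) and 8 bytes (float64) per element.
x_float32_gb = gigabytes(n_elements * 4)
x_float64_gb = gigabytes(n_elements * 8)
print(f"{n_elements} elements")
print(f"float32: {x_float32_gb:.1f} GB, float64: {x_float64_gb:.1f} GB")
```

A single dense float64 copy of X is already larger than 64 GB, so any additional per-tree or per-node allocation matters a lot at this size.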

jdey4 avatar Feb 23 '24 16:02 jdey4

Ah I see. That's interesting. I wouldn't expect that to happen. How many trees are you training simultaneously?

adam2392 avatar Feb 23 '24 17:02 adam2392

For now I am using 100 trees, but I would love to use 1000 trees.

jdey4 avatar Feb 23 '24 17:02 jdey4

Sorry, I was asking how many jobs you are running in parallel. I.e. if you're training 100 trees in parallel, I am less surprised that you're running out of RAM.

adam2392 avatar Feb 23 '24 17:02 adam2392

Ah, I was using the default parameters, for which n_jobs=None.

jdey4 avatar Feb 23 '24 17:02 jdey4

Ah I see... then that is training 1 tree at a time. Can you tell me:

  1. How deep is one tree?
  2. If you run clf.estimators_[0].tree_.get_projection_matrix(), what does an example projection matrix look like (maybe as a heat map?), and what is its shape?
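As a quick alternative to a heat map, something like the following could summarize how sparse a projection matrix is. This is a sketch with a hypothetical helper, `summarize_projection_matrix`, run here on a toy nested list; it assumes the matrix is a 2-D row-by-row structure with one row per projection.

```python
from typing import Dict, Sequence


def summarize_projection_matrix(matrix: Sequence[Sequence[float]]) -> Dict[str, object]:
    """Report shape, non-zeros per row, and overall density of a 2-D matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    # Count the non-zero weights in each row (one row per projection).
    nnz_per_row = [sum(1 for value in row if value != 0) for row in matrix]
    total = n_rows * n_cols
    return {
        "shape": (n_rows, n_cols),
        "nnz_per_row": nnz_per_row,
        "density": sum(nnz_per_row) / total if total else 0.0,
    }


# Toy stand-in for a projection matrix: 2 projections over 4 features.
toy = [
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 0.0, 0.0, 1.0],
]
summary = summarize_projection_matrix(toy)
print(summary)
```

If get_projection_matrix() returns a 2-D array, it could be passed in the same way, since the helper only relies on len(), indexing, and row iteration.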

adam2392 avatar Feb 23 '24 20:02 adam2392

I tried MORF on brain MRI data with X.shape=(2206, 3498706). The server that I used has 754 GB of memory. I used only 1 worker and the code broke. When I try to fit MORF with 100 features, it works.

jdey4 avatar Mar 07 '24 21:03 jdey4

@adam2392 here is my code snippet. It works for max_patch_dims=(3,3,3).

[Screenshot of the code snippet, 2024-03-07]

jdey4 avatar Mar 07 '24 22:03 jdey4