"Coleman" approach to feature_importances_
The basic idea to get a feature_importances distribution map from the Coleman approach is:
1. train one forest on X
2. train another forest on permuted X
3. compute the per-tree feature_importances array, shape (n_trees, n_features_in), for both forests
4. do the "Coleman" idea and resample trees from both forests M times
You then have a feature-importance null distribution that you can compare against the feature_importances_ array of the first forest from step 1.
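For concreteness, a minimal sketch of these steps, assuming sklearn-style forests (each tree exposes `feature_importances_`) and numpy arrays `X`, `y` already in scope; the resampling here is one plausible reading of step 4, not the sktree implementation linked below:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def tree_importances(forest):
    # stack per-tree importances into a (n_trees, n_features_in) array
    return np.vstack([tree.feature_importances_ for tree in forest.estimators_])

# step 1: forest on the observed data
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# step 2: forest on permuted X (rows shuffled independently of y,
# which breaks the X-y relationship under the null)
X_perm = X[rng.permutation(len(X))]
rf_null = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_perm, y)

# step 3: per-tree importances for both forests
imp = tree_importances(rf)            # (n_trees, n_features_in)
imp_null = tree_importances(rf_null)

# step 4: pool the 2 * n_trees importance rows and resample M times;
# each resample splits the pool into two halves and records the
# difference of their mean importances, giving a null distribution
pooled = np.vstack([imp, imp_null])
M, B = 1000, imp.shape[0]
null_dist = np.empty((M, imp.shape[1]))
for m in range(M):
    idx = rng.permutation(len(pooled))
    null_dist[m] = pooled[idx[:B]].mean(axis=0) - pooled[idx[B:]].mean(axis=0)
```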
Code that builds a Coleman forest approach for doing multivariate hypothesis testing: https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/stats/forestht.py#L1222
cc: @jovo @jdey4
for step 4, i think we just compute the distribution of feature importance under the null, and then, we can compute a p-value for the importance of each feature under the alternative, right?
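(For reference, given a null distribution like `null_dist` from the sketch above, the per-feature p-value could be computed along these lines; illustrative, not the sktree code:)

```python
# observed statistic: mean importance difference between the two forests
obs = imp.mean(axis=0) - imp_null.mean(axis=0)          # (n_features_in,)
# one-sided empirical p-value per feature, with the +1 correction
pvals = (1 + (null_dist >= obs).sum(axis=0)) / (M + 1)
```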
Oh I guess the permuted forest technically gives that(?), but I was assuming you wanted like M forests each with a slightly different feature_importances map constructed from a different collection of trees?
oh, I thought just 1 null forest. We compute feature_importance for all the features.
We need M forests for the p-value computation for two-sample testing, but I don't think we need more than 1 forest for this?
@jovo from the above steps as mentioned by @adam2392, I thought we wanted the distribution of the feature_importance score. But if I understood correctly today, you want the rank, right? I get the rank from the permuted forest, then the rank from the non-permuted forest, and count the number of times each feature ranks higher in the non-permuted forest than in the permuted one? Should I repeat the process several times, or do you want to subsample after training a random forest with a huge number of trees? I repeated the experiment for several reps since the feature dimension is 1.5 million and there is higher variance in a forest with 100 trees.
@jdey4 write pseudocode so we are super clear, then i can quibble with anything i don't like.
Steps:
1. Consider $n$ iid samples $D_n = (x_i, y_i)_{i=1}^n$ and the label-permuted data $\tilde{D}_n$.
2. Train two random forests ($B$ trees each) on $D_n$ and $\tilde{D}_n$: $RF$ and $RF_0$, respectively.
3. Consider a specific feature $F_j$. Calculate its rank from $RF$: $r$.
4. Calculate the rank of $F_j$ from $RF_0$: $r_0$.
5. Calculate $r_0 - r$.
6. Now randomly sample $B$ trees from $\{RF, RF_0\}$: $RF^*$, and denote the rest of the trees as $RF^*_0$.
7. Calculate $r^*_0 - r^*$ as in step 5.
8. Repeat steps 6 and 7 $N_0$ times.
9. Calculate $p = \frac{1}{N_0+1}\left[1 + \sum \mathbb{I}\big((r_0 - r) \leq (r^*_0 - r^*)\big)\right]$.

Thoughts @jovo @adam2392? Should I make a PR somewhere in sktree or first make sure it works in the CMI repo?
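A runnable sketch of these steps for a single feature $F_j$, assuming the rank is derived from the mean per-tree `feature_importances_` of a tree collection; the `rank_of` helper and in-scope `X`, `y` are illustrative assumptions, not sktree API:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def rank_of(trees, j):
    # rank of feature j (1 = most important) under the mean
    # feature importance of the given collection of trees
    imp = np.mean([t.feature_importances_ for t in trees], axis=0)
    order = np.argsort(-imp)              # descending importance
    return int(np.where(order == j)[0][0]) + 1

B, N0, j = 100, 1000, 0

# steps 1-2: D_n and its label-permuted copy, one forest each
rf = RandomForestClassifier(n_estimators=B, random_state=0).fit(X, y)
y_tilde = y[rng.permutation(len(y))]
rf0 = RandomForestClassifier(n_estimators=B, random_state=1).fit(X, y_tilde)

# steps 3-5: observed rank difference
r, r0 = rank_of(rf.estimators_, j), rank_of(rf0.estimators_, j)
observed = r0 - r

# steps 6-8: resample B trees from the pooled 2B trees, N0 times
pooled = rf.estimators_ + rf0.estimators_
null_diffs = np.empty(N0)
for m in range(N0):
    idx = rng.permutation(2 * B)
    rf_star = [pooled[i] for i in idx[:B]]
    rf0_star = [pooled[i] for i in idx[B:]]
    null_diffs[m] = rank_of(rf0_star, j) - rank_of(rf_star, j)

# step 9: p-value with the +1 correction
p = (1 + np.sum(observed <= null_diffs)) / (N0 + 1)
```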