
Variable Importance in Random Forest Analysis

Open jblindsay opened this issue 2 years ago • 4 comments

I believe it is common in Random Forest analyses for variable importance to be reported. For example, variable importance can be determined from the mean decrease in accuracy that occurs when each variable is permuted (or removed), or from the mean decrease in Gini impurity. I may be mistaken, but I do not currently see any means of obtaining this information through the current SmartCore API. I believe this would be a very valuable addition to the library.
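
For illustration, the "mean decrease in accuracy" variant can be computed model-agnostically by shuffling one column at a time and measuring the drop in accuracy. A minimal NumPy sketch (not SmartCore's API; the `permutation_importance` helper and the toy model here are made up for illustration):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean decrease in accuracy when each column is shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X) == y)  # baseline accuracy
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the link between feature j and y
            drops.append(base - np.mean(predict(Xp) == y))
        importances[j] = np.mean(drops)
    return importances

# Toy model: predicts class 1 when feature 0 exceeds 0.5, ignores feature 1.
predict = lambda X: (X[:, 0] > 0.5).astype(int)
rng = np.random.default_rng(42)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)

imp = permutation_importance(predict, X, y)
# Feature 0 is decisive, so imp[0] is large; feature 1 is ignored, so imp[1] is 0.
```

Shuffling (rather than dropping and refitting) keeps the model fixed, which is what makes this cheap enough to run per feature.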

jblindsay avatar Dec 18 '21 23:12 jblindsay

Hi there, does this issue still need a contribution, and may I work on it? @jblindsay @Mec-iS

tushushu avatar Jan 14 '24 09:01 tushushu

In sklearn, the Random Forest feature importance is calculated by the `compute_feature_importances` function (together with a few helpers), which is implemented in Cython:

    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        cdef Node* left
        cdef Node* right
        cdef Node* nodes = self.nodes
        cdef Node* node = nodes
        cdef Node* end_node = node + self.node_count

        cdef float64_t normalizer = 0.

        cdef cnp.float64_t[:] importances = np.zeros(self.n_features)

        with nogil:
            while node != end_node:
                if node.left_child != _TREE_LEAF:
                    # ... and node.right_child != _TREE_LEAF:
                    left = &nodes[node.left_child]
                    right = &nodes[node.right_child]

                    importances[node.feature] += (
                        node.weighted_n_node_samples * node.impurity -
                        left.weighted_n_node_samples * left.impurity -
                        right.weighted_n_node_samples * right.impurity)
                node += 1

        for i in range(self.n_features):
            importances[i] /= nodes[0].weighted_n_node_samples

        if normalize:
            normalizer = np.sum(importances)

            if normalizer > 0.0:
                # Avoid dividing by zero (e.g., when root is pure)
                for i in range(self.n_features):
                    importances[i] /= normalizer

        return np.asarray(importances)
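
To make the logic above concrete outside of Cython, here is a plain-Python re-expression of the same weighted impurity-decrease accumulation over a tiny hand-built tree. The tuple layout and the example numbers are illustrative, not sklearn's actual node struct:

```python
import numpy as np

# Each node: (left, right, feature, impurity, weighted_n_node_samples);
# left == LEAF marks a leaf. Conceptually mirrors sklearn's node array.
LEAF = -1
nodes = [
    (1, 2, 0, 0.42, 100.0),        # root: splits on feature 0
    (LEAF, LEAF, -2, 0.0, 60.0),   # pure left leaf
    (3, 4, 1, 0.375, 40.0),        # internal node: splits on feature 1
    (LEAF, LEAF, -2, 0.0, 30.0),
    (LEAF, LEAF, -2, 0.0, 10.0),
]

def compute_feature_importances(nodes, n_features, normalize=True):
    imp = np.zeros(n_features)
    for left, right, feature, impurity, weight in nodes:
        if left != LEAF:
            l, r = nodes[left], nodes[right]
            # weighted impurity decrease credited to the split feature
            imp[feature] += (weight * impurity
                             - l[4] * l[3]
                             - r[4] * r[3])
    imp /= nodes[0][4]  # scale by total weighted samples at the root
    if normalize:
        s = imp.sum()
        if s > 0:
            imp /= s
    return imp

imp = compute_feature_importances(nodes, n_features=2)
# feature 0: 100*0.42 - 60*0 - 40*0.375 = 27; feature 1: 40*0.375 = 15;
# after normalization: [27/42, 15/42].
```

The same accumulation should port directly to SmartCore's tree representation once the per-node impurity and weighted sample counts are exposed.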

The node impurity is calculated by the function `cdef float64_t node_impurity(self) noexcept nogil`, which has implementations for MSE, MAE, Gini, Poisson, and cross-entropy.
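
The impurity criteria themselves are simple. A minimal sketch of two of them (Gini for classification, MSE for regression) in plain NumPy; these are illustrative helpers, not the sklearn or SmartCore API:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def mse_impurity(y):
    """MSE impurity of a regression node: variance of the targets."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - np.mean(y)) ** 2)

# A pure node has Gini impurity 0; an even two-class split has 0.5.
# Constant targets have MSE impurity 0.
```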

I am going to check how we can implement similar functions in smartcore.

tushushu avatar Jan 19 '24 03:01 tushushu

The way feature importance is calculated is described in this article: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3

tushushu avatar Jan 27 '24 10:01 tushushu

This issue can be split into three sub-tasks. I am currently working on the first one. See https://github.com/tushushu/smartcore/tree/wip-issue-124

  • [x] Implement the feature importance for Decision Tree Classifier
  • [ ] Implement the feature importance for Decision Tree Regressor
  • [ ] Implement the feature importance for Random Forest
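
For the third sub-task: to my understanding, sklearn's forest-level importance is the average of the per-tree (normalized) importances, renormalized at the end. A sketch of that aggregation, assuming per-tree importances are already available (the function name is made up):

```python
import numpy as np

def forest_feature_importances(tree_importances, normalize=True):
    """Average per-tree importance vectors, then renormalize to sum to 1."""
    imp = np.mean(tree_importances, axis=0)
    if normalize:
        s = imp.sum()
        if s > 0:
            imp /= s
    return imp

# Three trees, two features; each row is one tree's normalized importances.
per_tree = np.array([
    [0.7, 0.3],
    [0.5, 0.5],
    [0.6, 0.4],
])
imp = forest_feature_importances(per_tree)
# → [0.6, 0.4]
```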

tushushu avatar Feb 02 '24 11:02 tushushu