SPORF
Add metrics that can deal with imbalanced classes
Gini impurity is a fine metric when your dataset has balanced classes. However, on datasets with heavily imbalanced classes, RerF runs into the same issue as every other random forest algorithm that uses Gini: it just predicts the majority class.
One possible fix is to weight each class when computing the Gini impurity. It can be defined as follows:
where w_i is the weight for each class and n_i is the number of samples belonging to class i in a particular node. w is subject to
One particular way to define w is to set it to the proportion of each class in the dataset. There are probably better ways to deal with imbalanced classes, but this is certainly the simplest.
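To make the idea concrete, here is a minimal sketch of one common class-weighted Gini impurity for a single node. It is not RerF code, and the exact formula and the constraint on w intended above are not reproduced in this text, so the form used here is an assumption: each class count n_i is scaled by w_i, the scaled counts are renormalized into proportions, and the usual Gini formula is applied. The function name `weighted_gini` is hypothetical, and normalizing the weights to sum to 1 is also an assumption (it does not change the result for this particular form).

```python
def weighted_gini(counts, weights):
    """Class-weighted Gini impurity of one node (illustrative sketch).

    counts  -- n_i, number of samples of each class in the node
    weights -- w_i, one non-negative weight per class
    Assumed form: p_i = w_i * n_i / sum_j(w_j * n_j), G = 1 - sum_i(p_i ** 2).
    """
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: normalize weights so they sum to 1.
    norm = [w / weight_sum for w in weights]
    weighted = [w * n for w, n in zip(norm, counts)]
    total = sum(weighted)
    if total == 0:
        return 0.0  # empty node: treat as pure
    return 1.0 - sum((c / total) ** 2 for c in weighted)


# A node with 90 samples of class 0 and 10 samples of class 1.
unweighted = weighted_gini([90, 10], [1.0, 1.0])
# Inverse-frequency weights are used here purely for illustration.
reweighted = weighted_gini([90, 10], [1 / 90, 1 / 10])
print(unweighted, reweighted)
```

With equal weights this reduces to the ordinary Gini impurity. With the illustrative inverse-frequency weights, the 90/10 node scores as if it were balanced, so splits that isolate the minority class stand to reduce impurity more than they would under plain Gini. Which weighting is appropriate for RerF is left open here.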