
Is there a way to add priors to a RF classifier?

Open JoseAF opened this issue 2 years ago • 4 comments

Hi

Does anyone know how to add priors (a priori class probabilities) to the class label values when creating a Random Forest classifier using yggdrasil? I haven't found any interface for priors, something similar to what opencv does here:

https://docs.opencv.org/3.4/d8/d89/classcv_1_1ml_1_1DTrees.html#a66756433f31db77a5511fc3f85403bd9

Many thanks.

JoseAF avatar Jun 13 '22 15:06 JoseAF

Hi Jose,

I am not an expert here. Out of interest, could you describe in a bit more detail what this prior is and how you would use it?

The RF model is "frequentist". The output is the average of the distributions stored in the leaves. Adding a constant distribution to all the leaves (or, equivalently, averaging the model's output distribution with a "prior distribution") is a way to bias the predictions in a certain direction.

Alternatively, the leaf values contain the count of training examples for each class. After training, you can iterate over the leaves and modify those distributions (e.g. add a given number of examples to a given class).
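To make the idea concrete, here is a minimal plain-Python sketch of this post-training adjustment. The leaves are represented as toy dicts of class counts; with a real YDF model you would iterate over the actual tree structure instead, so everything below (the `leaves` data and helper names) is purely illustrative.

```python
# Hypothetical sketch: biasing a trained RF by adding pseudo-counts
# (a prior) to the class-count distribution stored in each leaf.

def leaf_to_proba(counts):
    """Normalize a leaf's class counts into a probability distribution."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def add_prior(leaves, prior_counts):
    """Add the same pseudo-counts to every leaf's class counts."""
    return [
        {c: leaf.get(c, 0) + prior_counts.get(c, 0)
         for c in set(leaf) | set(prior_counts)}
        for leaf in leaves
    ]

# Two toy leaves from a 2-class model.
leaves = [{"C1": 8, "C2": 2}, {"C1": 1, "C2": 9}]

# Adding 5 pseudo-examples of C2 pulls every leaf toward C2.
biased = add_prior(leaves, {"C2": 5})
print(leaf_to_proba(leaves[0]))  # {'C1': 0.8, 'C2': 0.2}
print(leaf_to_proba(biased[0]))  # {'C1': 0.533..., 'C2': 0.466...}
```

The first leaf predicted C1 with probability 0.8 before the adjustment and roughly 0.53 after, so the prior shifts every prediction toward C2 without retraining.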

achoum avatar Jun 17 '22 08:06 achoum

Hi Mathieu

Thanks for the reply.

The interface provided by OpenCV to set priors has the purpose, I think, of increasing the weight of one of the classes, e.g. when I want to minimise false positives. Effectively, the RF then 'gives priority' to getting one of the classes right over the other. As you say, an RF with many trees makes it possible to do something somewhat similar by manipulating the resulting class probabilities. However, when the RF has very few trees (e.g. 2 or 3) this is not good enough. I have implemented bagging with random under-sampling of the less important class, and this gives me something similar to what I wanted, but it's still not ideal (it's less robust, it requires more data manipulation, and it depends on the number of training samples available). I'd much prefer to incorporate the bias within the RF construction itself.

I hope this is a bit clearer...

JoseAF avatar Jun 21 '22 07:06 JoseAF

Hi Jose,

Thanks for the explanation :). (If I understand correctly), I think there might be better ways to achieve the same result. Here are some details:

All the learning algorithms (e.g. Random Forest, Gradient Boosted Trees) are training models that output probabilities. The predicted class is the class with the highest predicted probability. For example, if the output for the two classes C1 and C2 are {proba_C1=0.8, proba_C2=0.2}, the predicted class is C1.

Solution 1 In many production scenarios, we don't care about the most likely class. Instead, we use the probabilities directly. For example, the logic might be: if proba_C1>0.7 do X, else do Y. The threshold 0.7 is generally selected to satisfy some constraint, for example precision at a given recall. If possible, I would try to do that.
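A small self-contained sketch of this threshold selection, assuming binary labels and model scores are already available. The helper name `pick_threshold` and the toy data are made up for illustration:

```python
def pick_threshold(scores, labels, min_precision):
    """Return the smallest decision threshold whose precision meets
    min_precision, together with the precision and recall it achieves.

    scores: predicted probabilities for the positive class.
    labels: 0/1 ground-truth labels.
    """
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        fp = sum(1 for p, y in zip(preds, labels) if p and not y)
        if tp + fp == 0:
            continue  # no positive predictions at this threshold
        precision = tp / (tp + fp)
        recall = tp / sum(labels)
        if precision >= min_precision:
            return t, precision, recall
    return None  # no threshold satisfies the constraint

# Toy example: 5 examples, require perfect precision.
print(pick_threshold([0.1, 0.4, 0.6, 0.8, 0.9],
                     [0, 0, 1, 1, 1],
                     min_precision=1.0))  # (0.6, 1.0, 1.0)
```

In practice you would compute this on a held-out validation set, then hard-code the chosen threshold into the serving logic.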

Solution 2 Another interesting approach is example weighting: some examples might be more important than others (for various reasons). You can provide the weight of each example during training. If a class is more important than another, you can weigh the examples according to their class.

Regarding Random Forests, it is rare to train an RF with so few trees. Such a model likely has large training and quality variance. If speed is important, Gradient Boosted Trees models are generally better.

If Random Forest gives better results than Gradient Boosted Trees (it is common), make sure to disable the winner-takes-all logic (this is one of the hyper-parameters). While there are some downsides, your model's predictions will have more resolution. For example, for binary classification, an RF with 2 trees and winner-takes-all enabled (the default) can only predict the values 0, 0.5 and 1. Without winner-takes-all, more intermediate values can be predicted.
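The resolution difference can be illustrated with a toy sketch. Each "tree" below is represented only by the class distribution of the leaf an example falls into; the `predict` helper and the numbers are made up for illustration and do not reflect YDF internals:

```python
# Toy sketch: positive-class score of a 2-tree binary RF,
# with and without winner-takes-all.

def predict(leaf_dists, winner_takes_all):
    """Combine per-tree leaf distributions into a score for class C1."""
    if winner_takes_all:
        # Each tree casts one vote for its most likely class.
        votes = [max(d, key=d.get) for d in leaf_dists]
        return votes.count("C1") / len(leaf_dists)
    # Otherwise, average the full leaf distributions.
    return sum(d["C1"] for d in leaf_dists) / len(leaf_dists)

# Two trees that disagree on the example.
dists = [{"C1": 0.9, "C2": 0.1}, {"C1": 0.4, "C2": 0.6}]

print(predict(dists, winner_takes_all=True))   # 0.5  (one vote each)
print(predict(dists, winner_takes_all=False))  # 0.65 (averaged probabilities)
```

With voting, a 2-tree model can only output 0, 0.5 or 1; averaging the leaf distributions preserves each tree's confidence, which matters when you want to threshold the score or bias it with a prior.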

achoum avatar Jun 22 '22 06:06 achoum

Hi Mathieu

Thanks a lot for the message. I think what you mention there at the end about disabling winner-takes-all might help me here. At least it doesn't involve manipulating the training data. I'll give it a go!

JoseAF avatar Jun 22 '22 06:06 JoseAF