Probability is missing for FastForest binary classifier
As title
Originally posted by @andrasfuchs in https://github.com/dotnet/machinelearning-modelbuilder/issues/2042#issuecomment-1054694178
@LittleLittleCloud I wonder if it might be related to this dotnet/machinelearning#6087
It turns out the probability is only missing when the trainer is RandomForest. That might be because the calibrator is missing at the end of the pipeline. I'll push a fix for it.
I just took a closer look: although the documentation says FastForest has Probability as an output column, the source code doesn't actually produce it...
@michaelgsharp Could it be a bug in mlnet?
This probably is a bug we will need to investigate. Can you create an issue in the ml.net repo for it?
I'm going to create an issue in ML.NET about the Probability column missing for FastForest. For now, @andrasfuchs, what you can do is manually calibrate the result if the best model is FastForest. The recommended calibrator is Platt:
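```csharp
// Assumes `context` (MLContext), `model` (the trained ITransformer), and the
// `trainData` / `testData` IDataViews already exist.
IDataView scoredTrain = model.Transform(trainData);      // adds Score, but no Probability
var platt = context.BinaryClassification.Calibrators.Platt().Fit(scoredTrain);

IDataView scoredTest = model.Transform(testData);        // testData doesn't have Probability yet
IDataView calibratedTest = platt.Transform(scoredTest);  // now it has Probability
```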
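For anyone who wants to try the workaround end to end, here is a minimal sketch. It assumes the `Microsoft.ML` and `Microsoft.ML.FastTree` packages are referenced; the `Row` class and the synthetic data are hypothetical and only there to make the program self-contained. It prints whether a `Probability` column exists before and after calibration, then evaluates the calibrated output.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema, used only for this illustration.
public class Row
{
    public bool Label { get; set; }

    [VectorType(2)]
    public float[] Features { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var context = new MLContext(seed: 0);

        // Synthetic data: the label mostly depends on the first feature.
        var rng = new Random(0);
        var rows = Enumerable.Range(0, 500).Select(_ =>
        {
            float x = (float)rng.NextDouble();
            float y = (float)rng.NextDouble();
            bool label = x + 0.2 * (rng.NextDouble() - 0.5) > 0.5;
            return new Row { Label = label, Features = new[] { x, y } };
        }).ToList();

        IDataView data = context.Data.LoadFromEnumerable(rows);
        var split = context.Data.TrainTestSplit(data, testFraction: 0.2);

        // Train FastForest with its default column names ("Label", "Features").
        var model = context.BinaryClassification.Trainers.FastForest().Fit(split.TrainSet);

        // Per this thread, the scored output is expected to lack Probability.
        IDataView scoredTest = model.Transform(split.TestSet);
        Console.WriteLine(
            $"Probability before calibration: {scoredTest.Schema.GetColumnOrNull("Probability").HasValue}");

        // Fit the Platt calibrator on the scored training data, as suggested above.
        var platt = context.BinaryClassification.Calibrators.Platt()
            .Fit(model.Transform(split.TrainSet));
        IDataView calibratedTest = platt.Transform(scoredTest);
        Console.WriteLine(
            $"Probability after calibration: {calibratedTest.Schema.GetColumnOrNull("Probability").HasValue}");

        // Evaluate (the calibrated variant) reads the Probability column.
        var metrics = context.BinaryClassification.Evaluate(calibratedTest);
        Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:F3}, LogLoss: {metrics.LogLoss:F3}");
    }
}
```

If you only have the uncalibrated output, `EvaluateNonCalibrated` can be used instead of `Evaluate`, since it does not need the `Probability` column.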
@LittleLittleCloud This is very useful. I have a follow-up question.
Q: The sample has var platt = context.BinaryClassification.Calibrators.Platt().Fit(**trainData**);
I understand this is just a syntax sample, so: is it best to use all data, train data, or test data for fitting the calibrator?
Are there any concerns with fitting the calibrator on partial data? For example, in a biometric application, can we run the calibration for a specific person only? I.e. train the model with all data, then calibrate for each person separately.
> is it best to use all data, train data, or test data for fitting the calibrator?

I would suggest using train data only. Otherwise there will be data leakage (for example, the number of true/false labels), given how Platt is implemented.
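To make the label-count point concrete, here is Platt scaling as described in Platt's original formulation. Treat this as background, not as a statement about ML.NET's exact code: the sign convention and whether ML.NET uses these exact smoothed targets are assumptions. With $s$ the raw model score, $N_+$ the number of positive examples and $N_-$ the number of negative examples in the data the calibrator is fitted on:

$$
P(y = 1 \mid s) = \frac{1}{1 + \exp(A s + B)}, \qquad
t_+ = \frac{N_+ + 1}{N_+ + 2}, \qquad
t_- = \frac{1}{N_- + 2}
$$

$A$ and $B$ are chosen to minimize the cross-entropy between $P(y = 1 \mid s_i)$ and the targets $t_i \in \{t_+, t_-\}$. Because both the fitted parameters and the targets depend on the labels and label counts of the fitting data, fitting on test data would let test labels influence the probabilities you later evaluate on that same data.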
> Are there any concerns if running the calibrator for partial data?

I think it's fine, but be aware that in that case you are changing the calibrator's prior distribution from the entire training sample to a subset of it. As long as your test dataset matches that subset's distribution, you should be fine.
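A per-person calibrator could be sketched roughly like this. Everything about the schema is an assumption: `PersonId` is a hypothetical numeric (`Single`) column holding integer-valued IDs, and `FitForPerson` is an illustrative helper name, not an ML.NET API. The filter keeps rows whose `PersonId` falls in `[personId, personId + 1)`.

```csharp
using Microsoft.ML;

public static class SubsetCalibration
{
    // Hypothetical helper: fits a Platt calibrator on one person's rows only.
    // Assumes trainData has a numeric "PersonId" column with integer-valued IDs
    // and that `model` was trained on the full training set.
    public static ITransformer FitForPerson(
        MLContext context, ITransformer model, IDataView trainData, float personId)
    {
        // Score the full training set first; input columns such as PersonId are kept.
        IDataView scored = model.Transform(trainData);

        // Keep only this person's rows (lower bound inclusive, upper bound exclusive).
        IDataView personRows = context.Data.FilterRowsByColumn(
            scored, "PersonId", lowerBound: personId, upperBound: personId + 1);

        // Fit the calibrator on the subset, so its prior reflects this person only.
        return context.BinaryClassification.Calibrators.Platt().Fit(personRows);
    }
}
```

The returned transformer would then be applied only to that person's scored test rows. A subset that is very small or contains only one class is unlikely to give a meaningful calibration, so it may be worth checking the label counts per person first.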
Adding a calibrator to FastForestBinaryTrainer would be a breaking change because it changes the base class. Should the documentation be updated to reflect the current status and provide steps to calibrate instead?