machinelearning Probability is missing for FastForest binary classifier

As title

Originally posted by @andrasfuchs in https://github.com/dotnet/machinelearning-modelbuilder/issues/2042#issuecomment-1054694178

Mar 01 '22 17:03 LittleLittleCloud

@LittleLittleCloud I wonder if it might be related to this dotnet/machinelearning#6087

Mar 08 '22 19:03 luisquintanilla

It turns out the Probability column is only missing when the trainer is RandomForest (FastForest). This might be because there is no calibrator at the end of the pipeline. I'll push a fix for it.

Mar 08 '22 21:03 LittleLittleCloud

I just took a closer look. According to the documentation, FastForest has Probability as an output column, but in the source code it doesn't...

@michaelgsharp Could it be a bug in mlnet?

Mar 08 '22 22:03 LittleLittleCloud

This probably is a bug we will need to investigate. Can you create an issue in the ml.net repo for it?

Mar 08 '22 22:03 michaelgsharp

I'm going to create an issue in mlnet about the Probability column missing for fast forest. For now, @andrasfuchs, what you can do is manually calibrate the result if the best model is fast forest. The recommended calibrator is Platt:

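// Assumes context is an MLContext, model is the trained FastForest pipeline,
// and trainData/testData are already loaded.
IDataView trainData, testData;
ITransformer model;
trainData = model.Transform(trainData); // score the training data (adds Score)
var platt = context.BinaryClassification.Calibrators.Platt().Fit(trainData);

testData = model.Transform(testData); // testData doesn't have probability
testData = platt.Transform(testData); // now it has probability!!!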

Mar 08 '22 22:03 LittleLittleCloud

@LittleLittleCloud This is very useful. I have a follow-up question. The sample has var platt = context.BinaryClassification.Calibrators.Platt().Fit(trainData);

Since it is just a syntax sample, is it best to use all data, train data, or test data for fitting the calibrator?

Are there any concerns if running the calibrator for partial data? For example, in a biometric application, can we run the calibration for a specific person only? That is, train the model with all data, then calibrate for each person separately.

Mar 22 '22 08:03 torronen

> is it best to use all data, train data, or test data for fitting the calibrator?

I would suggest using train data only. Given how Platt is implemented, fitting on anything else would cause data leakage (for example, the number of true/false labels would leak into the calibrator).
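A minimal sketch of that advice, assuming a hypothetical allData view with a boolean Label column and a float-vector Features column (the class name, method name, and 0.2 test fraction are illustrative only, not from this thread). It fits both the model and the Platt calibrator on the training split and only then touches the test split:

```csharp
// Requires the Microsoft.ML and Microsoft.ML.FastTree NuGet packages.
using Microsoft.ML;

public static class CalibrateOnTrainOnly
{
    // allData is a hypothetical IDataView with a boolean "Label" column
    // and a float-vector "Features" column.
    public static void Run(MLContext context, IDataView allData)
    {
        // Hold out a test set that neither the model nor the calibrator sees.
        var split = context.Data.TrainTestSplit(allData, testFraction: 0.2);

        // FastForest emits Score and PredictedLabel, but no Probability.
        ITransformer model = context.BinaryClassification.Trainers
            .FastForest(labelColumnName: "Label", featureColumnName: "Features")
            .Fit(split.TrainSet);

        // Fit the Platt calibrator on scored training rows only.
        IDataView scoredTrain = model.Transform(split.TrainSet);
        ITransformer platt = context.BinaryClassification.Calibrators
            .Platt(labelColumnName: "Label", scoreColumnName: "Score")
            .Fit(scoredTrain);

        // Apply the model, then the calibrator, to the untouched test set.
        IDataView scoredTest = platt.Transform(model.Transform(split.TestSet));

        // The calibrated Evaluate overload reads the Probability column.
        var metrics = context.BinaryClassification.Evaluate(scoredTest, labelColumnName: "Label");
        System.Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:F3}, LogLoss: {metrics.LogLoss:F3}");
    }
}
```

Because the test rows are only ever passed through Transform, their label counts cannot influence the calibrator's parameters.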

> Are there any concerns if running the calibrator for partial data?

I think it's fine, but be aware that in that case you are changing the calibrator's prior distribution from the entire training sample to a subset of it. As long as your test dataset also matches that distribution, you should be fine.
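To make the point about the prior concrete, here is the standard Platt scaling formulation from Platt's original method. Treating it as a description of ML.NET's internals is an assumption; the symbols $s$, $A$, $B$, $N_+$ and $N_-$ are notation introduced here, not names from the library:

```latex
% Calibrated probability as a sigmoid of the raw score s
P(y = 1 \mid s) = \frac{1}{1 + \exp(A s + B)}

% A and B are fit by maximum likelihood against smoothed targets that
% depend on the label counts N_+ and N_- of the rows used for fitting
t_+ = \frac{N_+ + 1}{N_+ + 2}, \qquad t_- = \frac{1}{N_- + 2}
```

Both the fitted offset $B$ and the targets $t_\pm$ depend on how many positive and negative rows the calibrator sees. Fitting on one person's rows therefore ties the output probabilities to that person's class balance, which is only appropriate if the data scored later has a similar balance.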

Mar 22 '22 17:03 LittleLittleCloud

Adding a calibrator to FastForestBinaryTrainer would be a breaking change because its base class would have to change. Should the documentation be updated to reflect the current status and provide steps to calibrate manually instead?

Oct 29 '22 06:10 FranklinWhale