
Probability is missing for FastForest binary classifier

Open LittleLittleCloud opened this issue 3 years ago • 7 comments

As title

Originally posted by @andrasfuchs in https://github.com/dotnet/machinelearning-modelbuilder/issues/2042#issuecomment-1054694178

LittleLittleCloud avatar Mar 01 '22 17:03 LittleLittleCloud

@LittleLittleCloud I wonder if it might be related to this dotnet/machinelearning#6087

luisquintanilla avatar Mar 08 '22 19:03 luisquintanilla

Turns out the probability is only missing when the trainer is RandomForest. It might be because of a missing calibrator at the end of the pipeline. I'll push a fix for it.

LittleLittleCloud avatar Mar 08 '22 21:03 LittleLittleCloud

Just took a closer look, and it seems that although the documentation says FastForest has Probability as an output column, the source code doesn't actually produce it...

@michaelgsharp Could it be a bug in mlnet?

LittleLittleCloud avatar Mar 08 '22 22:03 LittleLittleCloud

This probably is a bug we will need to investigate. Can you create an issue in the ml.net repo for it?

michaelgsharp avatar Mar 08 '22 22:03 michaelgsharp

I'm going to create an issue in ML.NET about the Probability column missing for FastForest. For now, @andrasfuchs, what you can do is manually calibrate the result if the best model is FastForest. The recommended calibrator is Platt:

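// Assumes context (MLContext), model (the trained ITransformer),
// and trainData / testData (IDataView) already exist.
trainData = model.Transform(trainData); // score the training data with the trained model
var platt = context.BinaryClassification.Calibrators.Platt().Fit(trainData); // fit the calibrator on Label and Score

testData = model.Transform(testData); // testData has Score but no Probability yet
testData = platt.Transform(testData); // now it has a Probability column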

LittleLittleCloud avatar Mar 08 '22 22:03 LittleLittleCloud

@LittleLittleCloud This is very useful. I have a follow-up question. Q: The sample has var platt = context.BinaryClassification.Calibrators.Platt().Fit(**trainData**);

It is just a syntax sample so is it best to use all data, train data, or test data for fitting the calibrator?

Are there any concerns if running the calibrator for partial data? For example, in a biometric application can we run the calibration for a specific person only? I.e. train model with all data, then calibrate for each person separately.

torronen avatar Mar 22 '22 08:03 torronen

> is it best to use all data, train data, or test data for fitting the calibrator?

I would suggest using the training data only. Fitting on anything else causes data leakage (for example, the counts of true/false labels feed into the fit), given how Platt is implemented.

> Are there any concerns if running the calibrator for partial data?

I think it's fine, but be aware that in that case you are changing the calibrator's prior distribution from the entire training sample to a subset of it. As long as your test dataset also matches that distribution, you should be fine.
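
For context, here is a rough sketch of the standard Platt scaling formulation, which shows why the label counts and the prior matter. This is Platt's published method written in my own notation ($s$ is the raw Score, $A$ and $B$ are the fitted parameters, $N_+$ and $N_-$ are the numbers of positive and negative examples in the calibration data). I'm assuming ML.NET's Platt calibrator follows the same scheme, so treat it as an illustration rather than a description of the exact implementation.

```latex
% Calibrated probability as a sigmoid of the raw score s
P(y = 1 \mid s) = \frac{1}{1 + \exp(A s + B)}

% Smoothed targets, which depend on the label counts of the calibration data
t_+ = \frac{N_+ + 1}{N_+ + 2}, \qquad t_- = \frac{1}{N_- + 2}

% A and B minimize the cross-entropy against those targets
(A, B) = \arg\min_{A, B} \; -\sum_i \bigl[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \bigr],
\qquad p_i = \frac{1}{1 + \exp(A s_i + B)}
```

Since $t_+$ and $t_-$ come from $N_+$ and $N_-$, fitting on one person's data shifts the targets toward that person's class balance, which is the change of prior mentioned above.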

LittleLittleCloud avatar Mar 22 '22 17:03 LittleLittleCloud

Adding a calibrator to FastForestBinaryTrainer would be a breaking change, because it changes the base class. Should the documentation be updated to reflect the current status and provide steps to calibrate manually instead?

FranklinWhale avatar Oct 29 '22 06:10 FranklinWhale