machinelearning
machinelearning copied to clipboard
Exception Retrieving PFI from AutoML Model
System Information (please complete the following information):
- Windows 11
- ML.NET Version 1.7, AutoML 0.19.0
- .NET Version: .NET Core 3.1
Describe the bug ArgumentNullException while attempting to retrieve PFI The model provided does not have a compatible predictor (Parameter 'lastTransformer')
Stack trace
at Microsoft.ML.Runtime.Contracts.CheckValue[T](IExceptionContext ctx, T val, String paramName, String msg)
at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance[TMetric,TResult](IHostEnvironment env, ITransformer model, IDataView data, Func1 resultInitializer, Func
2 evaluationFunc, Func3 deltaFunc, Int32 permutationCount, Boolean useFeatureWeightFilter, Nullable
1 numberOfExamplesToUse)
at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance(RegressionCatalog catalog, ITransformer model, IDataView data, String labelColumnName, Boolean useFeatureWeightFilter, Nullable`1 numberOfExamplesToUse, Int32 permutationCount)
To Reproduce
Not sure how best to create a minimal reproducible example including data, but here's the core of what I did (based on https://github.com/dotnet/machinelearning/pull/5934)
MLContext mlContext = new MLContext();
ColumnInferenceResults columnInference = mlContext.Auto().InferColumns(trainDataPath, "Label", groupColumns: false);
TextLoader textLoader = mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
IDataView trainDataView = textLoader.Load(trainDataPath);
IDataView testDataView = textLoader.Load(testDataPath);
IEstimator<ITransformer> preFeaturizer =
mlContext.Transforms.Categorical.OneHotEncoding(
new T().GetOneHotInputColumnNames().Select(_ => new InputOutputColumnPair(_)).ToArray()
);
ColumnInformation columnInformation = columnInference.ColumnInformation;
columnInformation.IgnoredColumnNames.AddIfMissing("Foo");
columnInformation.CategoricalColumnNames.Remove("Foo");
columnInformation.NumericColumnNames.Remove("Foo");
BinaryExperimentSettings experimentSettings = new BinaryExperimentSettings()
{
MaxExperimentTimeInSeconds = experimentTime,
OptimizingMetric = BinaryClassificationMetric.F1Score
};
var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(experimentSettings);
ExperimentResult<BinaryClassificationMetrics> experimentResult = experiment.Execute(trainDataView, columnInformation, preFeaturizer, progress);
RunDetail<BinaryClassificationMetrics> bestRun = experimentResult.BestRun;
// Exception thrown here
var permutationFeatureImportance = mlContext
.Regression
.PermutationFeatureImportance(bestRun.Model, testDataViewWithBestScore, permutationCount: 3);
The LastTransformer
of bestRun.Model
is BinaryPredictionTransformer<Microsoft.ML.Calibrators.CalibratedModelParametersBase<Microsoft.ML.Trainers.lightGbm.LightGbmBinaryModelParameters, ...>>
Expected behavior Retrieve the PFI information.
Hey @ericjohannsen,
This issue was actually fixed just yesterday by pr #6085. You can try building the latest from main and see if that fixes your issue, or you can the latest preview package here, https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json.
Closing this issue for now as it should be resolved, but please re-open the issue if it isn't resolved.
Is there any further details on the preview package route or any estimate when a released package might be available? I added that JSON as a NuGet package source, but the latest available version of AutoML and Microsoft.ML from that source match the latest on NuGet and both have this issue as well.
I had hoped to include explainability in a user group talk / workshop I'm giving Thursday on AutoML, but it looks like that might be too optimistic if I wanted folks to be able to follow along.
@michaelgsharp @IntegerMan I was able to reference Microsoft.ML 2.0.0-preview.22115.2 and Microsoft.ML.AutoML 0.20.0-preview.22115.2 using the source on Azure, but still encounter the same problem.
<PackageReference Include="Microsoft.ML" Version="2.0.0-preview.22115.2" />
<PackageReference Include="Microsoft.ML.AutoML" Version="0.20.0-preview.22115.2" />
@michaelgsharp Any word on this?
In the preview version instead of passing in bestRun.Model.LastTransformer
, you should just pass in bestRun.Model
. The PFI code should automatically pull out the transformer that it needs. Can you try doing that and let me know what happens? I'll repoen this issue for now.
We are planning on doing a servicing release which will include this bug fix next week as well.
@michaelgsharp I was already passing in bestRun.Model
var bestModel = bestRun.Model;
var permutationFeatureImportance = mlContext
.Regression
.PermutationFeatureImportance(bestRun.Model, testDataViewWithBestScore, permutationCount: 3);
This issue has been automatically marked no-recent-activity
because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity
.
@ericjohannsen the servicing release is out with the fix for this. Can you try with the latest 1.7.1 and let me know if it solves your issue?
@michaelgsharp No joy with 1.7.1. If it helps:
Let me know what other information might be helpful diagnosing this.
Hmm.. Could you share a couple lines of the data you used so that I can run it on my end and see what I can find?
@michaelgsharp Sure. I changed the headers to obfuscate the domain. Not sure whether it matters, but I'm using cross-validation.
Alright, figured it out. There are 2 things that need to change and both are in the call to PFI.
The first is that you are calling PFI in the Regression
catalog, but you are using a Binary trainer. You need to make sure the catalog you use for PFI and the trainer are the same. In this case, PFI in the binary catalog has a slightly different name.
The second is that PFI re-runs your trainer on the data. This means that the data needs to have the correct columns for the trainer. The easiest way to do this is just to use your pipeline to transform the data you pass into PFI. (Your data may already be correct. I didn't have access to see exactly what your testDataViewWithBestScore
actually is, so you can try it without it to see, but if it doesn't work you will have to transform it).
var permutationFeatureImportance = mlContext
.BinaryClassification
.PermutationFeatureImportanceNonCalibrated(bestRun.Model, bestRun.Model.Transform(testDataViewWithBestScore), permutationCount: 3);
```
Try that out and let me know how it goes.
@michaelgsharp Still seeing the issue. I've attached a minimal reproducible example MlMre.zip .
Alright @ericjohannsen. Good news is that I have figured out the issue and why I had it working for me before and how to workaround it in your case. Bad news is that it means there is another bug in how we are extracting the transformer.
In my original testing I didn't have the pre-featurizer that you do because it had a Type T and I didn't know what it was so I just skipped it. This caused bestRun.Model
to return a single transformer chain of normal transformers. In your code and the new example you sent there is a pre-featurizer. AutoML is concatenating the pre-featurizer pipeline with the model they generate. This causes bestRun.Model
to return a transformer chain but this time the last "transformer" in the chain is another transformer chain instead of a normal transformer. We aren't doing recursive checking to find the transformer (100% my mistake as I didn't even think of this scenario), so its missing the transformer. I'll get this fixed.
In your case here is how to workaround it.
First remove the extra transform call on line 75. You can directly use testDataViewWithBestScores
and don't need to do this
var transformedBestScore = bestRun.Model.Transform(testDataViewWithBestScore);
.
Finally, we need to extract the transformer chain that is nested inside bestRun.Model. If you use this cast it will get it correctly. (bestRun.Model as TransformerChain<ITransformer>).LastTransformer
All in all the final lines should look like this:
// Evaluate test data
IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
// Feature Importance (PFI) https://github.com/dotnet/machinelearning/issues/6084
Console.WriteLine("Get most important features");
var permutationFeatureImportance = mlContext
.BinaryClassification
.PermutationFeatureImportanceNonCalibrated((bestRun.Model as TransformerChain<ITransformer>).LastTransformer, testDataViewWithBestScore, permutationCount: 3);
Can you try this and make sure its working for you? In this example PFI should run pretty quick, but with lots of features it does take some time to run so just be prepared for that.
Success!
@michaelgsharp Thank you for sticking with this, and for the detailed explanation.
Side note: The PFI calculation took quite a while on my Ryzen 9 3900 system, and I noticed only one CPU core is utilized.
Only 1 core is used? I haven't looked too much into how we are actually doing the calculations, but it seems like there is potential for some serious optimizations there.
I'm going to rename this issue to track fixing the recursive parsing issue for PFI. I'm glad we were able to get this solved!
@michaelgsharp are we okay to close this issue?