Suggestion: Make TransformInference public
Creating pipelines for big datasets can be complex. Datasets may have hundreds or thousands of columns, and the column types may vary. Even when a column is loaded as text, it may actually hold booleans, categories, short sentences, or long paragraphs. Developers would benefit from tools that help build the pipeline based on rules that analyse the contents.
AutoML already has tooling for automating this kind of inference. However, TransformInferenceApi is internal to Microsoft.ML.AutoML. ColumnInference is public, but using it without TransformInference seems difficult.
- Could this TransformInferenceApi be public? https://github.com/dotnet/machinelearning/blob/04dda55ab0902982b16309c8e151f13a53e9366d/src/Microsoft.ML.AutoML/TransformInference/TransformInferenceApi.cs
- Or, is there another way to do transform inference from an application referencing the NuGet package?
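For context, this is roughly how far the public API goes today: column inference is reachable, but the step that would turn the inferred ColumnInformation into a suggested transform pipeline is not. A minimal sketch, with the file path and label column name as placeholders:

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

// Public today: infer loader options and column purposes from the data file.
// "data.csv" and "Label" are placeholders.
ColumnInferenceResults columnInference =
    mlContext.Auto().InferColumns("data.csv", labelColumnName: "Label");

var loader = mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
IDataView data = loader.Load("data.csv");

// Internal today: turning columnInference.ColumnInformation into a suggested
// transform pipeline (TransformInferenceApi), so from here the pipeline has
// to be assembled by hand for hundreds or thousands of columns.
```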
Because the methods are public but the type is internal, I suppose there was some discussion and a reason for this choice. However, I could not find it on GitHub, so maybe it is possible to reconsider. For now, I will make the type and related classes public in a private build, and I'll update here if I run into any issues.
Related: TransformInference.Experts could benefit from an extension point. For example, a developer could register additional "transform experts" for a specific dataset, or provide improved general-purpose experts. (In general, more extension points would mean less need for custom builds.)
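A hypothetical sketch of what such an extension point could look like. None of these types or members exist in Microsoft.ML.AutoML today; the names are purely illustrative:

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative only: a developer-supplied "expert" that proposes transforms
// for columns it recognizes, registered alongside the built-in experts.
public interface ITransformExpert
{
    // Return suggested estimators for the column, or an empty sequence if
    // this expert does not apply to it.
    IEnumerable<IEstimator<ITransformer>> Suggest(
        MLContext mlContext, string columnName, DataViewType columnType);
}

// Intended usage (hypothetical):
// transformInference.Experts.Add(new MyTextColumnExpert());
```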
@JakeRadMSFT @luisquintanilla what do you guys think? With the updates we are planning/proposing to automl how would that affect this?
The current TransformInference may be unnecessarily complex, so it might be best to rewrite transform inference to support the new tuners:
- Support adding and removing "experts"
- Support testing multiple configurations as part of hyperparameter sweeping.
Option A: experts take parameters (e.g. isEnabled, Threshold...)
Option B: the user creates multiple pipelines and the pipeline index is the parameter passed to the tuner:
pipelinesToExperiment[pipelineSelectionParameter]
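A minimal sketch of Option B, assuming the tuner can propose an integer index over caller-supplied pipelines. `pipelinesToExperiment` and `pipelineSelectionParameter` are illustrative names, not existing AutoML API, and the column names are Titanic-style placeholders:

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// The caller builds the candidate featurization pipelines up front.
IEstimator<ITransformer>[] pipelinesToExperiment =
{
    // Candidate 0: one-hot encode the categorical column.
    mlContext.Transforms.Categorical.OneHotEncoding("Sex")
        .Append(mlContext.Transforms.Concatenate("Features", "Sex", "Age", "Fare")),

    // Candidate 1: featurize the free-text column instead.
    mlContext.Transforms.Text.FeaturizeText("NameFeatures", "Name")
        .Append(mlContext.Transforms.Concatenate("Features", "NameFeatures", "Age", "Fare")),
};

// The tuner proposes an index in [0, pipelinesToExperiment.Length); each
// trial trains and evaluates the pipeline at that index.
int pipelineSelectionParameter = 0; // value supplied by the sweep
IEstimator<ITransformer> trialPipeline = pipelinesToExperiment[pipelineSelectionParameter];
```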
Reasons:
- Hundreds or thousands of columns need at least one transform to put them inside the Features column. Are there other ways to do this currently? If not, this makes experimentation with different datasets slow; if there are, better docs are needed.
- The pipeline may have a big impact. For example, using OneHotEncoding could lower performance from 99% to 90% instead of allowing LightGBM to use its built-in handling of raw (categorical) numbers (Titanic dataset); see the sketch below.
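A sketch of the two featurization choices behind that gap, using Titanic-style column names as placeholders and assuming the Microsoft.ML.LightGbm package is referenced; the accuracy numbers are the ones reported above, not something this snippet measures:

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Variant 1: one-hot encode the categorical column before concatenation.
var oneHotPipeline = mlContext.Transforms.Categorical.OneHotEncoding("Pclass")
    .Append(mlContext.Transforms.Concatenate("Features", "Pclass", "Age", "Fare"))
    .Append(mlContext.BinaryClassification.Trainers.LightGbm(labelColumnName: "Survived"));

// Variant 2: pass the raw numeric code straight through so the tree learner
// can split on it directly.
var rawCodePipeline = mlContext.Transforms.Concatenate("Features", "Pclass", "Age", "Fare")
    .Append(mlContext.BinaryClassification.Trainers.LightGbm(labelColumnName: "Survived"));

// A transform inference step that always picks Variant 1 would bake in the
// accuracy gap described above, with no way to sweep over the alternative.
```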