
AutoML solutions review


Premise:

We have multiple models and feature generators to use, and we don't know which model or features to choose. We have limited CPU and time resources. We have only a target metric that we want to optimize. We have multiple nodes to run on.

Aim:

A library-level instrument to find the best possible forecasting approach under the constraints defined above.

Checklist:

  • time series support
    • Yes/No
    • How does it work?
  • distributed
    • Yes/No
    • How is it implemented, and what is used for job scheduling and running?
    • Can it work in detached mode, or must workers keep a permanent connection to the master node?
  • feature engineering/selection
    • Yes/No
    • How is the feature space generated?
    • Do we optimize the full global space of chosen features, models and hyperparameters at once, or proceed in a waterfall manner: choose features with some method, then tune model parameters?
  • fault tolerance
    • What happens if some trials raise errors, for example?
    • Can we restart a job from the stopped point, or must we start optimization from the beginning?
  • as a service
    • Can we expose only an API and run jobs via requests?

Table:

| library | time series support | distributed | feature engineering/selection | fault tolerance | as a service | comments |
|---|---|---|---|---|---|---|
| Flaml | notebook<br>1. Regression models with lags (hcrystalball-backed) + TS-specific models like Prophet and SARIMA<br>2. No multiple series support | ray as backend | Feature selection ♾️ issue<br>Hardcoded feature generation ref | flaml.tune has a max_failure parameter, but it doesn't seem to work as expected | 🚫 | 1. Zero-shot AutoML: features for the metalearner: Dataset, NumberOfInstances, NumberOfFeatures, NumberOfClasses, PercentageOfNumericFeatures<br>2. Additional algorithms for HPO<br>3. Predefined low_cost_init_value in search_space() of models for cost-efficient search |
| h2o | 🚫 There is no classical TS support; you can work with time series only via a tabular approach | Spark, local threads | There is no special auto feature engineering in the free version. ref | Resume support after failure, and caching | REST API and H2O Flow | 1. Predefined grid for hyperparameter search<br>2. H2O is more of a black-box service. You can't extend it easily (it uses Java, for example), but it works the same way whatever the data size: if you have a Hadoop or Spark cluster, it will automatically compute the job there -- all infrastructure, data sources and algorithms are already implemented for distributed computing |
| evalml | example<br>1. No multiple series support<br>2. ARIMA, Prophet and regression models | dask backed or multiprocessing ref | You should preprocess the data by hand with special transforms, e.g. TimeSeriesImputer. Once tuning starts, TimeSeriesFeaturizer handles feature generation: it creates lags and selects important lags via correlations. Full DAG of generation | callbacks for error handling | 🚫 | DAG of feature generation pipeline |
| PyCaret | models - predefined pipelines with preprocessing and feature generation<br>1. Based on sktime<br>2. No multiple series support | via fugue (dask + spark) | you can use setup flags for preprocessing and feature selection | 1. There is a fitted-transform cache<br>2. It seems there are no parameters for error-handling settings | 🚫 | 1. After best-model selection you can tune hyperparameters via tune_model. Some models seem to have a predefined grid in the tune_grid attribute, but in most cases you have to specify the grid via the custom_grid parameter<br>2. Beautiful docs via GitBook, and a cheatsheet |
| LightAutoML | 🚫 | multiprocessing | 1. There are predefined pipelines - example<br>2. You can customize pipelines with feature selection steps or add new features via library classes | There are no special flags or memoization | 🚫 | |
| MLJAR | 🚫 | multiprocessing | | | 🚫 | 1. Good documentation and README.md<br>2. Four training modes: Explain, Optuna, Compete, Perform |
| TPOT | 🚫 | dask backed | feature construction from predefined configs and advanced feature selection modes | warm_start and memoization - you can continue progress after Ctrl+C | 🚫 | Good docs |
| microsoft/nni | 🚫 | Multiple ways to distribute tasks | There are some classes for feature selection, but they are not integrated into the general pipeline; they live alone | You can resume an experiment; all artifacts are persisted to disk | there is a web interface for experiment monitoring | It's more about experiment tracking, with multiple extensions such as HPO. It could be considered an open-source alternative to wandb. There are no classical AutoML dependencies, but you can build your own AutoML on top of the NNI infrastructure |
| auto-sklearn | 🚫 | dask backed | BO on a predefined feature+model space; the starting point is determined by knowledge of performance on OpenML datasets | there is no checkpointing or reload | 🚫 | 1. You can add custom pipelines or limit the search space at feature/model granularity<br>2. This library is more about optimal search algorithms than customisability and extensibility: the team developed the SMAC algorithm, so it should be interesting to read their papers first of all |
| autogluon | ✅ Multiple series support, based on sktime and gluonts models | Ray backed (only for HPO) | For TS - no. For tabular data: numerical features get no extra generation or preprocessing; categorical and datetime types get extra transformations | TimeSeriesPredictor doesn't have any checkpointing or explicit error handlers | 🚫 | 1. Good integration with gluon-ts |
| FEDOT | ✅ multi-target or one-dimensional time series; it seems there is no multiple series support | multiprocessing or proprietary cloud | Dynamic graph optimisation via genetic algorithms; much in common with autosktime and TPOT | Results caching in history_folder, but it seems there is no support for continuing a fit | 🚫 | 1. DAG and genetic algorithms<br>2. Bad documentation |
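
To make the rows above concrete, a few minimal API sketches follow. These are untested sketches based on each library's documented entry points around the time of writing; dataset contents, file names and column names are placeholders. First, FLAML's ts_forecast task, where period is the forecast horizon:

```python
# Minimal FLAML time-series sketch: the first column of `dataframe` must be the
# datetime column; `period` is the forecast horizon. Placeholder data throughout.
import pandas as pd
from flaml import AutoML

df = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=120, freq="D"),
    "y": range(120),
})
automl = AutoML()
automl.fit(
    dataframe=df,
    label="y",
    task="ts_forecast",    # time-series forecasting task
    period=12,             # forecast horizon
    time_budget=60,        # seconds of search
    eval_method="holdout",
)
```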
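h2o's Python client drives the same cluster that the REST API and H2O Flow (from the "as a service" column) expose; a sketch with a placeholder CSV and target column:

```python
# Minimal H2OAutoML sketch; "train.csv" and "target" are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                   # starts/attaches to a cluster; the REST API sits behind it
train = h2o.import_file("train.csv")
aml = H2OAutoML(max_runtime_secs=300, seed=1)
aml.train(y="target", training_frame=train)  # all remaining columns are used as features
print(aml.leaderboard)
```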
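evalml's time-series mode is driven by problem_configuration, which is also where TimeSeriesFeaturizer takes its lag window (max_delay); a sketch with placeholder columns:

```python
# Minimal evalml time-series sketch; the frame must carry the time index column.
import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=100, freq="D"),
    "feature": range(100),
})
y = pd.Series(range(100), dtype="float64")

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="time series regression",
    problem_configuration={
        "time_index": "date",   # datetime column in X
        "gap": 0,               # rows between train end and forecast start
        "max_delay": 7,         # lag window for TimeSeriesFeaturizer
        "forecast_horizon": 3,
    },
)
automl.search()
```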
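The PyCaret flow from its table row: setup flags, compare_models over the predefined pipelines, then tune_model with custom_grid (the grid keys below are hypothetical and depend on which model wins):

```python
# Minimal pycaret.time_series sketch on a synthetic monthly series.
import pandas as pd
from pycaret.time_series import setup, compare_models, tune_model

series = pd.Series(
    range(48),
    index=pd.period_range("2018-01", periods=48, freq="M"),
)
setup(data=series, fh=12, session_id=42)  # fh = forecast horizon
best = compare_models()                   # runs the predefined pipelines
# custom_grid keys are model-specific; "sp" (seasonal period) is a hypothetical example
tuned = tune_model(best, custom_grid={"sp": [4, 12]})
```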
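The TPOT fault-tolerance row in practice: warm_start keeps the evolved population between fit() calls, and memory="auto" caches fitted pipeline steps, so an interrupted run can be continued:

```python
# Minimal TPOT sketch: warm_start=True lets a second fit() continue from the
# existing population (e.g. after Ctrl+C); memory="auto" caches fitted steps.
from sklearn.datasets import make_regression
from tpot import TPOTRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
tpot = TPOTRegressor(
    generations=5,
    population_size=20,
    warm_start=True,   # keep the population between fit() calls
    memory="auto",     # cache fitted transformers on disk
    random_state=42,
    verbosity=2,
)
tpot.fit(X, y)         # can be interrupted with Ctrl+C...
tpot.fit(X, y)         # ...and resumed from the existing population
```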
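auto-sklearn's "limit the search space at feature/model granularity" from the comments column is done via include/exclude; the dict form below is the newer API (older releases used include_estimators/include_preprocessors instead):

```python
# Minimal auto-sklearn sketch restricting the search space; component names
# must match auto-sklearn's registered components.
from sklearn.datasets import make_regression
import autosklearn.regression

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,
    include={
        "regressor": ["random_forest", "gradient_boosting"],
        "feature_preprocessor": ["no_preprocessing"],
    },
)
automl.fit(X, y)
```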
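Finally, AutoGluon's TimeSeriesPredictor with the long-format multiple-series input mentioned in its row (two placeholder series):

```python
# Minimal autogluon.timeseries sketch: long-format frame holding two series.
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

df = pd.DataFrame({
    "item_id": ["A"] * 60 + ["B"] * 60,
    "timestamp": list(pd.date_range("2021-01-01", periods=60, freq="D")) * 2,
    "target": [float(i) for i in range(120)],
})
train = TimeSeriesDataFrame.from_data_frame(
    df, id_column="item_id", timestamp_column="timestamp"
)
predictor = TimeSeriesPredictor(prediction_length=12, target="target")
predictor.fit(train)                 # Ray is used for the HPO part only, per the table
forecast = predictor.predict(train)  # next 12 steps for each series
```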

Notes

  • baseline models with heuristics, common models and ensembling at the end

Additional material

  • https://github.com/hibayesian/awesome-automl-papers
