
AutoML solutions review


Premise:

We have multiple models and feature generators to use, and we don't know which model or features to choose. We have limited CPU and time resources. We have only a target metric that we want to optimize. We have multiple nodes to run on.

Aim:

A library-level instrument to find the best possible forecasting approach under the constraints defined above.

Checklist:

  • time series support
    • Yes/No
    • How does it work?
  • distributed
    • Yes/No
    • How is it implemented, and what is used for job scheduling and running?
    • Can it work in detached mode, or must workers keep a permanent connection to the master node?
  • feature engineering/selection
    • Yes/No
    • How is the feature space generated?
    • Do we optimize the full global space of chosen features, models and hyperparameters at once, or proceed in a waterfall manner: choose features with some method, then tune model parameters?
  • fault tolerance
    • What happens if some trials raise errors, for example?
    • Can we restart a job from the stopped point, or must we start optimization from the beginning?
  • as a service
    • Can we expose only an API and run jobs via requests?

Table:

| library | time series support | distributed | feature engineering/selection | fault tolerance | as a service | comments |
|---|---|---|---|---|---|---|
| Flaml | notebook<br>1. Regression models with lags (hcrystalball-backed) + TS-specific models like Prophet and SARIMA<br>2. No multiple series support | ray as backend | Feature selection ♾️ issue<br>Hardcoded feature generation ref | flaml.tune has a max_failure parameter, but it doesn't seem to work as expected | 🚫 | 1. Zero-shot AutoML: features for the metalearner: Dataset, NumberOfInstances, NumberOfFeatures, NumberOfClasses, PercentageOfNumericFeatures<br>2. Additional algorithms for HPO<br>3. Predefined low_cost_init_value in search_space() of models for cost-efficient search |
| h2o | 🚫 There is no classical TS support; you can work with time series only via a tabular approach | Spark, local threads | There is no special auto feature engineering in the free version. ref | Resume support after failure, and caching | REST API and H2O Flow | 1. Predefined grid for hyperparameter search<br>2. H2O is more of a black-box service. You can't extend it easily (it uses Java, for example), but it works the same way whatever the data size: if you have a Hadoop or Spark cluster, it will automatically compute the job there -- all infrastructure, data sources and algorithms are already implemented for distributed computing |
| evalml | example<br>1. No multiple series support<br>2. ARIMA, Prophet and regression models | dask backed or multiprocessing ref | You should preprocess the data by hand with special transforms, e.g. TimeSeriesImputer. Once tuning starts, TimeSeriesFeaturizer handles feature generation: it creates lags and selects important lags via correlations. Full DAG of generation | callbacks for error handling | 🚫 | DAG of feature generation pipeline |
| PyCaret | models - predefined pipelines with preprocessing and feature generation<br>1. Based on sktime<br>2. No multiple series support | via fugue (dask + spark) | you can use setup flags for preprocessing and feature selection | 1. There is a fitted-transform cache<br>2. It seems there are no parameters for error-handling settings | 🚫 | 1. After best-model selection you can tune hyperparameters via tune_model. Some models seem to have a predefined grid in the tune_grid attribute, but in most cases you have to specify the grid via the custom_grid parameter<br>2. Beautiful docs via GitBook, and a cheatsheet |
| LightAutoML | 🚫 | multiprocessing | 1. There are predefined pipelines - example<br>2. You can customize pipelines with feature selection steps or add new features via library classes | There are no special flags or memoization | 🚫 | |
| MLJAR | 🚫 | multiprocessing | | | 🚫 | 1. Good documentation and README.md<br>2. Four training modes: Explain, Optuna, Compete, Perform |
| TPOT | 🚫 | dask backed | feature construction from predefined configs and advanced feature selection modes | warm_start and memoization - you can continue progress after Ctrl+C | 🚫 | Good docs |
| microsoft/nni | 🚫 | Multiple ways to distribute tasks | There are some classes for feature selection, but they are not integrated into the general pipeline; they live alone | You can resume an experiment; all artifacts are persisted to disk | there is a web interface for experiment monitoring | It's more about experiment tracking, with multiple extensions such as HPO. It could be considered an open-source alternative to wandb. There are no classical AutoML dependencies, but you can build your own AutoML on top of the NNI infrastructure |
| auto-sklearn | 🚫 | dask backed | BO on a predefined feature+model space; the starting point is determined by knowledge of performance on OpenML datasets | there is no checkpointing or reload | 🚫 | 1. You can add custom pipelines or limit the search space at feature/model granularity<br>2. This library is more about optimal search algorithms than customisability and extensibility: the team developed the SMAC algorithm, so it should be interesting to read their papers first of all |
| autogluon | ✅ Multiple series support, based on sktime and gluonts models | Ray backed (only for HPO) | For TS - no. For tabular data: numerical features get no extra generation or preprocessing; categorical and datetime types get extra transformations | TimeSeriesPredictor doesn't have any checkpointing or explicit error handlers | 🚫 | 1. Good integration with gluon-ts |
| FEDOT | ✅ multi-target or one-dimensional time series; it seems there is no multiple series support | multiprocessing or proprietary cloud | Dynamic graph optimisation via genetic algorithms; much in common with autosktime and TPOT | Results caching in history_folder, but it seems there is no support for continuing a fit | 🚫 | 1. DAG and genetic algorithms<br>2. Bad documentation |
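
To make the rows above concrete, a few minimal API sketches follow. These are untested sketches based on each library's documented entry points around the time of writing; dataset contents, file names and column names are placeholders. First, FLAML's ts_forecast task, where period is the forecast horizon:

```python
# Minimal FLAML time-series sketch: the first column of `dataframe` must be the
# datetime column; `period` is the forecast horizon. Placeholder data throughout.
import pandas as pd
from flaml import AutoML

df = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=120, freq="D"),
    "y": range(120),
})
automl = AutoML()
automl.fit(
    dataframe=df,
    label="y",
    task="ts_forecast",    # time-series forecasting task
    period=12,             # forecast horizon
    time_budget=60,        # seconds of search
    eval_method="holdout",
)
```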
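h2o's Python client drives the same cluster that the REST API and H2O Flow (from the "as a service" column) expose; a sketch with a placeholder CSV and target column:

```python
# Minimal H2OAutoML sketch; "train.csv" and "target" are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                   # starts/attaches to a cluster; the REST API sits behind it
train = h2o.import_file("train.csv")
aml = H2OAutoML(max_runtime_secs=300, seed=1)
aml.train(y="target", training_frame=train)  # all remaining columns are used as features
print(aml.leaderboard)
```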
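evalml's time-series mode is driven by problem_configuration, which is also where TimeSeriesFeaturizer takes its lag window (max_delay); a sketch with placeholder columns:

```python
# Minimal evalml time-series sketch; the frame must carry the time index column.
import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=100, freq="D"),
    "feature": range(100),
})
y = pd.Series(range(100), dtype="float64")

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="time series regression",
    problem_configuration={
        "time_index": "date",   # datetime column in X
        "gap": 0,               # rows between train end and forecast start
        "max_delay": 7,         # lag window for TimeSeriesFeaturizer
        "forecast_horizon": 3,
    },
)
automl.search()
```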
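The PyCaret flow from its table row: setup flags, compare_models over the predefined pipelines, then tune_model with custom_grid (the grid keys below are hypothetical and depend on which model wins):

```python
# Minimal pycaret.time_series sketch on a synthetic monthly series.
import pandas as pd
from pycaret.time_series import setup, compare_models, tune_model

series = pd.Series(
    range(48),
    index=pd.period_range("2018-01", periods=48, freq="M"),
)
setup(data=series, fh=12, session_id=42)  # fh = forecast horizon
best = compare_models()                   # runs the predefined pipelines
# custom_grid keys are model-specific; "sp" (seasonal period) is a hypothetical example
tuned = tune_model(best, custom_grid={"sp": [4, 12]})
```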
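The TPOT fault-tolerance row in practice: warm_start keeps the evolved population between fit() calls, and memory="auto" caches fitted pipeline steps, so an interrupted run can be continued:

```python
# Minimal TPOT sketch: warm_start=True lets a second fit() continue from the
# existing population (e.g. after Ctrl+C); memory="auto" caches fitted steps.
from sklearn.datasets import make_regression
from tpot import TPOTRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
tpot = TPOTRegressor(
    generations=5,
    population_size=20,
    warm_start=True,   # keep the population between fit() calls
    memory="auto",     # cache fitted transformers on disk
    random_state=42,
    verbosity=2,
)
tpot.fit(X, y)         # can be interrupted with Ctrl+C...
tpot.fit(X, y)         # ...and resumed from the existing population
```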
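auto-sklearn's "limit the search space at feature/model granularity" from the comments column is done via include/exclude; the dict form below is the newer API (older releases used include_estimators/include_preprocessors instead):

```python
# Minimal auto-sklearn sketch restricting the search space; component names
# must match auto-sklearn's registered components.
from sklearn.datasets import make_regression
import autosklearn.regression

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,
    include={
        "regressor": ["random_forest", "gradient_boosting"],
        "feature_preprocessor": ["no_preprocessing"],
    },
)
automl.fit(X, y)
```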
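Finally, AutoGluon's TimeSeriesPredictor with the long-format multiple-series input mentioned in its row (two placeholder series):

```python
# Minimal autogluon.timeseries sketch: long-format frame holding two series.
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

df = pd.DataFrame({
    "item_id": ["A"] * 60 + ["B"] * 60,
    "timestamp": list(pd.date_range("2021-01-01", periods=60, freq="D")) * 2,
    "target": [float(i) for i in range(120)],
})
train = TimeSeriesDataFrame.from_data_frame(
    df, id_column="item_id", timestamp_column="timestamp"
)
predictor = TimeSeriesPredictor(prediction_length=12, target="target")
predictor.fit(train)                 # Ray is used for the HPO part only, per the table
forecast = predictor.predict(train)  # next 12 steps for each series
```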

Notes

  • baseline models with heuristics, common models and ensembling at the end

Additional material

  • https://github.com/hibayesian/awesome-automl-papers
