AutoML solutions review
Premise:
We have multiple models and feature generators to use, and we don't know which model or features to choose. We have limited CPU and time resources, only a target metric we want to optimize, and multiple nodes to run on.
Aim:
A library-level instrument for finding the best possible forecasting approach under the constraints defined above.
Checklist:
- time series support
  - Yes/No
  - How does it work?
- distributed
  - Yes/No
  - How is it implemented, and what is used for job scheduling and running?
  - Can it work in detached mode, or must workers keep a permanent connection to the master node?
- feature engineering/selection
  - Yes/No
  - How is the feature space generated?
  - Do we optimize the full global space of chosen features, models, and hyperparameters at once, or proceed in a waterfall manner: choose features with some method, then tune model parameters?
- fault tolerance
  - What happens if some trials raise errors, for example?
  - Can we restart a job from the point where it stopped, or must we start the optimization from the beginning?
- as a service
  - Can we expose only an API and run jobs via requests?
Table:
| library | time series support | distributed | feature engineering/selection | fault tolerance | as a service | comments |
|---|---|---|---|---|---|---|
| FLAML | ✅ notebook 1. regression models with lags (hcrystalball backed) + TS-specific models like Prophet and SARIMA (sketch below the table) 2. no multiple series support | Ray as backend | feature selection: ♾️ issue; hardcoded feature generation: ref | `flaml.tune` has a `max_failure` parameter, but it seems not to work as expected | 🚫 | 1. zero-shot AutoML; metalearner features: Dataset, NumberOfInstances, NumberOfFeatures, NumberOfClasses, PercentageOfNumericFeatures 2. additional algorithms for HPO 3. predefined `low_cost_init_value` in models' `search_space()` for cost-efficient search |
| h2o | 🚫 no classical TS support; you can work with time series only via the tabular approach | Spark, local threads | no special auto feature engineering in the free version | resume support after failing, and caching (ref) | REST API and H2O Flow | 1. predefined grid for hyperparameter search 2. H2O is more of a black-box service: you can't extend it easily (it uses Java, for example), but it works the same way whatever the data size; if you have a Hadoop cluster or Spark, it will automatically compute the job there, since all infrastructure, data sources, and algorithms are already implemented for distributed computing |
| evalml | ✅ example 1. no multiple series support 2. ARIMA, Prophet, and regression models (sketch below the table) | Dask backed or multiprocessing (ref) | you preprocess data by hand with special transforms such as TimeSeriesImputer; once tuning starts, TimeSeriesFeaturizer generates features: it creates lags and selects important lags via correlations (full DAG of generation) | callbacks for error handling | 🚫 | DAG of feature generation pipeline |
| PyCaret | ✅ models are predefined pipelines with preprocessing and feature generation 1. based on sktime 2. no multiple series support | via Fugue (Dask + Spark) | you can use `setup` flags for preprocessing and feature selection | 1. there is a fitted-transform cache 2. it seems there are no settings for error handling | 🚫 | 1. after best-model selection you can tune hyperparameters via `tune_model` (sketch below the table); some models seem to have a predefined grid in the `tune_grid` attribute, but in most cases you have to specify a grid via the `custom_grid` parameter 2. beautiful docs via GitBook, plus a cheatsheet |
| LightAutoML | 🚫 | multiprocessing | 1. predefined pipelines (example) 2. you can customize pipelines with feature selection steps or add new features via library classes | no special flags or memoization | 🚫 | |
| MLJAR | 🚫 | multiprocessing | 1. Golden Features with feature selection | | 🚫 | 1. good documentation and README.md 2. four modes of training: Explain, Optuna, Compete, Perform |
| TPOT | 🚫 | Dask backed | feature construction from predefined configs and advanced feature selection modes | `warm_start` and memoization: you can continue progress after Ctrl+C (sketch below the table) | 🚫 | good docs |
| microsoft/nni | 🚫 | multiple ways to distribute tasks | there are some classes for feature selection, but they are not integrated into the general pipeline and live on their own | you can resume an experiment; all artifacts are persisted to disk | there is a web interface for experiment monitoring | it's more an experiment-tracking system with multiple extensions such as HPO; it can be considered an open-source alternative to wandb; there are no classical AutoML dependencies, but you can build your own AutoML on top of the nni infrastructure |
| auto-sklearn | 🚫 | Dask backed | Bayesian optimization (BO) over a predefined feature+model space; the starting point is determined by knowledge of performance on OpenML datasets | no checkpointing or reload | 🚫 | 1. you can add custom pipelines or limit the search space at feature/model granularity 2. the library is more about optimal search algorithms than customisability and extensibility: the team developed the SMAC algorithm, so it should be interesting to read their papers first of all 3. good docs |
| autogluon | ✅ multiple series support, based on sktime and gluonts models (sketch below the table) | Ray backed (only for HPO) | for TS: no; for tabular data: numerical features get no extra generation or preprocessing, while categorical and datetime types get extra transformations | TimeSeriesPredictor has no checkpointing or explicit error handlers | 🚫 | 1. good integration with gluon-ts |
| FEDOT | ✅ multi-target or one-dimensional time series; it seems there is no multiple series support | multiprocessing or a proprietary cloud | dynamic graph optimisation via genetic algorithms; much in common with auto-sktime and TPOT | results caching in `history_folder`, but it seems there is no support for continuing a fit | 🚫 | 1. DAG and genetic algorithms 2. bad documentation |
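To make the FLAML row concrete, here is a minimal sketch of its time series task, assuming the documented `ts_forecast` interface; the toy data and budget values are illustrative:

```python
import pandas as pd
from flaml import AutoML

# Toy daily series; FLAML's ts_forecast task takes the timestamp column as X
# and the series values as y.
df = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=120, freq="D"),
    "y": [float(i % 7 + i / 10) for i in range(120)],
})

automl = AutoML()
automl.fit(
    X_train=df[["ds"]],
    y_train=df["y"],
    task="ts_forecast",     # regression-with-lags plus Prophet/SARIMA estimators
    period=12,              # forecast horizon
    time_budget=60,         # seconds, i.e. the limited-CPU/time constraint
    eval_method="holdout",
)
print(automl.best_estimator)
```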
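The evalml row's hand preprocessing plus lag generation is wired through `problem_configuration`; a minimal sketch, assuming the `time series regression` problem type (column names and horizon values here are illustrative):

```python
import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "feature": range(100),
})
y = pd.Series([float(i) for i in range(100)])

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="time series regression",
    problem_configuration={
        "time_index": "date",    # timestamp column
        "gap": 0,                # rows between train end and forecast start
        "max_delay": 7,          # max lag TimeSeriesFeaturizer may create
        "forecast_horizon": 7,   # steps ahead to predict
    },
)
automl.search()
best = automl.best_pipeline
```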
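For the PyCaret row, the `setup` flags plus `tune_model`/`custom_grid` flow reads like this minimal sketch (the `sp` grid is a hypothetical example that fits a seasonal sktime model, not a universal parameter):

```python
import pandas as pd
from pycaret.time_series import setup, compare_models, tune_model

# Quarterly toy series with a PeriodIndex, as the time series module expects.
series = pd.Series(
    [float(i % 4 + i / 8) for i in range(48)],
    index=pd.period_range("2010Q1", periods=48, freq="Q"),
)

setup(data=series, fh=8, session_id=42)  # setup flags drive preprocessing
best = compare_models()                  # search over predefined pipelines
tuned = tune_model(best, custom_grid={"sp": [4, 8]})  # hypothetical grid
```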
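The TPOT fault-tolerance claim maps to the `warm_start` and `memory` constructor flags; a sketch assuming `TPOTRegressor` (data is synthetic):

```python
from sklearn.datasets import make_regression
from tpot import TPOTRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

tpot = TPOTRegressor(
    generations=5,
    population_size=20,
    warm_start=True,   # keep the evaluated population between fit() calls
    memory="auto",     # cache fitted pipeline steps during the search
    random_state=0,
)
tpot.fit(X, y)  # after Ctrl+C, calling fit() again resumes from the kept population
```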
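For the autogluon row, the multiple-series support comes from its long-format input; a minimal sketch, assuming `TimeSeriesDataFrame`/`TimeSeriesPredictor` (column names and budgets are illustrative):

```python
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Long format: one row per (item_id, timestamp), several series in one frame.
df = pd.DataFrame({
    "item_id": ["a"] * 60 + ["b"] * 60,
    "timestamp": list(pd.date_range("2021-01-01", periods=60, freq="D")) * 2,
    "target": [float(i % 7) for i in range(120)],
})
train_data = TimeSeriesDataFrame.from_data_frame(
    df, id_column="item_id", timestamp_column="timestamp"
)

predictor = TimeSeriesPredictor(prediction_length=14, target="target")
predictor.fit(train_data, time_limit=300)  # seconds; the time/CPU budget
forecasts = predictor.predict(train_data)
```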
Notes:
- baseline models with heuristics, common models, and ensembling at the end (see the sketch below)
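A minimal sketch of what that note could mean in code; the baselines, metric, and ensembling rule here are illustrative assumptions, not an existing etna API:

```python
import numpy as np

def naive_forecast(y, h):
    return np.repeat(y[-1], h)                        # heuristic: last value

def seasonal_naive_forecast(y, h, season=7):
    return np.tile(y[-season:], h // season + 1)[:h]  # heuristic: repeat last season

def mean_forecast(y, h):
    return np.repeat(y.mean(), h)                     # "common model" stand-in

def smape(actual, pred):
    return np.mean(2 * np.abs(pred - actual) / (np.abs(actual) + np.abs(pred)))

def select_and_ensemble(y_train, y_valid, top_k=2):
    h = len(y_valid)
    models = [naive_forecast, seasonal_naive_forecast, mean_forecast]
    # rank candidates by validation error, cheapest heuristics included
    scored = sorted(models, key=lambda m: smape(y_valid, m(y_train, h)))
    history = np.concatenate([y_train, y_valid])
    # ensembling at the end: average the forecasts of the best candidates
    return np.mean([m(history, h) for m in scored[:top_k]], axis=0)

y = np.sin(np.arange(100) / 7.0)
print(select_and_ensemble(y[:90], y[90:]))
```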
Additional material:
- https://github.com/hibayesian/awesome-automl-papers