enh: Post-Improving Boosting Models
Summary
TL;DR:
- Update the initial assumptions with boosting models;
- Add new evolutionary mutations tied to boosting models;
- Allow boosting models to consume data with NaNs;
- Allow boosting models to be fitted on GPU.
Motivation
The motivation for refactoring the boosting models (#1155, #1209, #1264) was to:
- Update the implementation and split it into a separate strategy class.
- Allow boosting models to use categorical features without encoding.
- Create a basis for implementing fitting with bagging (#1005), as used in other popular AutoML frameworks.
Results of testing on OpenML are available here. During development, further ideas for improvement arose; they are described below.
Guide-level explanation
1. Updating initial assumptions with boosting models
Add more pipelines containing boosting models to the initial assumptions, and update the presets to use boosting models. Note that boosting models offer several strategies that can be used across different pipelines and presets. You can find more information about the strategies in the boosting frameworks' documentation.
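As a rough sketch of what such an initial assumption could look like through FEDOT's public API (the `PipelineBuilder` usage, the `initial_assumption` argument, and the `'xgboost'`/`'lgbm'`/`'catboost'` operation names are assumptions here, used purely for illustration):

```python
# A minimal sketch, assuming FEDOT's PipelineBuilder and boosting operation
# names are available as shown; not a definitive implementation.
from fedot.api.main import Fedot
from fedot.core.pipelines.pipeline_builder import PipelineBuilder

# Candidate initial assumptions built around boosting models.
boosting_assumptions = [
    PipelineBuilder().add_node('scaling').add_node('xgboost').build(),
    PipelineBuilder().add_node('lgbm').build(),
    PipelineBuilder().add_node('catboost').build(),
]

model = Fedot(
    problem='classification',
    initial_assumption=boosting_assumptions,  # seed the evolution with boosting pipelines
    timeout=5,
)
# model.fit(features=X_train, target=y_train)
```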
2. Evolutionary Mutations for Boosting Models
With the updated parameters, it became possible to add new mutations for pipelines containing boosting models. The following mutations are proposed; a sketch of how a couple of them could be implemented follows the list.
Boosting strategy mutation:
- Switch to another strategy method.
Using category mutation:
- Switch to `enable_categorical` (default: `True`); see the sketch below.
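For illustration, a minimal sketch of what `enable_categorical` switches on at the XGBoost level, assuming a recent XGBoost version where the flag works together with the `hist` tree method:

```python
# pandas 'category' columns are consumed directly, without one-hot encoding.
import pandas as pd
from xgboost import XGBClassifier

X = pd.DataFrame({
    'colour': pd.Categorical(['red', 'green', 'red', 'blue']),
    'size': [1.0, 2.5, 3.0, 0.5],
})
y = [0, 1, 1, 0]

model = XGBClassifier(tree_method='hist', enable_categorical=True)
model.fit(X, y)  # no manual encoding of the 'colour' column needed
```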
Change `early_stopping_rounds` mutation:
- Increase or decrease `early_stopping_rounds` to change the fitting time within the population.
Improving the metric of the XGBoost model mutation:
- Decrease `max_depth`.
- Increase `min_child_weight`, `gamma`, `lambda`.
Improving robustness to noise of the XGBoost model mutation:
- Decrease `subsample`, `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` by some step.
- Switch to the `dart` strategy method.
Improving the metric of the LightGBM model mutation:
- Decrease `learning_rate`.
- Increase `max_bin`, `num_iterations`, `num_leaves`.
- Switch to the `dart` strategy method.
Improving robustness to overfitting of the LightGBM model mutation:
- Decrease `max_bin` and `num_leaves`.
- Increase or decrease `min_data_in_leaf` and `min_sum_hessian_in_leaf`.
- Use `bagging_fraction` and `bagging_freq`.
- Use `feature_fraction`.
- Use regularization methods: `lambda_l1`, `lambda_l2`, `min_gain_to_split`, and `extra_trees`.
- Decrease `max_depth`.
Improving robustness to overfitting of the CatBoost model mutation:
- Increase or decrease `l2_leaf_reg`, `colsample_bylevel`, `subsample`.
- Decrease `max_depth`.
- Increase or decrease `iterations`.
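A minimal sketch of how two of these mutations could be implemented, expressed as pure functions over a hyperparameter dict; the step sizes, bounds, and the 50/50 branch choice are illustrative assumptions, and the wiring into FEDOT's mutation operators is omitted:

```python
# Illustrative sketch of the XGBoost mutations described above.
import random

def improve_metric_xgboost(params: dict) -> dict:
    """Decrease max_depth; increase min_child_weight, gamma and lambda."""
    mutated = dict(params)
    mutated['max_depth'] = max(1, params.get('max_depth', 6) - 1)
    mutated['min_child_weight'] = params.get('min_child_weight', 1) + 1
    mutated['gamma'] = params.get('gamma', 0.0) + 0.1
    mutated['lambda'] = params.get('lambda', 1.0) * 1.5
    return mutated

def improve_noise_robustness_xgboost(params: dict, step: float = 0.1) -> dict:
    """Either decrease the sampling ratios by some step, or switch to dart."""
    mutated = dict(params)
    if random.random() < 0.5:
        for key in ('subsample', 'colsample_bytree',
                    'colsample_bylevel', 'colsample_bynode'):
            mutated[key] = max(step, params.get(key, 1.0) - step)
    else:
        mutated['booster'] = 'dart'
    return mutated
```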
3. Allow boosting models to consume data with NaNs
One of the advantages of boosting methods is their native handling of NaNs in the data. Implementing this feature in the current version requires refactoring the preprocessing step that fills in missing values.
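For example, XGBoost treats NaN as a missing value natively, so imputation can be skipped for such models; a minimal sketch:

```python
# Boosting libraries treat NaN as "missing" natively, so no imputation is needed.
import numpy as np
from xgboost import XGBClassifier

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 1.0],
              [4.0, 2.0]])
y = np.array([0, 1, 0, 1])

model = XGBClassifier(tree_method='hist')
model.fit(X, y)  # NaNs are routed down a learned default branch at each split
```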
4. Allow boosting models to be fitted on GPU
The boosting frameworks support GPU acceleration for fitting, so it would be great to add such an option.
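A sketch of the per-library switches that enable GPU fitting; the exact flags depend on the installed versions (for instance, XGBoost below 2.0 uses `tree_method='gpu_hist'` instead of `device='cuda'`):

```python
# Illustrative GPU switches for the three boosting libraries.
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb_gpu = XGBClassifier(tree_method='hist', device='cuda')  # XGBoost >= 2.0
lgbm_gpu = LGBMClassifier(device='gpu')                     # requires a GPU build
cat_gpu = CatBoostClassifier(task_type='GPU')               # requires CUDA
```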
Unresolved Questions
Is it possible to continue #1005? The main idea was to develop the bagging method used in other AutoML frameworks. The main problem faced and not yet solved is that FEDOT's pipeline generation approach differs from other frameworks: parallelizing the fitting of base models on bootstrapped samples and embedding this approach into the composing process is quite time-consuming and resource-intensive, although it yields more stable and accurate models. One option is to apply this approach after composing; for example, if a boosting model is found in the final pipeline, try to retrain it using this method (see the sketch below).
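As a rough illustration of the post-composing idea, the boosting model from the final pipeline could be wrapped in bagging after the search finishes; the sketch below uses scikit-learn's `BaggingClassifier` (assuming scikit-learn >= 1.2 for the `estimator` keyword) purely as an illustration, not FEDOT's actual mechanism:

```python
# Hypothetical post-composing step: bag the found boosting model.
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier

bagged_boosting = BaggingClassifier(
    estimator=XGBClassifier(tree_method='hist'),
    n_estimators=10,  # base models fitted on bootstrapped samples
    n_jobs=-1,        # fit the base models in parallel
)
# bagged_boosting.fit(X_train, y_train)
```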
P.S.
I also note that adding a weighted model, or models from the k-nearest-neighbors family, as a meta-model in an ensemble would help diversify pipelines for classification and regression.
Also note that the current method for detecting categorical features is imperfect.