
[BUG] TabularPredictor fails with fit_extra for regression

Open navinp1912 opened this issue 2 years ago • 3 comments

  • [✓] I have checked that this bug exists on the latest stable version of AutoGluon
  • [✓] and/or I have checked that this bug exists on the latest mainline of AutoGluon via source installation

Describe the bug

TabularPredictor's fit_extra fails for regression with TypeError: '>' not supported between instances of 'float' and 'NoneType'. I've included the complete reproduction script and log. The failing call is predictor.fit_extra(custom_hyperparameters, time_limit=120).

To mitigate it, I tried adding the code below to abstract_trainer.py and it worked, but I'm not sure what the right fix is. The idea: if any model was not initialized (due to the time_limit or other reasons), its score cannot be compared; only when both best_score and cur_score are valid do we compare them, otherwise we skip the current model. A similar error was reported for the high_quality preset and fixed 1 or 2 years ago, but those changes are completely different.

diff --git a/core/src/autogluon/core/trainer/abstract_trainer.py b/core/src/autogluon/core/trainer/abstract_trainer.py
index f8d55745..38576557 100644
--- a/core/src/autogluon/core/trainer/abstract_trainer.py
+++ b/core/src/autogluon/core/trainer/abstract_trainer.py
@@ -1070,6 +1070,10 @@ class AbstractTrainer:
                 else:
                     best_score = self.get_model_attribute(self.model_best, 'val_score')
                     cur_score = self.get_model_attribute(weighted_ensemble_model_name, 'val_score')
+                    if best_score is None:
+                        continue
+                    if cur_score is None:
+                        continue
                     if cur_score > best_score:
                         # new best model
                         self.model_best = weighted_ensemble_model_name
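
For illustration only, the same guard could be factored into a small None-safe helper; this is just a sketch of the idea above, not AutoGluon's actual fix:

```python
def is_better(cur_score, best_score):
    """Return True only when both scores are valid and cur_score wins.

    Models that never finished fitting (e.g. due to time_limit) can carry a
    val_score of None; treating None as 'not better' skips them safely.
    """
    if cur_score is None or best_score is None:
        return False
    return cur_score > best_score
```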

Expected behavior

fit_extra shouldn't crash.

To Reproduce

from autogluon.tabular import TabularDataset, TabularPredictor
from random import uniform

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
label = 'class'

# remove the classification label (drop returns a copy, so reassign)
train_data = train_data.drop(columns=['class'])

# add regression labels: floats in [0, 100] with 2 decimal places
train_data['class'] = [round(uniform(0, 100), 2) for _ in range(len(train_data))]

from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config
save_path = 'agModels11'  # specifies folder to store trained models
custom_hyperparameters = get_hyperparameter_config('default')

# works fine
predictor = TabularPredictor(label=label, path=save_path).fit(train_data, presets='high_quality', time_limit=110, hyperparameters=custom_hyperparameters)

# load the saved model for further training
predictor = TabularPredictor.load(save_path)

# fails with:
# ... lot of stack
#  File "/home/buggluon/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 525, in stack_new_level_aux
#    return self.generate_weighted_ensemble(X=X_stack_preds, y=y,
#  File "/home/buggluon/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 1073, in generate_weighted_ensemble
#    if cur_score > best_score:
#TypeError: '>' not supported between instances of 'float' and 'NoneType'

predictor.fit_extra(custom_hyperparameters, time_limit=120)


Installed Versions, gathered via the following snippet:

from autogluon.core.utils import show_versions
show_versions()

INSTALLED VERSIONS
------------------
date                   : 2022-07-28
time                   : 13:41:38.419504
python                 : 3.8.10.final.0
OS                     : Linux
OS-release             : 5.4.0-88-generic
Version                : #99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021
machine                : x86_64
processor              : x86_64
num_cores              : 4
cpu_ram_mb             : 7632
cuda version           : None
num_gpus               : 0
gpu_ram_mb             : []
avail_disk_size_mb     : 24959

autogluon.common       : 0.5.2b20220726
autogluon.core         : 0.5.2b20220726
autogluon.features     : 0.5.2b20220726
autogluon.multimodal   : 0.5.2b20220726
autogluon.tabular      : 0.5.2b20220726
autogluon.text         : 0.5.2b20220726
autogluon.timeseries   : None
autogluon.vision       : 0.5.2b20220726
autogluon_contrib_nlp  : None
boto3                  : 1.24.37
catboost               : 1.0.6
dask                   : 2021.11.2
distributed            : 2021.11.2
fairscale              : 0.4.7
fastai                 : 2.7.7
gluoncv                : 0.11.0
hyperopt               : 0.2.7
lightgbm               : 3.3.2
matplotlib             : 3.1.2
networkx               : 2.4
nlpaug                 : 1.1.10
nltk                   : 3.7
nptyping               : 1.4.4
numpy                  : 1.21.6
omegaconf              : 2.1.2
pandas                 : 1.4.3
PIL                    : 9.0.1
protobuf               : None
psutil                 : 5.8.0
pytorch-metric-learning: None
pytorch_lightning      : 1.6.5
ray                    : 1.13.0
requests               : 2.22.0
scipy                  : 1.7.3
sentencepiece          : None
skimage                : 0.19.3
sklearn                : 1.0.2
smart_open             : 5.2.1
timm                   : 0.5.4
torch                  : 1.12.0+cu113
torchmetrics           : 0.7.3
torchtext              : 0.13.0
torchvision            : 0.13.0+cu113
tqdm                   : 4.64.0
transformers           : 4.20.1
xgboost                : 1.4.2

Additional context

Complete log message:

Presets specified: ['high_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 110s
AutoGluon will save models to "agModels11/"
AutoGluon Version:  0.5.2b20220726
Python Version:     3.8.10
Operating System:   Linux
Train Data Rows:    39073
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
        Label info (max, min, mean, stddev): (100.0, 0.0, 49.95621, 28.80265)
        If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
        Available Memory:                    3104.94 MB
        Train Data (Original)  Memory Usage: 22.92 MB (0.7% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
                        Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                Fitting CategoryFeatureGenerator...
                        Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
        Types of features in processed data (raw dtype, special dtypes):
                ('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
                ('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('int', ['bool']) : 1 | ['sex']
        0.8s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 2.19 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 0.86s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
        This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
        To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 72.73s of the 109.13s of remaining time.
        -31.5505         = Validation score   (-root_mean_squared_error)
        0.27s    = Training   runtime
        1.55s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 70.71s of the 107.1s of remaining time.
        -32.6513         = Validation score   (-root_mean_squared_error)
        0.6s     = Training   runtime
        1.69s    = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 68.17s of the 104.56s of remaining time.
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
        -28.8001         = Validation score   (-root_mean_squared_error)
        28.94s   = Training   runtime
        0.43s    = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 14.26s of the 50.65s of remaining time.
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
        -28.8028         = Validation score   (-root_mean_squared_error)
        37.15s   = Training   runtime
        0.53s    = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 109.12s of the -2.92s of remaining time.
        -28.8001         = Validation score   (-root_mean_squared_error)
        4.74s    = Training   runtime
        0.0s     = Validation runtime
Fitting 9 L2 models ...
Completed 1/20 k-fold bagging repeats ...
No base models to train on, skipping auxiliary stack level 3...
AutoGluon training complete, total runtime = 118.14s ... Best model: "WeightedEnsemble_L2"
Fitting model: KNeighborsUnif_BAG_L1_FULL | Skipping fit via cloning parent ...
        0.27s    = Training   runtime
        1.55s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1_FULL | Skipping fit via cloning parent ...
        0.6s     = Training   runtime
        1.69s    = Validation runtime
Fitting 1 L1 models ...
Fitting model: LightGBMXT_BAG_L1_FULL ...
        5.18s    = Training   runtime
Fitting 1 L1 models ...
Fitting model: LightGBM_BAG_L1_FULL ...
        1.6s     = Training   runtime
Fitting model: WeightedEnsemble_L2_FULL | Skipping fit via cloning parent ...
        4.74s    = Training   runtime
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("agModels11/")
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_2_BAG_L1 ... Training model for up to 120.0s of the 119.99s of remaining time.
        -31.5505         = Validation score   (-root_mean_squared_error)
        0.74s    = Training   runtime
        1.82s    = Validation runtime
Fitting model: KNeighborsDist_2_BAG_L1 ... Training model for up to 117.17s of the 117.17s of remaining time.
        -32.6513         = Validation score   (-root_mean_squared_error)
        0.69s    = Training   runtime
        1.64s    = Validation runtime
Fitting model: LightGBMXT_2_BAG_L1 ... Training model for up to 114.61s of the 114.6s of remaining time.
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
        -28.8001         = Validation score   (-root_mean_squared_error)
        49.26s   = Training   runtime
        0.75s    = Validation runtime
Fitting model: LightGBM_2_BAG_L1 ... Training model for up to 43.33s of the 43.32s of remaining time.
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
        -28.8028         = Validation score   (-root_mean_squared_error)
        47.47s   = Training   runtime
        0.6s     = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_2_L2 ... Training model for up to 120.0s of the -27.98s of remaining time.
        -28.8001         = Validation score   (-root_mean_squared_error)
        2.28s    = Training   runtime
        0.0s     = Validation runtime
best_score  None
cur_score  -28.800071683030545
Traceback (most recent call last):
  File "/home/buggluon/autogluon/examples/tabular/nojupiter_predict.py", line 31, in <module>
    predictor.fit_extra(custom_hyperparameters,time_limit=120)
  File "/home/buggluon/autogluon/tabular/src/autogluon/tabular/predictor/predictor.py", line 1092, in fit_extra
    fit_models = self._trainer.train_multi_levels(
  File "/home/buggluon/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 295, in train_multi_levels
    base_model_names, aux_models = self.stack_new_level(
  File "/home/buggluon/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 409, in stack_new_level
    aux_models = self.stack_new_level_aux(X=X, y=y, base_model_names=core_models, level=level+1,
  File "/home/buggluon/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 525, in stack_new_level_aux
    return self.generate_weighted_ensemble(X=X_stack_preds, y=y,
  File "/home/buggluon/autogluon/core/src/autogluon/core/trainer/abstract_trainer.py", line 1075, in generate_weighted_ensemble
    if cur_score > best_score:
TypeError: '>' not supported between instances of 'float' and 'NoneType'

navinp1912 commented on Jul 28 '22

Thanks for reporting! This is indeed a bug. You can bypass it by not using high_quality / refit_full before calling fit_extra, then calling refit_full on the final result (i.e. use best_quality, and once you are done calling fit_extra, call predictor.refit_full() to get the same result as if the bug didn't exist).

Will plan to fix this in the next release.
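
In code, the suggested bypass might look like the following minimal sketch (label, path, data, and time limits are simply reused from the reproduction script above):

```python
# Sketch of the bypass: train with best_quality (no refit_full inside fit),
# extend with fit_extra, and only refit at the very end.
predictor = TabularPredictor(label=label, path=save_path).fit(
    train_data, presets='best_quality', time_limit=110,
    hyperparameters=custom_hyperparameters)
predictor.fit_extra(custom_hyperparameters, time_limit=120)
predictor.refit_full()  # produce the fast-inference _FULL models at the end
```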

Innixma commented on Jul 28 '22

> Thanks for reporting! This is indeed a bug. You can bypass it by not using high_quality / refit_full before calling fit_extra, then calling refit_full on the final result (i.e. use best_quality, and once you are done calling fit_extra, call predictor.refit_full() to get the same result as if the bug didn't exist).
>
> Will plan to fix this in the next release.

I found that predictor.refit_full() decreases accuracy. I do a fit with time_limit, then fit_extra with time_limit, and then a refit_full. Is that expected, or am I doing something wrong?

Can you save the state of AutoGluon after the time limit and resume the training? Say, for example, I want to train for 48 hours but I can't run my training for 48 hours continuously. Is there a way that after 24 hours of training on Friday I can pause the training, save it, and then resume it again on Monday?

If my default hyperparameters and time limit are the same for both fit and fit_extra, then there is duplication of models, which wastes training time.

After fit:

```
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
KNeighborsDist_BAG_L1 1.000000 -0.091291 0.398728 0.367611 0.056612 0.398728 0.367611 0.056612 1 True 2
RandomForestMSE_BAG_L1 0.975580 0.824212 3.178744 2.205708 44.445011 3.178744 2.205708 44.445011 1 True 5
ExtraTreesMSE_BAG_L1 0.973658 0.826616 3.296700 2.009215 18.024629 3.296700 2.009215 18.024629 1 True 7

```

Results after fit_extra; I have duplicate models:

```
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
KNeighborsDist_BAG_L1 1.000000 -0.091291 0.445786 0.367611 0.056612 0.445786 0.367611 0.056612 1 True 2
KNeighborsDist_2_BAG_L1 1.000000 -0.091291 0.448205 0.372300 0.058094 0.448205 0.372300 0.058094 1 True 24
RandomForestMSE_2_BAG_L1 0.975580 0.824212 3.970681 1.952887 42.525009 3.970681 1.952887 42.525009 1 True 27
RandomForestMSE_BAG_L1 0.975580 0.824212 5.436502 2.205708 44.445011 5.436502 2.205708 44.445011 1 True 5
ExtraTreesMSE_BAG_L1 0.973658 0.826616 3.205769 2.009215 18.024629 3.205769 2.009215 18.024629 1 True 7
ExtraTreesMSE_2_BAG_L1 0.973658 0.826616 3.509352 2.066019 16.863691 3.509352 2.066019 16.863691 1 True 29

```
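
If duplicates like these have already been trained, one possible cleanup is delete_models from the TabularPredictor API; a sketch, assuming the '_2_' naming above reliably marks the duplicates:

```python
# Preview, then delete, the duplicated '_2_' models produced by re-running
# fit_extra with an identical config. allow_delete_cascade also removes any
# ensembles built on top of them.
dupes = [m for m in predictor.get_model_names() if '_2_' in m]
predictor.delete_models(models_to_delete=dupes, dry_run=True)   # preview only
predictor.delete_models(models_to_delete=dupes, dry_run=False,
                        allow_delete_cascade=True)
```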

navinp0304 commented on Aug 03 '22

> I found that predictor.refit_full() decreases accuracy. I do a fit with time_limit, then fit_extra with time_limit, and then a refit_full. Is that expected, or am I doing something wrong?

Yes, refit_full reduces accuracy but speeds up inference.

https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-quickstart.html#presets

Best = best accuracy
High = same as Best, but refit for lower accuracy and faster inference
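
To quantify that tradeoff on your own data, one option is to compare leaderboards before and after refitting; a sketch, assuming a fitted predictor and a held-out test_data DataFrame:

```python
# Score the bagged models, refit, then score again; score_test typically drops
# slightly for the _FULL models while pred_time_test falls.
lb_before = predictor.leaderboard(test_data, silent=True)
predictor.refit_full()
lb_after = predictor.leaderboard(test_data, silent=True)
print(lb_after[['model', 'score_test', 'pred_time_test']])
```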

> Can you save the state of AutoGluon after the time limit and resume the training? Say, for example, I want to train for 48 hours but I can't run my training for 48 hours continuously. Is there a way that after 24 hours of training on Friday I can pause the training, save it, and then resume it again on Monday?
>
> If my default hyperparameters and time limit are the same for both fit and fit_extra, then there is duplication of models, which wastes training time.

You cannot resume training via the same config you initially used. You will need to investigate which models were trained and adjust your hyperparameters in the fit_extra call. This is not easy to automate for a variety of reasons, especially with a time limit involved, as AutoGluon dynamically changes its strategy at multiple stages of training based on the time limit, and it is hard to recover that state. It may be added eventually, but isn't there yet.
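
A manual version of that investigation might look like the sketch below; pruning keys from the hyperparameter config before fit_extra is an assumed workaround, not an official resume feature:

```python
# Inspect what already trained, then hand-prune the config so fit_extra does
# not fit the same model types again. The keys ('GBM', 'XT', 'KNN', ...) come
# from get_hyperparameter_config('default').
print(predictor.get_model_names())       # e.g. ['LightGBM_BAG_L1', ...]
extra_hyperparameters = dict(custom_hyperparameters)
for key in ['GBM', 'XT', 'KNN']:         # types you judge already covered
    extra_hyperparameters.pop(key, None)
predictor.fit_extra(extra_hyperparameters, time_limit=120)
```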

Innixma commented on Aug 03 '22

@navinp0304 This bug has been fixed in mainline, and will be available in the upcoming v0.6 release. Thanks again for reporting!

Innixma commented on Nov 08 '22

> You cannot resume training via the same config you initially used. You will need to investigate which models were trained and adjust your hyperparameters in the fit_extra call. This is not easy to automate for a variety of reasons, especially with a time limit involved, as AutoGluon dynamically changes its strategy at multiple stages of training based on the time limit, and it is hard to recover that state. It may be added eventually, but isn't there yet.

Hello @Innixma, are there any guidelines or instructions to help set hyperparameters correctly in fit_extra? For example, I excluded NN_TORCH during fit, and the program crashed after it finished training the L1 models. How should I continue my L2 training? Thank you very much.

chine007 commented on Aug 01 '23
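
For the L2 question specifically, fit_extra accepts a base_model_names argument that new stacker models are built on top of. A sketch, assuming the L1 models finished and were saved before the crash:

```python
from autogluon.tabular import TabularPredictor
from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config

# Reload the partially trained predictor and stack L2 models on the surviving
# L1 models. Popping NN_TORCH mirrors the exclusion in the original fit.
save_path = 'agModels11'  # folder used in the earlier fit
predictor = TabularPredictor.load(save_path)
l1_models = [m for m in predictor.get_model_names() if m.endswith('_L1')]
hp = get_hyperparameter_config('default')
hp.pop('NN_TORCH', None)
predictor.fit_extra(hp, base_model_names=l1_models, time_limit=3600)
```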