
TPOTRegressor returns score of -inf when data increases

Open zhh210 opened this issue 4 years ago • 2 comments

Trying to use TPOTRegressor on my own dataset:

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import mean_squared_error
hhsz_ = pd.read_pickle('data_hhsz.pkl')
feature_cols = [c for c in hhsz_.columns
                if 'id' not in c and 'biweek' not in c and c != 'hh_size']
X_train, X_test, y_train, y_test = train_test_split(
    hhsz_.head(100000)[feature_cols].astype('float').values,
    hhsz_.head(100000)['hh_size'].astype('float').values,
    train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTRegressor(generations=1, population_size=1, verbosity=3, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_hhsz_pipeline.py')

Context of the issue

The code works when I use only the first 10,000 rows of my data, but fails when more data is used. The error message doesn't give any clue why this happens. I also double-checked the data: all values are floats with no missing entries. Any suggestion where I can start to figure out which part is failing? tpot==0.11.7 and sklearn==0.24.1

I'm expecting the output to be something like

Generation 1 - Current best internal CV score: -3.3265091734288665
                                                                                              
Best pipeline: RandomForestRegressor(input_matrix, bootstrap=True, max_features=0.7500000000000001, min_samples_leaf=16, min_samples_split=9, n_estimators=100)
-3.1596568188745064

Current result

Generation 1 - Current best internal CV score: -inf
Traceback (most recent call last):                        
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/base.py", line 828, in fit
    log_file=self.log_file_,
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/gp_deap.py", line 281, in eaMuPlusLambda
    per_generation_function(gen)
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/base.py", line 1176, in _check_periodic_pipeline
    self._update_top_pipeline()
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/base.py", line 935, in _update_top_pipeline
    "There was an error in the TPOT optimization "
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 8, in <module>
    tpot.fit(X_train, y_train)
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/base.py", line 863, in fit
    raise e
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/base.py", line 854, in fit
    self._update_top_pipeline()
  File "/home/hadoop/zhan/zhlib/anaconda3/envs/python37/lib/python3.7/site-packages/tpot/base.py", line 935, in _update_top_pipeline
    "There was an error in the TPOT optimization "
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/
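One way to start narrowing this down (a sketch only, with synthetic data from make_regression standing in for the real data_hhsz.pkl) is to bypass TPOT and run cross-validation on a single candidate pipeline with error_score='raise', so scikit-learn re-raises the underlying exception instead of folding it into a -inf score:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-in for the real dataset; replace with the actual X_train / y_train.
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# error_score='raise' surfaces the real exception instead of scoring -inf.
scores = cross_val_score(
    RandomForestRegressor(n_estimators=10, random_state=42),
    X, y,
    cv=5,
    scoring='neg_mean_squared_error',
    error_score='raise',
)
print(scores.mean())
```

If the same failure exists in the data, this should raise the original exception with a usable traceback rather than TPOT's generic RuntimeError.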

zhh210 avatar Feb 26 '21 04:02 zhh210

Hi @zhh210 , would you mind posting your OS (and OS version) as well as the amount of RAM your computer has?

I'm also wondering whether the transformations you are applying in the line with train_test_split could be doing something unexpected to your dataset. It would be helpful if you could save X_train, X_test, y_train, and y_test to a file and share them.
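For sharing the splits, something like the following would keep all four arrays in one compressed file (a sketch; the random arrays are placeholders for the real splits, and the filename is arbitrary):

```python
import numpy as np

# Placeholder arrays standing in for the real X_train, X_test, y_train, y_test.
X_train = np.random.rand(100, 5)
X_test = np.random.rand(25, 5)
y_train = np.random.rand(100)
y_test = np.random.rand(25)

# Save all four splits into a single compressed archive.
np.savez_compressed('tpot_splits.npz', X_train=X_train, X_test=X_test,
                    y_train=y_train, y_test=y_test)

# Reload to verify the round-trip.
data = np.load('tpot_splits.npz')
print(data['X_train'].shape)
```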

JDRomano2 avatar Mar 05 '21 16:03 JDRomano2

@JDRomano2 sorry for the late response, GitHub somehow muted the notification. I was running on an AWS SageMaker instance of type ml.m5.24xlarge, which has 96 vCPUs and 384G of memory. The file has 2.5M rows and is 400M on disk. I didn't see any issue with the data, but it seems TPOT fails once the number of rows exceeds a certain amount:

rows 1-100,000: success
rows 100,000-200,000: success
rows 1-200,000: fail
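Given that pattern (each half succeeds but the full range fails), the exact failure threshold can be located by binary search over the row count. A sketch, assuming a hypothetical fits_ok(n) helper that runs the TPOT fit on the first n rows and returns True on success:

```python
def find_failure_threshold(fits_ok, lo, hi):
    """Binary-search the smallest row count in (lo, hi] where fitting fails.

    Assumes fits_ok(lo) is True and fits_ok(hi) is False.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits_ok(mid):
            lo = mid  # mid rows still fit; failure lies above
        else:
            hi = mid  # mid rows already fail; failure lies at or below
    return hi

# Toy check: pretend fitting fails once the row count exceeds 150,000.
threshold = find_failure_threshold(lambda n: n <= 150_000, 100_000, 200_000)
print(threshold)  # → 150001
```

Knowing whether the threshold is a hard cutoff (suggesting a memory or dtype limit) or varies with the slice (suggesting a bad row) would narrow the cause considerably.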

I tested a TPOTClassifier on a similar data size without issues. Somehow self._optimized_pipeline is None for TPOTRegressor when running on the larger dataset.

zhh210 avatar Apr 07 '21 18:04 zhh210