tpot
My dataset crashed TPOT-NN
Hey,
I am getting a crash in TPOT-NN. My environment is:
>python tpot-NN-rocket-classify.py
Operating system version.... Windows-10-10.0.22000-SP0
Python version is........... 3.8.13
pandas version is........... 1.4.2
numpy version is............ 1.21.5
tpot version is............. 0.11.7
I have put my code and dataset at: https://github.com/CBrauer/TPOT-NN-bug
The program is as follows:
import warnings
warnings.filterwarnings("ignore")
import platform
import sys
import pandas as pd
import numpy as np
import time
from IPython.core.display import HTML, display
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 11)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
class Timer:
    def __init__(self):
        self.start = time.time()

    def restart(self):
        self.start = time.time()

    def get_time(self):
        end = time.time()
        m, s = divmod(end - self.start, 60)
        h, m = divmod(m, 60)
        time_str = "%02d:%02d:%02d" % (h, m, s)
        return time_str

def LoadData():
    df = pd.read_csv('rocket.csv')
    response_column = ['Altitude']
    feature_columns = ['BoxRatio', 'Thrust', 'Acceleration', 'Velocity', 'OnBalRun', 'vwapGain', 'Expect', 'Trin']
    header = feature_columns + response_column
    df_describe = df[feature_columns].describe(include='all')
    display(df_describe)
    X = df[feature_columns].values
    y = df[response_column].values.ravel()
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size = 0.2,
                                                        random_state = 7)
    print('Size of dataset:')
    print(' train shape... ', X_train.shape, y_train.shape)
    print(' test shape.... ', X_test.shape, y_test.shape)
    return X_train, y_train, X_test, y_test

def Main(g, p):
    X_train, y_train, X_test, y_test = LoadData()
    clf = TPOTClassifier(config_dict='TPOT NN',
                         template='Selector-Transformer-PytorchLRClassifier',
                         verbosity=2,
                         generations=g,
                         population_size=p,
                         random_state=7)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    clf.export('tpot_nn_demo_pipeline.py')

if __name__ == "__main__":
    print('Operating system version....', platform.platform())
    print("Python version is........... %s.%s.%s" % sys.version_info[:3])
    print('pandas version is...........', pd.__version__)
    print('numpy version is............', np.__version__)
    print('tpot version is.............', tpot.__version__)
    my_timer = Timer()
    Main(10, 10)
    elapsed = my_timer.get_time()
    print("\nTotal compute time was: %s" % elapsed)
After running a while, I get the following stack trace:
Generation 1 - Current best internal CV score: -inf
Optimization Progress: 2%|█▌ | 200/10100 [05
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 816, in fit
self._pop, _ = eaMuPlusLambda(
File "C:\anaconda3\lib\site-packages\tpot\gp_deap.py", line 281, in eaMuPlusLambda
per_generation_function(gen)
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 1176, in _check_periodic_pipeline
self._update_top_pipeline()
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 924, in _update_top_pipeline
cv_scores = cross_val_score(
File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 509, in cross_val_score
cv_results = cross_validate(
File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 267, in cross_validate
results = parallel(
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
self.results = batch()
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 216, in __call__
return self.function(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 390, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\anaconda3\lib\site-packages\joblib\memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 855, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 806, in fit
return self.partial_fit(X, y, sample_weight)
File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 841, in partial_fit
X = self._validate_data(
File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 566, in _validate_data
X = check_array(X, **check_params)
File "C:\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 814, in check_array
raise ValueError(
ValueError: Found array with 0 feature(s) (shape=(40, 0)) while a minimum of 1 is required by StandardScaler.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tpot-NN-rocket-classify.py", line 83, in <module>
Main(100, 100)
File "tpot-NN-rocket-classify.py", line 69, in Main
clf.fit(X_train, y_train)
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 863, in fit
raise e
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 854, in fit
self._update_top_pipeline()
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 924, in _update_top_pipeline
cv_scores = cross_val_score(
File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 509, in cross_val_score
cv_results = cross_validate(
File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 267, in cross_validate
results = parallel(
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
self.results = batch()
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 216, in __call__
return self.function(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 390, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\anaconda3\lib\site-packages\joblib\memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 855, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 806, in fit
return self.partial_fit(X, y, sample_weight)
File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 841, in partial_fit
X = self._validate_data(
File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 566, in _validate_data
X = check_array(X, **check_params)
File "C:\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 814, in check_array
raise ValueError(
ValueError: Found array with 0 feature(s) (shape=(40, 0)) while a minimum of 1 is required by StandardScaler.
H:\HedgeTools\ML_Model_Generation\TPOT>pause
Press any key to continue . . .
I hope you guys can help me with this problem.
Charles
Do you see the same issue with non-NN TPOT? E.g., if you omit config_dict='TPOT NN'?
OK, is this what you wanted?
def Main(g, p):
    X_train, y_train, X_test, y_test = LoadData()
    # clf = TPOTClassifier(config_dict='TPOT NN',
    clf = TPOTClassifier(template='Selector-Transformer-PytorchLRClassifier',
                         verbosity=2,
                         generations=g,
                         population_size=p,
                         random_state=7)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    clf.export('tpot_nn_demo_pipeline.py')
Now I get:
H:\HedgeTools\ML_Model_Generation\TPOT-NN>python tpot-NN-rocket-classify.py
Operating system version.... Windows-10-10.0.22000-SP0
Python version is........... 3.8.13
pandas version is........... 1.4.2
numpy version is............ 1.21.5
tpot version is............. 0.11.7
BoxRatio Thrust Acceleration Velocity OnBalRun vwapGain Expect Trin
count 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000
mean 2.061707 1.677448 1.935544 0.635225 2.412940 0.984372 -3.026383 0.834455
std 4.491026 3.056146 1.956287 0.658155 1.602910 0.932878 10.023122 0.284409
min 0.034120 0.000383 0.000112 0.000839 0.048550 0.100003 -50.341116 0.280000
25% 0.344533 0.228764 0.566531 0.155102 1.463102 0.379476 -6.661925 0.600000
50% 0.693704 0.713193 1.606062 0.460673 2.086361 0.730599 -2.334339 0.800000
75% 1.619198 1.790019 2.705824 0.903497 2.905308 1.275189 1.273494 1.040000
max 74.699990 40.539430 27.995832 7.809622 22.693728 11.762206 51.561442 4.540000
Size of dataset:
train shape... (48000, 8) (48000,)
test shape.... (12000, 8) (12000,)
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 496, in _add_operators
operator = next(
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tpot-NN-rocket-classify.py", line 84, in <module>
Main(10, 10)
File "tpot-NN-rocket-classify.py", line 70, in Main
clf.fit(X_train, y_train)
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 725, in fit
self._fit_init()
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 618, in _fit_init
self._setup_pset()
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 437, in _setup_pset
self._add_operators()
File "C:\anaconda3\lib\site-packages\tpot\base.py", line 500, in _add_operators
raise ValueError(
ValueError: An error occured while attempting to read the specified template. Please check a step named PytorchLRClassifier
H:\HedgeTools\ML_Model_Generation\TPOT-NN>pause
Press any key to continue . . .
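The StopIteration in this trace comes from a next(...) call inside _add_operators, which TPOT re-raises as the ValueError naming the template step. One plausible reading is that each named template step is looked up among the operators of the active configuration, and PytorchLRClassifier is only available when config_dict='TPOT NN' is set. The sketch below imitates that lookup pattern in plain Python. The operator lists and the resolve_template_step helper are hypothetical stand-ins for illustration, not TPOT's actual code.

```python
# Hypothetical operator names for illustration only; the real sets are
# defined by TPOT's configuration dictionaries.
DEFAULT_OPERATORS = ["SelectPercentile", "StandardScaler", "KNeighborsClassifier"]
NN_OPERATORS = DEFAULT_OPERATORS + ["PytorchLRClassifier", "PytorchMLPClassifier"]


def resolve_template_step(step, operators):
    """Return the operator whose name matches `step`, or raise ValueError."""
    try:
        # next() without a default raises StopIteration when nothing matches.
        return next(op for op in operators if op == step)
    except StopIteration as exc:
        raise ValueError(
            "An error occured while attempting to read the specified template. "
            f"Please check a step named {step}"
        ) from exc


# Found when the NN operators are part of the configuration.
print(resolve_template_step("PytorchLRClassifier", NN_OPERATORS))

# Not found in the default list, so the lookup fails with ValueError.
try:
    resolve_template_step("PytorchLRClassifier", DEFAULT_OPERATORS)
except ValueError as err:
    print(err)
```

Under that assumption, a template that names PytorchLRClassifier only makes sense together with config_dict='TPOT NN', which is why commenting out just the config_dict line fails before any pipeline is evaluated.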
I suppose you meant to delete the first two lines.
If I run:
def Main(g, p):
    X_train, y_train, X_test, y_test = LoadData()
    # clf = TPOTClassifier(config_dict='TPOT NN',
    #                      template='Selector-Transformer-PytorchLRClassifier',
    clf = TPOTClassifier(verbosity=2,
                         generations=g,
                         population_size=p,
                         random_state=7)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    clf.export('tpot_nn_demo_pipeline.py')
I get the following results:
H:\HedgeTools\ML_Model_Generation\TPOT-NN>python tpot-NN-rocket-classify.py
Operating system version.... Windows-10-10.0.22000-SP0
Python version is........... 3.8.13
pandas version is........... 1.4.2
numpy version is............ 1.21.5
tpot version is............. 0.11.7
BoxRatio Thrust Acceleration Velocity OnBalRun vwapGain Expect Trin
count 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000 60000.000000
mean 2.061707 1.677448 1.935544 0.635225 2.412940 0.984372 -3.026383 0.834455
std 4.491026 3.056146 1.956287 0.658155 1.602910 0.932878 10.023122 0.284409
min 0.034120 0.000383 0.000112 0.000839 0.048550 0.100003 -50.341116 0.280000
25% 0.344533 0.228764 0.566531 0.155102 1.463102 0.379476 -6.661925 0.600000
50% 0.693704 0.713193 1.606062 0.460673 2.086361 0.730599 -2.334339 0.800000
75% 1.619198 1.790019 2.705824 0.903497 2.905308 1.275189 1.273494 1.040000
max 74.699990 40.539430 27.995832 7.809622 22.693728 11.762206 51.561442 4.540000
Size of dataset:
train shape... (48000, 8) (48000,)
test shape.... (12000, 8) (12000,)
Generation 1 - Current best internal CV score: 0.9716458333333333
Generation 2 - Current best internal CV score: 0.9843125
Generation 3 - Current best internal CV score: 0.9847291666666667
Generation 4 - Current best internal CV score: 0.9859375
Generation 5 - Current best internal CV score: 0.9871041666666667
Generation 6 - Current best internal CV score: 0.9876875
Generation 7 - Current best internal CV score: 0.9876875
Generation 8 - Current best internal CV score: 0.9892291666666667
Generation 9 - Current best internal CV score: 0.9909791666666667
Generation 10 - Current best internal CV score: 0.9909791666666667
Best pipeline: KNeighborsClassifier(DecisionTreeClassifier(RandomForestClassifier(RFE(CombineDFs(input_matrix, input_matrix), criterion=gini, max_features=0.6500000000000001, n_estimators=100, step=0.1), bootstrap=False, criterion=gini, max_features=0.1, min_samples_leaf=3, min_samples_split=20, n_estimators=100), criterion=gini, max_depth=2, min_samples_leaf=9, min_samples_split=9), n_neighbors=6, p=2, weights=distance)
0.9926666666666667
Total compute time was: 01:23:05
I've never had good results with neural networks anyway. And yes, I've tried TabNet. TPOT beats TabNet every time. Charles
It seems to be an issue when templates are used in conjunction with config_dict='TPOT NN'. When I run your code without a template it runs fine, and the error persists when I swap out your data for a different dataset.
I'll need to do some digging to figure out exactly what is going on, but there seem to be 2 possible contributing factors:
- The feature selector step returning an empty feature set
- A bug in calling assert_all_finite() with two arguments instead of one at: https://github.com/EpistasisLab/tpot/blob/6448bdb71ba08b4a0447c640d2f05a05e1affc21/tpot/builtins/nn.py#L163
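Both suspected factors can be looked at in isolation. The following is a minimal sketch, assuming scikit-learn and NumPy are installed. It uses synthetic noise rather than rocket.csv, SelectFwe with a very small alpha is just one convenient way to make a selector keep no columns, and check_finite_inputs is a hypothetical helper, not the code in nn.py. It is not a fix for TPOT itself.

```python
import warnings

import numpy as np
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import assert_all_finite

# Synthetic stand-in data: 40 rows and 8 columns of pure noise, so no
# feature is related to the labels. This is not the rocket.csv dataset.
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 8))
y = np.tile([0, 1], 20)

# Factor 1: a very strict selector keeps no columns, so the transformer
# that follows it receives an array of shape (40, 0).
pipeline = make_pipeline(SelectFwe(f_classif, alpha=1e-12), StandardScaler())
error_message = None
with warnings.catch_warnings():
    # The selector warns that no features were selected; silence it here.
    warnings.simplefilter("ignore")
    try:
        pipeline.fit(X, y)
    except ValueError as err:
        error_message = str(err)
print(error_message)


# Factor 2: assert_all_finite validates one array per call, so X and y
# are checked separately instead of being passed together.
def check_finite_inputs(X, y):
    assert_all_finite(X)
    assert_all_finite(y)


check_finite_inputs(X, y)  # passes silently for finite data
```

The 40-row shape was chosen to mirror the (40, 0) array reported in the stack trace. Whether the template search actually produces such a selector step is still the open question in this thread.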
Hey,
Thanks for the update.
Charles
From: Joe Romano
Sent: Saturday, April 30, 2022 4:46 PM
To: EpistasisLab/tpot
Cc: Charles Brauer; Author
Subject: Re: [EpistasisLab/tpot] My dataet crashed TOP-NN (Issue #1247)