Error while fitting
I used TPOTRegressor on my dataset, adding and removing features from the input data for different tests. When I use all 18 features of my 28 data points together with sample_weight, TPOT fails to fit with a ValueError. The error does not occur when sample_weight is omitted.
The error also doesn't happen in the same dataset using, for example, only 10 features of those 18, or in a different dataset with 8 features and 55 data points.
Process to reproduce the issue
I'm afraid I cannot share the data. This is a mockup of the code used:
import pandas as pd
import tpot

# Load data (loading code omitted); the variables have these types and shapes:
# train_x: pd.DataFrame, shape (28, 18)
# train_y: pd.Series, shape (28,)
# train_weight: pd.Series, shape (28,)
model = tpot.TPOTRegressor(generations=50, population_size=20, cv=5, random_state=42, verbosity=2)
model.fit(features=train_x, target=train_y, sample_weight=train_weight)
The same result is obtained when using .values on the pandas variables.
Yields:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File .\.venv\lib\site-packages\tpot\base.py:816, in TPOTBase.fit(self, features, target, sample_weight, groups)
815 warnings.simplefilter("ignore")
--> 816 self._pop, _ = eaMuPlusLambda(
817 population=self._pop,
818 toolbox=self._toolbox,
819 mu=self.population_size,
820 lambda_=self._lambda,
821 cxpb=self.crossover_rate,
822 mutpb=self.mutation_rate,
823 ngen=self.generations,
824 pbar=self._pbar,
825 halloffame=self._pareto_front,
826 verbose=self.verbosity,
827 per_generation_function=self._check_periodic_pipeline,
828 log_file=self.log_file_,
829 )
831 # Allow for certain exceptions to signal a premature fit() cancellation
File .\.venv\lib\site-packages\tpot\gp_deap.py:228, in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function, log_file)
226 initialize_stats_dict(ind)
--> 228 population[:] = toolbox.evaluate(population)
230 record = stats.compile(population) if stats is not None else {}
File .\.venv\lib\site-packages\tpot\base.py:1531, in TPOTBase._evaluate_individuals(self, population, features, target, sample_weight, groups)
1530 self._stop_by_max_time_mins()
-> 1531 val = partial_wrapped_cross_val_score(
1532 sklearn_pipeline=sklearn_pipeline
1533 )
1534 result_score_list = self._update_val(val, result_score_list)
File .\.venv\lib\site-packages\stopit\utils.py:145, in base_timeoutable.__call__.<locals>.wrapper(*args, **kwargs)
144 # ``result`` may not be assigned below in case of timeout
--> 145 result = func(*args, **kwargs)
146 return result
File .\.venv\lib\site-packages\tpot\gp_deap.py:416, in _wrapped_cross_val_score(sklearn_pipeline, features, target, cv, scoring_function, sample_weight, groups, use_dask)
393 """Fit estimator and compute scores for a given dataset split.
394
395 Parameters
(...)
414 Whether to use dask
415 """
--> 416 sample_weight_dict = set_sample_weight(sklearn_pipeline.steps, sample_weight)
418 features, target, groups = indexable(features, target, groups)
File .\.venv\lib\site-packages\tpot\operator_utils.py:111, in set_sample_weight(pipeline_steps, sample_weight)
110 for (pname, obj) in pipeline_steps:
--> 111 if inspect.getargspec(obj.fit).args.count("sample_weight"):
112 step_sw = pname + "__sample_weight"
File ~\AppData\Local\Programs\Python\Python310\lib\inspect.py:1245, in getargspec(func)
1244 if kwonlyargs or ann:
-> 1245 raise ValueError("Function has keyword-only parameters or annotations"
1246 ", use inspect.signature() API which can support them")
1247 return ArgSpec(args, varargs, varkw, defaults)
ValueError: Function has keyword-only parameters or annotations, use inspect.signature() API which can support them
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
Cell In[58], line 4
1 import tpot
3 tmp = tpot.TPOTRegressor(generations=50, population_size=20, cv=5, random_state=seed, verbosity=2)
----> 4 tmp.fit(features=train_x.values, target=train_y.values, sample_weight=train_weight.values)
5 # tpot_train_y = tmp.predict(train_x)
6 # tpot_test_y = tmp.predict(test_x)
File .\.venv\lib\site-packages\tpot\base.py:863, in TPOTBase.fit(self, features, target, sample_weight, groups)
860 except (KeyboardInterrupt, SystemExit, Exception) as e:
861 # raise the exception if it's our last attempt
862 if attempt == (attempts - 1):
--> 863 raise e
864 return self
File .\.venv\lib\site-packages\tpot\base.py:854, in TPOTBase.fit(self, features, target, sample_weight, groups)
851 if not isinstance(self._pbar, type(None)):
852 self._pbar.close()
--> 854 self._update_top_pipeline()
855 self._summary_of_best_pipeline(features, target)
856 # Delete the temporary cache before exiting
File .\.venv\lib\site-packages\tpot\base.py:961, in TPOTBase._update_top_pipeline(self)
957 self._last_optimized_pareto_front_n_gens = 0
958 else:
959 # If user passes CTRL+C in initial generation, self._pareto_front (halloffame) shoule be not updated yet.
960 # need raise RuntimeError because no pipeline has been optimized
--> 961 raise RuntimeError(
962 "A pipeline has not yet been optimized. Please call fit() first."
963 )
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
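The ValueError in the first traceback is raised by inspect.getargspec, which rejects any function that has keyword-only parameters or annotations. The error message itself points to inspect.signature. As a minimal sketch of what that check could look like, assuming the goal is only to test whether a step's fit method declares sample_weight: the two classes below are hypothetical stand-ins for pipeline steps, and accepts_sample_weight is an illustrative helper, not part of TPOT.

```python
import inspect


class KeywordOnlyEstimator:
    """Hypothetical stand-in for a pipeline step; not a real TPOT or scikit-learn class."""

    def fit(self, X, y=None, *, sample_weight=None):
        # sample_weight is keyword-only here, the case the error message mentions
        return self


class NoWeightEstimator:
    """Hypothetical stand-in for a step whose fit() does not take sample_weight."""

    def fit(self, X, y=None):
        return self


def accepts_sample_weight(obj):
    """Return True if obj.fit declares a parameter named sample_weight."""
    # inspect.signature handles keyword-only parameters and annotations
    return "sample_weight" in inspect.signature(obj.fit).parameters


has_weight = accepts_sample_weight(KeywordOnlyEstimator())
no_weight = accepts_sample_weight(NoWeightEstimator())
print(has_weight, no_weight)
```

This only illustrates the signature lookup. Whether the same change is sufficient inside set_sample_weight in tpot/operator_utils.py has not been verified here.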
Expected result
Without using sample_weight:
Generation 1 - Current best internal CV score: -0.10226660695789169
Generation 2 - Current best internal CV score: -0.10226660695789169
Generation 3 - Current best internal CV score: -0.08510081133846376
...
Generation 50 - Current best internal CV score: -0.07952325321214902
Best pipeline: AdaBoostRegressor(Nystroem(ExtraTreesRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), bootstrap=False, max_features=0.05, min_samples_leaf=5, min_samples_split=12, n_estimators=100), gamma=0.75, kernel=polynomial, n_components=10), learning_rate=0.01, loss=linear, n_estimators=100)
Environment
OS: Windows 10
Python 3.10.5
TPOT==0.11.7
pandas==1.5.3
numpy==1.24.2