verstack
could not convert string to float: 'x' - using FeatureSelector
When trying to use FeatureSelector I got the "could not convert string to float: 'x'" message.
Command I use (python 3.10):
from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', auto = True)
selected_feats = FS.fit_transform(X_encoded, y)
Error call stack:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[24], line 3
1 from verstack import FeatureSelector
2 FS = FeatureSelector(objective = 'classification', auto = True)
----> 3 selected_feats = FS.fit_transform(X_encoded, y)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\tools.py:19, in timer.<locals>.wrapped(*args, **kwargs)
16 @wraps(func)
17 def wrapped(*args, **kwargs):
18 start = time.time()
---> 19 result = func(*args, **kwargs)
20 end = time.time()
21 elapsed = round(end-start,5)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:232, in FeatureSelector.fit_transform(self, X, y, **kwargs)
230 if self.auto:
231 self.printer.print(f'Comparing LinearRegression and RandomForest for feature selection', order = 2)
--> 232 self._auto_linear_randomforest_selector(X, y, kwargs)
233 else:
234 self.printer.print(f'Running feature selection with {self._model}', order = 2)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:294, in FeatureSelector._auto_linear_randomforest_selector(self, X, y, kwargs)
291 selector_rf = self._get_selector(randomforest_model, y, kwargs)
293 self.printer.print(f'Running feature selection with {linear_model}', order = 2)
--> 294 feats_lr_flags = self._prepare_data_apply_selector(X, y, selector_lr, scale_data = True)
296 self.printer.print(f'Running feature selection with {randomforest_model}', order = 2)
297 feats_rf_flags = self._prepare_data_apply_selector(X, y, selector_rf, scale_data = False)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:251, in FeatureSelector._prepare_data_apply_selector(self, X, y, selector, scale_data)
249 X_subset, y_subset = self._subset_data(X, y)
250 if scale_data:
--> 251 X_subset = self._scale_data(X_subset)
252 try:
253 X_subset, y_subset = self._transform_data_to_float_32(X_subset, y_subset)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:499, in FeatureSelector._scale_data(self, X)
497 from sklearn.preprocessing import StandardScaler
498 scaler = StandardScaler()
--> 499 X = scaler.fit_transform(X)
500 return X
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\base.py:867, in TransformerMixin.fit_transform(self, X, y, **fit_params)
863 # non-optimized default implementation; override when a better
864 # method is possible for a given clustering algorithm
865 if y is None:
866 # fit method of arity 1 (unsupervised transformation)
--> 867 return self.fit(X, **fit_params).transform(X)
868 else:
869 # fit method of arity 2 (supervised transformation)
870 return self.fit(X, y, **fit_params).transform(X)
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\preprocessing\_data.py:809, in StandardScaler.fit(self, X, y, sample_weight)
807 # Reset internal state before fitting
808 self._reset()
--> 809 return self.partial_fit(X, y, sample_weight)
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\preprocessing\_data.py:844, in StandardScaler.partial_fit(self, X, y, sample_weight)
812 """Online computation of mean and std on X for later scaling.
813
814 All of X is processed as a single batch. This is intended for cases
(...)
841 Fitted scaler.
842 """
843 first_call = not hasattr(self, "n_samples_seen_")
--> 844 X = self._validate_data(
845 X,
846 accept_sparse=("csr", "csc"),
847 dtype=FLOAT_DTYPES,
848 force_all_finite="allow-nan",
849 reset=first_call,
850 )
851 n_features = X.shape[1]
853 if sample_weight is not None:
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\base.py:577, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
575 raise ValueError("Validation should be done on X, y or both.")
576 elif not no_val_X and no_val_y:
--> 577 X = check_array(X, input_name="X", **check_params)
578 out = X
579 elif no_val_X and not no_val_y:
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\utils\validation.py:856, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
854 array = array.astype(dtype, casting="unsafe", copy=False)
855 else:
--> 856 array = np.asarray(array, order=order, dtype=dtype)
857 except ComplexWarning as complex_warning:
858 raise ValueError(
859 "Complex data not supported\n{}\n".format(array)
860 ) from complex_warning
File C:\Anaconda3\envs\python_310\lib\site-packages\pandas\core\generic.py:2070, in NDFrame.__array__(self, dtype)
2069 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
-> 2070 return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'x'
Could you help me figure out what I am doing wrong?
Thanks, balgad
P.S. Anyway, it's a great package! :)
Hi. Looks like your data contains string characters. Are you sure it is all numeric?
Try: X_encoded.dtypes
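With many columns, the full `dtypes` listing is easy to misread, so it can help to list only the columns that are not numeric. A minimal sketch with pandas; the toy frame and its column names are made up for illustration and stand in for your `X_encoded`:

```python
import pandas as pd

# Toy stand-in for X_encoded; the column names here are hypothetical.
X_encoded = pd.DataFrame({
    "goals": [1, 2, 3],
    "odds": [1.5, 2.0, 3.1],
    "flag": ["x", "y", "x"],  # a string column that slipped through encoding
})

# Keep only the columns whose dtype is neither numeric nor boolean.
non_numeric = X_encoded.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)
```

Any column that shows up in this list would have to be encoded or dropped before calling `fit_transform`.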
Thank you for your quick answer. Yes, you were right: I didn't notice a string column. My mistake, and sorry for reporting this as a problem.
After removing the string column, however, I got a new error that is connected to the parameter 'auto = True'. Could you help me understand what causes it (and how I should set things up to use auto model selection)?
from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', auto = True, error_score = 'raise')
selected_feats = FS.fit_transform(X_encoded, y)
Message:
ValueError Traceback (most recent call last)
Cell In[67], line 3
1 from verstack import FeatureSelector
2 FS = FeatureSelector(objective = 'classification', auto = True, error_score = 'raise')
----> 3 selected_feats = FS.fit_transform(X_encoded, y)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\tools.py:19, in timer.<locals>.wrapped(*args, **kwargs)
16 @wraps(func)
17 def wrapped(*args, **kwargs):
18 start = time.time()
---> 19 result = func(*args, **kwargs)
20 end = time.time()
21 elapsed = round(end-start,5)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:232, in FeatureSelector.fit_transform(self, X, y, **kwargs)
230 if self.auto:
231 self.printer.print(f'Comparing LinearRegression and RandomForest for feature selection', order = 2)
--> 232 self._auto_linear_randomforest_selector(X, y, kwargs)
233 else:
234 self.printer.print(f'Running feature selection with {self._model}', order = 2)
File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:302, in FeatureSelector._auto_linear_randomforest_selector(self, X, y, kwargs)
300 model = self._get_final_scoring_model(y)
301 self.printer.print(f'Scoring selected feats from linear and RF models by: {model}', order = 2)
--> 302 score_lr = np.mean(cvs(model, X[X.columns[feats_lr_flags]], y, scoring = scoring, cv = 3))
303 score_rf = np.mean(cvs(model, X[X.columns[feats_rf_flags]], y, scoring = scoring, cv = 3))
304 self.printer.print(f'RFE by linear model cv-score : {np.round(score_lr,5)}', order = 4, leading_blank_paragraph=True)
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py:515, in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
512 # To ensure multimetric format is not supported
513 scorer = check_scoring(estimator, scoring=scoring)
--> 515 cv_results = cross_validate(
516 estimator=estimator,
517 X=X,
518 y=y,
519 groups=groups,
520 scoring={"score": scorer},
521 cv=cv,
522 n_jobs=n_jobs,
523 verbose=verbose,
524 fit_params=fit_params,
525 pre_dispatch=pre_dispatch,
526 error_score=error_score,
527 )
528 return cv_results["test_score"]
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py:285, in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
265 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
266 results = parallel(
267 delayed(_fit_and_score)(
268 clone(estimator),
(...)
282 for train, test in cv.split(X, y, groups)
283 )
--> 285 _warn_or_raise_about_fit_failures(results, error_score)
287 # For callabe scoring, the return type is only know after calling. If the
288 # return type is a dictionary, the error scores can now be inserted with
289 # the correct key.
290 if callable(scoring):
File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py:367, in _warn_or_raise_about_fit_failures(results, error_score)
360 if num_failed_fits == num_fits:
361 all_fits_failed_message = (
362 f"\nAll the {num_fits} fits failed.\n"
363 "It is very likely that your model is misconfigured.\n"
364 "You can try to debug the error by setting error_score='raise'.\n\n"
365 f"Below are more details about the failures:\n{fit_errors_summary}"
366 )
--> 367 raise ValueError(all_fits_failed_message)
369 else:
370 some_fits_failed_message = (
371 f"\n{num_failed_fits} fits failed out of a total of {num_fits}.\n"
372 "The score on these train-test partitions for these parameters"
(...)
376 f"Below are more details about the failures:\n{fit_errors_summary}"
377 )
ValueError:
All the 3 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
File "C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\sklearn.py", line 967, in fit
super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\sklearn.py", line 748, in fit
self._Booster = train(
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\engine.py", line 271, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 2605, in __init__
train_set.construct()
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 1815, in construct
self._lazy_init(self.data, label=self.label,
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 1474, in _lazy_init
data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data,
File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 594, in _data_from_pandas
raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: odds_home_team_win, odds_away_team_win, home_team_match_nr, away_team_match_nr, away_team_fault_vs_shot_roll4_sum
If I turn off auto (leave it at the default, False), it works perfectly:
from sklearn.svm import SVC, LinearSVC, NuSVC
from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', error_score = 'raise')
selected_feats = FS.fit_transform(X_encoded, y)
* Initiating FeatureSelector
- Running feature selection with RandomForestClassifier(max_depth=2, n_estimators=50)
. Data decreased for experiments. Working with 11.66% of data
. Selected 5 features from 311
Time elapsed for fit_transform execution: 2 min 20.762 sec
That is odd. The error message (below) indicates that the listed columns have unsupported data types.
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in the following fields: odds_home_team_win, odds_away_team_win, home_team_match_nr, away_team_match_nr, away_team_fault_vs_shot_roll4_sum
Can you check what the data types of these columns are?
If you could share the data sample, I will look into it as well.
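One way to check is to print the exact dtype name of each listed column. The sketch below is only a guess at what might be going on: it assumes the failing check accepts just plain NumPy int, uint, float and bool dtypes, so that pandas nullable dtypes such as `Int64` or `Float64` (capitalised) would be rejected even though they look numeric. I have not confirmed this against your installed LightGBM version, and the toy frame and the `plain_float` column are hypothetical:

```python
import pandas as pd

# Toy frame for illustration; "Int64"/"Float64" (capitalised) are pandas
# nullable extension dtypes, not the plain NumPy int64/float64.
X_encoded = pd.DataFrame({
    "odds_home_team_win": pd.array([1.5, 2.0, None], dtype="Float64"),
    "home_team_match_nr": pd.array([1, 2, None], dtype="Int64"),
    "plain_float": [0.1, 0.2, 0.3],  # hypothetical control column
})

suspects = ["odds_home_team_win", "home_team_match_nr", "plain_float"]

# Assumption: only these plain NumPy dtype names pass the check.
plain_names = (
    {"bool", "float16", "float32", "float64"}
    | {f"{kind}{bits}" for kind in ("int", "uint") for bits in (8, 16, 32, 64)}
)

# Map each suspect column to its exact dtype name and flag the unexpected ones.
report = {col: X_encoded[col].dtype.name for col in suspects}
unexpected = [col for col, name in report.items() if name not in plain_names]

print(report)
print(unexpected)
```

If the columns from the error message end up in `unexpected`, their dtypes are the likely culprit even though they display as ordinary numbers.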
Thank you for the answer. I checked the data types of the dataframe, but all columns seem to be fine.
The 4 columns have normal datatypes:
Unfortunately I cannot share a sample (but I'm still trying to solve the problem). Do you have any idea where the problem is?
Thank you
This may be an issue with the logistic regression. It is unlikely to solve the problem, but just in case, try converting [odds_home_team_win, odds_away_team_win, home_team_match_nr, away_team_match_nr, away_team_fault_vs_shot_roll4_sum] to float32.
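The conversion could look roughly like this. It is a sketch on a toy frame, not your data: only two of the five columns are shown, and the starting dtypes are assumed for illustration.

```python
import pandas as pd

# Toy stand-in for X_encoded with two of the reported columns;
# the starting dtypes here are assumptions.
X_encoded = pd.DataFrame({
    "odds_home_team_win": pd.array([1.5, 2.0, 3.0], dtype="Float64"),
    "home_team_match_nr": pd.array([1, 2, 3], dtype="Int64"),
})

cols = ["odds_home_team_win", "home_team_match_nr"]

# Cast only the listed columns to float32; other columns are left untouched.
X_encoded = X_encoded.astype({col: "float32" for col in cols})

print(X_encoded.dtypes)
```

After the cast, rerun `FS.fit_transform(X_encoded, y)` with `auto = True` and see whether the same columns are still reported.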