could not convert string to float: 'x' - using FeatureSelector

Open balgad opened this issue 1 year ago • 5 comments

When trying to use FeatureSelector I got the "could not convert string to float: 'x'" message.

Command I use (python 3.10):

from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', auto = True)
selected_feats = FS.fit_transform(X_encoded, y)

Error call stack:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[24], line 3
      1 from verstack import FeatureSelector
      2 FS = FeatureSelector(objective = 'classification', auto = True)
----> 3 selected_feats = FS.fit_transform(X_encoded, y)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\tools.py:19, in timer.<locals>.wrapped(*args, **kwargs)
     16 @wraps(func)
     17 def wrapped(*args, **kwargs):
     18     start = time.time()
---> 19     result = func(*args, **kwargs)
     20     end = time.time()
     21     elapsed = round(end-start,5)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:232, in FeatureSelector.fit_transform(self, X, y, **kwargs)
    230 if self.auto:
    231     self.printer.print(f'Comparing LinearRegression and RandomForest for feature selection', order = 2)
--> 232     self._auto_linear_randomforest_selector(X, y, kwargs)
    233 else:
    234     self.printer.print(f'Running feature selection with {self._model}', order = 2)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:294, in FeatureSelector._auto_linear_randomforest_selector(self, X, y, kwargs)
    291 selector_rf = self._get_selector(randomforest_model, y, kwargs)
    293 self.printer.print(f'Running feature selection with {linear_model}', order = 2)
--> 294 feats_lr_flags = self._prepare_data_apply_selector(X, y, selector_lr, scale_data = True)
    296 self.printer.print(f'Running feature selection with {randomforest_model}', order = 2)
    297 feats_rf_flags = self._prepare_data_apply_selector(X, y, selector_rf, scale_data = False)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:251, in FeatureSelector._prepare_data_apply_selector(self, X, y, selector, scale_data)
    249 X_subset, y_subset = self._subset_data(X, y)
    250 if scale_data:
--> 251     X_subset = self._scale_data(X_subset)
    252 try:
    253     X_subset, y_subset = self._transform_data_to_float_32(X_subset, y_subset)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:499, in FeatureSelector._scale_data(self, X)
    497 from sklearn.preprocessing import StandardScaler
    498 scaler = StandardScaler()
--> 499 X = scaler.fit_transform(X)
    500 return X

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\base.py:867, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    863 # non-optimized default implementation; override when a better
    864 # method is possible for a given clustering algorithm
    865 if y is None:
    866     # fit method of arity 1 (unsupervised transformation)
--> 867     return self.fit(X, **fit_params).transform(X)
    868 else:
    869     # fit method of arity 2 (supervised transformation)
    870     return self.fit(X, y, **fit_params).transform(X)

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\preprocessing\_data.py:809, in StandardScaler.fit(self, X, y, sample_weight)
    807 # Reset internal state before fitting
    808 self._reset()
--> 809 return self.partial_fit(X, y, sample_weight)

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\preprocessing\_data.py:844, in StandardScaler.partial_fit(self, X, y, sample_weight)
    812 """Online computation of mean and std on X for later scaling.
    813 
    814 All of X is processed as a single batch. This is intended for cases
   (...)
    841     Fitted scaler.
    842 """
    843 first_call = not hasattr(self, "n_samples_seen_")
--> 844 X = self._validate_data(
    845     X,
    846     accept_sparse=("csr", "csc"),
    847     dtype=FLOAT_DTYPES,
    848     force_all_finite="allow-nan",
    849     reset=first_call,
    850 )
    851 n_features = X.shape[1]
    853 if sample_weight is not None:

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\base.py:577, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    575     raise ValueError("Validation should be done on X, y or both.")
    576 elif not no_val_X and no_val_y:
--> 577     X = check_array(X, input_name="X", **check_params)
    578     out = X
    579 elif no_val_X and not no_val_y:

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\utils\validation.py:856, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    854         array = array.astype(dtype, casting="unsafe", copy=False)
    855     else:
--> 856         array = np.asarray(array, order=order, dtype=dtype)
    857 except ComplexWarning as complex_warning:
    858     raise ValueError(
    859         "Complex data not supported\n{}\n".format(array)
    860     ) from complex_warning

File C:\Anaconda3\envs\python_310\lib\site-packages\pandas\core\generic.py:2070, in NDFrame.__array__(self, dtype)
   2069 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
-> 2070     return np.asarray(self._values, dtype=dtype)

ValueError: could not convert string to float: 'x'

Could you help me understand what I am doing wrong?

Thanks, balgad

ps.: anyway, it's a great package! :)

balgad, Mar 11 '23 16:03

Hi. Looks like your data contains string characters. Are you sure it is all numeric?

Try: X_encoded.dtypes
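
A minimal sketch of that check, assuming `X_encoded` is a pandas DataFrame. The helper name `non_numeric_columns` and the small `X_demo` frame are hypothetical and only stand in for the real data; the idea is to list every column whose dtype is neither numeric nor boolean, since those are the ones a scaler cannot convert to float.

```python
import pandas as pd


def non_numeric_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns whose dtype is neither numeric nor boolean."""
    return df.select_dtypes(exclude=["number", "bool"]).columns.tolist()


# Hypothetical stand-in for X_encoded: column "c" holds strings.
X_demo = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [0.5, 1.5, 2.5],
        "c": ["x", "y", "z"],
    }
)

print(X_demo.dtypes)                # per-column dtypes, same as X_encoded.dtypes
print(non_numeric_columns(X_demo))  # columns to encode or drop before fitting
```

Any column reported here would need to be encoded or dropped before calling `fit_transform`.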

DanilZherebtsov, Mar 11 '23 16:03

Thank you for your quick answer. Yes, you were right: I didn't notice a string column. My mistake, sorry for reporting this as a problem.

Anyway, after removing the string column I got a new error, which is connected to using the parameter 'auto = True'. Could you help me understand what causes this (and how I should set things up to use auto model selection)?

from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', auto = True, error_score = 'raise')
selected_feats = FS.fit_transform(X_encoded, y)

Message:

ValueError                                Traceback (most recent call last)
Cell In[67], line 3
      1 from verstack import FeatureSelector
      2 FS = FeatureSelector(objective = 'classification', auto = True, error_score = 'raise')
----> 3 selected_feats = FS.fit_transform(X_encoded, y)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\tools.py:19, in timer.<locals>.wrapped(*args, **kwargs)
     16 @wraps(func)
     17 def wrapped(*args, **kwargs):
     18     start = time.time()
---> 19     result = func(*args, **kwargs)
     20     end = time.time()
     21     elapsed = round(end-start,5)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:232, in FeatureSelector.fit_transform(self, X, y, **kwargs)
    230 if self.auto:
    231     self.printer.print(f'Comparing LinearRegression and RandomForest for feature selection', order = 2)
--> 232     self._auto_linear_randomforest_selector(X, y, kwargs)
    233 else:
    234     self.printer.print(f'Running feature selection with {self._model}', order = 2)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:302, in FeatureSelector._auto_linear_randomforest_selector(self, X, y, kwargs)
    300 model = self._get_final_scoring_model(y)        
    301 self.printer.print(f'Scoring selected feats from linear and RF models by: {model}', order = 2)
--> 302 score_lr = np.mean(cvs(model, X[X.columns[feats_lr_flags]], y, scoring = scoring, cv = 3))
    303 score_rf = np.mean(cvs(model, X[X.columns[feats_rf_flags]], y, scoring = scoring, cv = 3))
    304 self.printer.print(f'RFE by linear model cv-score       : {np.round(score_lr,5)}', order = 4, leading_blank_paragraph=True)

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py:515, in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    512 # To ensure multimetric format is not supported
    513 scorer = check_scoring(estimator, scoring=scoring)
--> 515 cv_results = cross_validate(
    516     estimator=estimator,
    517     X=X,
    518     y=y,
    519     groups=groups,
    520     scoring={"score": scorer},
    521     cv=cv,
    522     n_jobs=n_jobs,
    523     verbose=verbose,
    524     fit_params=fit_params,
    525     pre_dispatch=pre_dispatch,
    526     error_score=error_score,
    527 )
    528 return cv_results["test_score"]

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py:285, in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    265 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
    266 results = parallel(
    267     delayed(_fit_and_score)(
    268         clone(estimator),
   (...)
    282     for train, test in cv.split(X, y, groups)
    283 )
--> 285 _warn_or_raise_about_fit_failures(results, error_score)
    287 # For callabe scoring, the return type is only know after calling. If the
    288 # return type is a dictionary, the error scores can now be inserted with
    289 # the correct key.
    290 if callable(scoring):

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py:367, in _warn_or_raise_about_fit_failures(results, error_score)
    360 if num_failed_fits == num_fits:
    361     all_fits_failed_message = (
    362         f"\nAll the {num_fits} fits failed.\n"
    363         "It is very likely that your model is misconfigured.\n"
    364         "You can try to debug the error by setting error_score='raise'.\n\n"
    365         f"Below are more details about the failures:\n{fit_errors_summary}"
    366     )
--> 367     raise ValueError(all_fits_failed_message)
    369 else:
    370     some_fits_failed_message = (
    371         f"\n{num_failed_fits} fits failed out of a total of {num_fits}.\n"
    372         "The score on these train-test partitions for these parameters"
   (...)
    376         f"Below are more details about the failures:\n{fit_errors_summary}"
    377     )

ValueError: 
All the 3 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\sklearn.py", line 967, in fit
    super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\sklearn.py", line 748, in fit
    self._Booster = train(
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 2605, in __init__
    train_set.construct()
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 1815, in construct
    self._lazy_init(self.data, label=self.label,
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 1474, in _lazy_init
    data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data,
  File "C:\Users\LaszloV\AppData\Roaming\Python\Python310\site-packages\lightgbm\basic.py", line 594, in _data_from_pandas
    raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: odds_home_team_win, odds_away_team_win, home_team_match_nr, away_team_match_nr, away_team_fault_vs_shot_roll4_sum

If I turn off auto (leave it at the default, False), it works perfectly:

from sklearn.svm import SVC, LinearSVC, NuSVC
from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', error_score = 'raise')
selected_feats = FS.fit_transform(X_encoded, y)
 * Initiating FeatureSelector

   - Running feature selection with RandomForestClassifier(max_depth=2, n_estimators=50)
     . Data decreased for experiments. Working with 11.66% of data
     . Selected 5 features from 311

Time elapsed for fit_transform execution: 2 min 20.762 sec

balgad, Mar 11 '23 18:03

It is odd. The error message (below) indicates that these 5 columns have unsupported data types.

ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: odds_home_team_win, odds_away_team_win, home_team_match_nr, away_team_match_nr, away_team_fault_vs_shot_roll4_sum

Can you check what the data types in these columns are?

If you could share the data sample, I will look into it as well.
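
The dtype check requested above could be sketched as follows. This is only an illustration: the helper `unusual_dtype_columns` and the `demo` frame are hypothetical, and the idea that pandas nullable extension dtypes (such as `Int64` or `Float64`) are involved is an assumption, not something confirmed in this thread. Such dtypes look numeric when printed but are not plain NumPy dtypes, so a strict dtype check may reject them.

```python
import numpy as np
import pandas as pd


def unusual_dtype_columns(df: pd.DataFrame, columns: list[str]) -> dict[str, str]:
    """Map column name -> dtype string for columns that are not plain
    NumPy int, unsigned int, float or bool."""
    flagged = {}
    for col in columns:
        dtype = df[col].dtype
        # Extension dtypes (e.g. "Int64", "Float64") are not np.dtype instances.
        is_plain_numpy = isinstance(dtype, np.dtype) and dtype.kind in "iufb"
        if not is_plain_numpy:
            flagged[col] = str(dtype)
    return flagged


# Hypothetical data: two nullable extension columns and one plain float column.
demo = pd.DataFrame(
    {
        "odds_home_team_win": pd.array([1.5, 2.0, None], dtype="Float64"),
        "home_team_match_nr": pd.array([1, 2, 3], dtype="Int64"),
        "plain_float": [0.1, 0.2, 0.3],
    }
)

print(unusual_dtype_columns(demo, list(demo.columns)))
```

Running the helper on the real frame with the five column names from the error message would show whether their dtypes differ from ordinary `int64`/`float64`.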

DanilZherebtsov, Mar 12 '23 07:03

Thank you for the answer. I checked the datatypes of the dataframe, but all columns seem to be fine.

The columns in question have normal datatypes: [image]

Unfortunately I cannot share a sample (but I'm trying to solve the problem). Do you have an idea where the problem is?

Thank you

balgad, Mar 14 '23 19:03

This may be an issue with the logistic regression. It is unlikely to solve the problem, but just in case, try converting [odds_home_team_win, odds_away_team_win, home_team_match_nr, away_team_match_nr, away_team_fault_vs_shot_roll4_sum] to float32.
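
A minimal sketch of that conversion, assuming the data is a pandas DataFrame. The helper `cast_to_float32` and the `demo` frame are hypothetical; with the real data the column list would be the five names from the error message.

```python
import pandas as pd


def cast_to_float32(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given columns cast to float32."""
    return df.astype({col: "float32" for col in columns})


# Hypothetical data: one ordinary float column, one nullable integer column.
demo = pd.DataFrame(
    {
        "odds_home_team_win": [1.8, 2.4, 3.1],
        "home_team_match_nr": pd.array([1, 2, None], dtype="Int64"),
    }
)

converted = cast_to_float32(demo, ["odds_home_team_win", "home_team_match_nr"])
print(converted.dtypes)
```

The original frame is left unchanged, so the converted copy is what would be passed to `FS.fit_transform`. Missing values in a nullable column become NaN after the cast.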

DanilZherebtsov, Mar 28 '23 09:03