
[FEATURE] Ability to stratify on columns that contain some NaN values, so people can hyperparameter-tune the best imputation method

Open dec1costello opened this issue 1 year ago • 1 comments

Hello!

  • [ ] I have a training pipeline that hyperparameter-tunes the best imputation method
  • [ ] My pipeline fails because sklearn's train_test_split(stratify=stratify_data) does not work when the stratification columns contain NaN values
  • [ ] Curious whether this seems like a scikit-lego feature people would want

Here's my attempt to stratify on columns with some NaNs, for more context. I am a beginner, so I'm open to better ideas, or to comments if this feature request is out of scope. Thanks in advance! I appreciate everyone's contributions to this package!

Strat attempt:

from sklearn.model_selection import train_test_split

X = result_df[feature_cols]
y = result_df['strokes_to_hole_out']

# Extract the columns used for stratification
stratify_cols = ['from_location_scorer', 'from_location_laser']
stratify_data = result_df[stratify_cols]

# Split the data, using 'stratify_data' for stratification
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=stratify_data
)

Error I receive during training: Trial failed with exception: Found unknown categories ['blue'] in column 9 during transform
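For illustration, one possible workaround (a sketch, not an existing scikit-lego or scikit-learn feature) is to build a separate stratification key in which NaN is treated as its own category, while leaving the NaNs in X untouched so that imputation can still be tuned downstream. The helper name make_stratify_key, the "__missing__" label, and the toy data below are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_stratify_key(df, cols, missing_label="__missing__"):
    """Hypothetical helper: one string label per row, with NaN as its own category."""
    # Replace missing values with a sentinel and cast everything else to str.
    filled = df[cols].apply(
        lambda s: s.map(lambda v: missing_label if pd.isna(v) else str(v))
    )
    # Join the per-column labels into a single key per row.
    return filled.agg("|".join, axis=1)


# Toy data (made up) with NaNs in both stratification columns.
demo = pd.DataFrame({
    "from_location_scorer": ["green", "green", np.nan, np.nan, "rough", "rough"] * 5,
    "from_location_laser": ["a", "a", "b", "b", np.nan, np.nan] * 5,
    "strokes_to_hole_out": range(30),
})
stratify_cols = ["from_location_scorer", "from_location_laser"]
key = make_stratify_key(demo, stratify_cols)

# The key is only used for splitting; X keeps its NaNs for later imputation.
X_train, X_valid, y_train, y_valid = train_test_split(
    demo[stratify_cols],
    demo["strokes_to_hole_out"],
    test_size=0.2,
    random_state=42,
    stratify=key,
)
```

Note that each combined label still needs at least two rows for a stratified split to be possible, so rare combinations may have to be merged into a broader bucket first.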

dec1costello avatar Jun 26 '24 01:06 dec1costello

Hey @dec1costello, thanks for the feature request. I have a few questions:

  • Could you provide some minimal input data?
  • Could you provide some minimal expected output data?
  • The error seems to be related to a transformer failing in the .transform(X_valid) step. How would the proposal fix that?
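As a sketch of the last point: the reported message is consistent with what scikit-learn's OneHotEncoder raises when transform meets a category that was absent during fit. Which encoder the original pipeline uses is not stated, so the encoder choice and the toy arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): 'blue' appears only in the validation split.
train = np.array([["red"], ["green"]], dtype=object)
valid = np.array([["blue"]], dtype=object)

# Default behaviour: unknown categories raise at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(valid)
    strict_failed = False
    message = ""
except ValueError as exc:
    strict_failed = True
    message = str(exc)

# With handle_unknown="ignore", an unseen category is encoded as all zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(valid).toarray()
```

If this is indeed the cause, stratifying the split makes unseen categories less likely in the validation set but does not rule them out, so configuring the encoder may still be needed.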

FBruzzesi avatar Jun 26 '24 07:06 FBruzzesi