scikit-lego
[FEATURE] Ability to stratify on columns that contain some NaN values, so that people can hyperparameter-tune the best imputation method
Hello!
- [ ] I have a training pipeline that hyperparameter tunes the best imputation method
- [ ] My pipeline fails because sklearn's train_test_split(stratify=stratify_data) does not work well with stratification columns containing NaN values
- [ ] Curious if this seems like a scikit-lego feature people would want
For more context, here's my attempt to stratify on columns with some NaNs. I am a beginner, so I'm open to better ideas, or to comments if this feature request is out of scope. Thanks in advance! I appreciate everyone's contributions to this package!
Stratification attempt:
from sklearn.model_selection import train_test_split

# Features and target (result_df and feature_cols are defined earlier in the pipeline)
X = result_df[feature_cols]
y = result_df['strokes_to_hole_out']
# Extract the columns used for stratification (these contain some NaN values)
stratify_cols = ['from_location_scorer', 'from_location_laser']
stratify_data = result_df[stratify_cols]
# Split the data, using 'stratify_data' for stratification
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=stratify_data
)
Error I receive during training: Trial failed with exception: Found unknown categories ['blue'] in column 9 during transform
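One possible workaround for the stratification part, sketched here under assumptions: treat NaN as its own category and collapse the stratification columns into a single key, which is then passed to `stratify`. The helper `make_stratify_key`, the `__missing__` label, and the toy data are hypothetical illustrations, not part of scikit-lego or scikit-learn; only the column names come from the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_stratify_key(df, cols, missing_label="__missing__"):
    """Hypothetical helper: build one label per row, with NaN as its own category."""
    subset = df[cols]
    # Replace NaN with an explicit placeholder so every row has a usable label.
    filled = subset.astype("object").where(subset.notna(), missing_label)
    # Join the per-column labels into a single string key per row.
    return filled.astype(str).apply(lambda row: "|".join(row), axis=1)


# Toy data (made up for illustration): three groups of 20 rows each.
toy = pd.DataFrame({
    "from_location_scorer": ["green", "green", np.nan, np.nan, "rough", "rough"] * 10,
    "from_location_laser": ["a", "a", "b", "b", np.nan, np.nan] * 10,
    "strokes_to_hole_out": np.arange(60) % 4 + 1,
})

key = make_stratify_key(toy, ["from_location_scorer", "from_location_laser"])
train_df, valid_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=key
)
```

Because the key contains no NaN, the split sees "missing" as just another class, and the NaN rows are spread across train and validation in the same proportion as the other groups. Every combined class still needs at least two rows for stratification to work.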
Hey @dec1costello, thanks for the feature request. I have a few questions:
- Could you provide some minimal input data?
- Could you provide some minimal expected output data?
- The error seems to be related to a transformer failing in the `.transform(X_valid)` step. How would the proposal fix that?
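For reference, an error of this form can be reproduced without any stratification involved. The sketch below assumes the failing transformer is a `OneHotEncoder` with default settings (the thread does not confirm which transformer is used); the `color` column and its values other than `'blue'` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Assumed toy split: 'blue' appears only in the validation data.
X_train = pd.DataFrame({"color": ["red", "green", "red"]})
X_valid = pd.DataFrame({"color": ["green", "blue"]})

# Default behaviour: categories unseen during fit raise an error at transform time.
strict = OneHotEncoder().fit(X_train)
try:
    strict.transform(X_valid)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# handle_unknown="ignore" encodes unseen categories as an all-zero row instead.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(X_train)
encoded = tolerant.transform(X_valid).toarray()
```

If that assumption holds, the failure comes from a category landing only in the validation split, which stratifying on that column (or setting `handle_unknown`) may address independently of how NaN values are treated.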