shuffle categoricals before binning

Open mbernico opened this issue 8 years ago • 2 comments

For categoricals like [jan, feb, mar, ..., dec] we should shuffle the labels before applying the binning. Because the labels are assigned in order to the bins of a Gaussian column, the ordinal structure of the categorical makes the problem 'too easy.'

mbernico, Mar 22 '17 20:03

Maybe I misunderstand, but when binning a categorical, won't a LabelEncoder implicitly "shuffle" it, in the sense that the labels will be ordered alphabetically internally? Maybe I'm missing the use case.
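To make the alphabetical-ordering point concrete, here is a minimal sketch in plain Python. It assumes the encoder assigns integer codes in sorted order, which is how scikit-learn's LabelEncoder orders its `classes_`; the variable names are illustrative only.

```python
# Month labels in calendar order, as in the issue's example.
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Assumption: codes are assigned in sorted (alphabetical) order,
# mimicking what a label encoder does internally.
alpha_code = {label: code for code, label in enumerate(sorted(months))}

# Read the codes back in calendar order: they are no longer monotonic.
codes_in_calendar_order = [alpha_code[m] for m in months]
print(codes_in_calendar_order)
# → [4, 3, 7, 0, 8, 6, 5, 1, 11, 10, 9, 2]
```

An alphabetical encoding therefore does scramble the calendar order, but it does nothing to stop someone from writing the calendar mapping (1=jan, 2=feb, ...) by hand.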

tgsmith61591, Mar 22 '17 20:03

I might not be thinking about this completely right either. I experienced a situation where some students were able to numerically encode a categorical in a snape dataset column (1=jan, 2=feb), and the results were superior to something like one-hot encoding. I think it might be unrealistically easy because of how we do create_categoricals. All snape columns are normal distributions (probably another weakness), so it might be better to shuffle labels that happen to be ordinal, so that this strategy works less well.

Consider this: label_list = [[jan, feb, mar, ..., dec]]

def create_categorical_features(df, label_list, random_state=None):
    # stuff happens that chooses a random numerical column called 'chosen_col' and then runs
    df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])
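A small sketch of why this is "too easy", assuming a single standard-normal column named `x` (a hypothetical name, not snape's) and the full twelve-month label list. With labels passed to `pd.cut` in calendar order, the hand-made encoding 1=jan, 2=feb, ... is a monotone step function of the underlying numeric column.

```python
import numpy as np
import pandas as pd

months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Hypothetical stand-in for a snape numeric column: one Gaussian feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000)})

# Equal-width bins over the Gaussian column, labeled in calendar order.
df["month"] = pd.cut(df["x"], bins=len(months), labels=months)

# The hand-made encoding described in the thread: 1=jan, 2=feb, ...
calendar_code = {m: i + 1 for i, m in enumerate(months)}
encoded = df["month"].astype(str).map(calendar_code)

# The codes rise with x, so the encoded column is strongly correlated
# with the numeric column it was binned from.
corr = np.corrcoef(df["x"], encoded)[0, 1]
print(corr)
```

In this setup the ordinal encoding recovers most of the original signal, which one-hot encoding cannot express in a single feature.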

What I'm thinking is that it might be more difficult/realistic to do:

def create_categorical_features(df, label_list, random_state=None):
    # stuff happens that chooses a random numerical column called 'chosen_col' and then runs
    shuffle(label_list[0])
    df[chosen_col] = pd.cut(df[chosen_col], bins=len(label_list[0]), labels=label_list[0])
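One way the shuffled variant could look as runnable code is sketched below. This is not snape's actual implementation: the helper name `bin_with_shuffled_labels` is hypothetical, the column is passed in explicitly rather than chosen at random, and the labels are permuted with a NumPy generator seeded from `random_state` so the result is reproducible and the caller's list is not modified.

```python
import numpy as np
import pandas as pd


def bin_with_shuffled_labels(df, chosen_col, labels, random_state=None):
    """Bin a numeric column into len(labels) equal-width bins, with the
    labels assigned to bins in a random order (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    # Permute a copy so the caller's label list is left untouched.
    shuffled = [str(label) for label in rng.permutation(labels)]
    out = df.copy()
    out[chosen_col] = pd.cut(out[chosen_col], bins=len(shuffled), labels=shuffled)
    return out


months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000)})

binned = bin_with_shuffled_labels(df, "x", months, random_state=42)
# Bin labels from the lowest bin to the highest bin.
print(list(binned["x"].cat.categories))
```

With the labels permuted, the calendar encoding 1=jan, 2=feb, ... no longer lines up with bin order, so it should generally carry less of the original signal than it does with ordered labels.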

Please tell me if I'm wrong though, as always!

mbernico, Mar 22 '17 22:03