Idea: add an `UnstableLabelEncoder`

Open timvink opened this issue 2 years ago • 6 comments

Related to RareLabelEncoder, I wrote an UnstableLabelEncoder that groups categories that are unstable over time.

You define n_time_buckets (for example 5) and a time_variable. I then cut the time_variable into n_time_buckets buckets and, per variable and per category, look at the spread (the range between min and max) of the normalized value_counts across those buckets. If the spread is at or below the tolerance tol, the category is considered stable. Probably clearer in code:

X["tmp_time_bucket_id"] = pd.cut(X[self.time_variable], self.n_time_buckets, labels=False)

for var in self.variables_:
    if len(X[var].unique()) > self.n_categories:

        # per time bucket, the % observations per label
        t = X.groupby(["tmp_time_bucket_id"])[var].value_counts(normalize=True)
        # per label, find the spread (max - min) across the time buckets
        t = t.groupby(var).agg(np.ptp)

        # stable labels:
        freq_idx = t[t <= self.tol].index

So if the share of any category in a variable varies by more than 5 percentage points across the time buckets (when tol is 0.05), that category is considered unstable and is replaced with replace_with (defaults to Unstable).
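The snippet above depends on class attributes, so here is a minimal standalone sketch of the same idea on toy data. The function name `unstable_labels`, the column names and the data are made up for illustration and are not part of feature-engine. One assumption differs from the snippet: `pd.crosstab` is used so that a label absent from a bucket counts as a share of 0, which lets labels that appear or disappear over time show up as unstable.

```python
import numpy as np
import pandas as pd


def unstable_labels(X, var, time_variable, n_time_buckets=5, tol=0.05):
    """Return the labels of `var` whose share varies by more than `tol` over time."""
    # Cut the time variable into equal-width buckets (integer ids 0..n-1).
    bucket = pd.cut(X[time_variable], n_time_buckets, labels=False)
    # Rows = time buckets, columns = labels, values = share of the bucket's rows.
    # Labels that do not occur in a bucket get a share of 0 (assumption).
    shares = pd.crosstab(bucket, X[var], normalize="index")
    # Spread per label: max share minus min share across the buckets.
    spread = shares.max(axis=0) - shares.min(axis=0)
    return sorted(spread[spread > tol].index)


# Toy data: "sedan" is always half of the rows, "suv" stops at day 60, "ev" starts there.
day = np.arange(100)
body = np.where(day % 2 == 0, "sedan", np.where(day < 60, "suv", "ev"))
df = pd.DataFrame({"day": day, "body": body})

unstable = unstable_labels(df, "body", "day", n_time_buckets=5, tol=0.05)
# Replace unstable labels with a generic one, as replace_with would.
encoded = df["body"].where(~df["body"].isin(unstable), "Unstable")
```

Here `unstable` contains "ev" and "suv", and `encoded` only holds "sedan" and "Unstable".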

In a machine learning project you would probably set tol quite high, for example 0.50. That way, if one of a variable's categories starts (or stops) appearing at some point in time, you can avoid using it by folding it into a generic 'Other' or 'Missing' category (depending on how the replace_with parameter is set). This avoids overfitting to a specific time period, which would lead to an overconfident estimate of model performance.

The method is related to DropHighPSIFeatures: that one removes the entire feature if it's unstable over time, while UnstableLabelEncoder would remove a category in a feature if it's unstable over time.

I've already written the class, so happy to open a PR if you're interested in including it in feature-engine.

timvink, Aug 05 '22 14:08

I'm still playing with the metric to use for tol. Probably better to use the max absolute percentage difference from the mean. Easier to interpret: a feature category's proportion should not fluctuate more than xx% over time.
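One possible reading of that metric, sketched on made-up numbers: for each label, take its share per time bucket, subtract the label's mean share, and divide the largest absolute deviation by that mean. The function name `max_abs_pct_dev` and the table of shares are hypothetical, and other definitions of "percentage difference" are equally plausible.

```python
import pandas as pd


def max_abs_pct_dev(shares: pd.DataFrame) -> pd.Series:
    """Largest relative deviation from the mean share, per label.

    `shares` has one row per time bucket and one column per label.
    """
    mean = shares.mean(axis=0)
    # Absolute deviation from the label's mean share, per bucket.
    abs_dev = shares.sub(mean, axis=1).abs()
    # Largest deviation per label, expressed as a fraction of the mean.
    return abs_dev.max(axis=0) / mean


# Hypothetical shares over four time buckets (each row sums to 1).
shares = pd.DataFrame(
    {
        "sedan": [0.5, 0.5, 0.5, 0.5],
        "suv": [0.4, 0.2, 0.3, 0.3],
        "ev": [0.1, 0.3, 0.2, 0.2],
    }
)
dev = max_abs_pct_dev(shares)
```

With these numbers "sedan" does not fluctuate at all, "suv" deviates by about 33% of its mean share and "ev" by about 50%, so a tol of 0.4 would flag only "ev".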

timvink, Aug 05 '22 17:08

Hi @timvink

Thanks for the suggestion!

Did you create this method, or was it described somewhere else? If it was, would you be able to add some links for more information?

@gverbock what do you think about this suggestion?

solegalli, Aug 06 '22 13:08

> Did you create this method?

I did.

> Would you be able to add some links for more information?

I don't have them. I would need to spend more time looking for papers, running benchmarks and writing about experimental results.

Does it make sense to close this issue until I have more information?

timvink, Aug 08 '22 07:08

I would leave it open. And whenever you have the time to gather the information, just pin it here :)

solegalli, Aug 08 '22 08:08

What I find interesting in the discussion is how to deal with unstable categories in a feature. The DropHighPSI approach is designed to work with numeric variables, and the topic of categorical variables is not really addressed. In the current set-up it should be applied after OneHotEncoding, where it would remove the unstable encoded features. (@timvink long time no see).
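A rough sketch of that one-hot-then-PSI idea, using plain pandas and NumPy rather than feature-engine's own classes. The helper `binary_psi`, the toy data, the early/late split at day 50 and the 0.25 cut-off are all assumptions chosen for illustration; the PSI here treats the values 0 and 1 of each dummy column as the two bins.

```python
import numpy as np
import pandas as pd


def binary_psi(early: pd.Series, late: pd.Series, eps: float = 1e-4) -> float:
    """PSI of a 0/1 column between two periods, with 0 and 1 as the two bins."""
    # Bin proportions per period, clipped to avoid log(0) and division by zero.
    p = np.clip(np.array([1 - early.mean(), early.mean()]), eps, None)
    q = np.clip(np.array([1 - late.mean(), late.mean()]), eps, None)
    return float(np.sum((p - q) * np.log(p / q)))


# Toy data: "sedan" is stable, "suv" fades out at day 60, "ev" appears there.
day = np.arange(100)
body = np.where(day % 2 == 0, "sedan", np.where(day < 60, "suv", "ev"))
df = pd.DataFrame({"day": day, "body": body})

# One-hot encode, then compare the first half of the period with the second half.
dummies = pd.get_dummies(df["body"], dtype=float)
early, late = dummies[df["day"] < 50], dummies[df["day"] >= 50]
psi = {col: binary_psi(early[col], late[col]) for col in dummies.columns}

# Keep only the encoded columns whose PSI stays below the chosen cut-off.
keep = [col for col, value in psi.items() if value < 0.25]
```

On this data only the "sedan" column survives. Dropping a dummy column has a similar effect to folding that category into the baseline, which is close to what replacing it with a generic label would do.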

gverbock, Aug 08 '22 09:08

Good suggestion Gilles, I will experiment with OHE + DropHighPSI also. (Indeed long time, nice to run into you here!)

timvink, Aug 08 '22 14:08