
Normalizing the Gap, Text, and String Encoders

GaelVaroquaux opened this issue 9 months ago • 5 comments

In many (but not all) learners, the Euclidean dot product and the Euclidean distance between samples are what matter. For instance, a kNN retrieves the closest samples in the l2-norm sense.

I believe that in such settings it makes sense to normalize the various string encoders (in particular the Gap, Text, and String encoders), so that the average distance between their encodings has the same mean value. Indeed, if one encoder produces much larger values, it dominates the choice of the nearest neighbor.

The mean pairwise distance is quadratic in the number of samples and thus costly to compute. A good proxy (that is also meaningful) is the average squared norm around the mean: over all ordered pairs of samples, the mean squared pairwise distance is exactly twice this quantity.
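
A quick numerical check of that identity (on synthetic data, not the dataset below):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

# mean over all n**2 ordered pairs of squared Euclidean distances
mean_sq_dist = np.mean(pairwise_distances(X) ** 2)
# average squared deviation from the mean (the "norm" used below)
avg_sq_norm = np.sum((X - X.mean(axis=0)) ** 2) / X.shape[0]
print(mean_sq_dist, 2 * avg_sq_norm)  # equal up to floating-point error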

I wrote a small script to investigate the problem:

# %%
# Our data
from skrub import datasets
data = datasets.fetch_employee_salaries()
df = data.X[['employee_position_title', 'year_first_hired']]


# %%
# Look at distances induced by the different encoders
import numpy as np
import skrub as skb
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

encoders = [skb.StringEncoder(),
            skb.TextEncoder(),
            skb.GapEncoder(),
        ]

encoder_distance = {}
encoder_norm = {}

for encoder in encoders:
    # Vectorize the data with the TableVectorizer
    tab_vec = skb.TableVectorizer(
            high_cardinality=encoder,
            numeric=StandardScaler())
    X = tab_vec.fit_transform(df)

    # Compute the pairwise distance distribution, looking only at the
    # 'year_first_hired' column, then at the other (encoded) columns
    X_year = X[['year_first_hired']]
    X_other = X.drop(columns='year_first_hired')
    year_dist = pairwise_distances(X_year, metric='euclidean')
    # keep only the lower triangle, excluding the zero diagonal
    year_dist = year_dist[np.tril_indices_from(year_dist, k=-1)]

    # same thing for the other columns
    other_dist = pairwise_distances(X_other, metric='euclidean')
    other_dist = other_dist[np.tril_indices_from(other_dist, k=-1)]

    encoder_distance[encoder.__class__.__name__] = other_dist

    # Look at the norm (easier to compute, because not quadratic):
    # the average squared deviation from the column-wise mean
    X_year = X_year.values  # convert to numpy
    year_norm = np.sum((X_year - np.mean(X_year, axis=0))**2) / X.shape[0]
    X_other = X_other.values  # convert to numpy
    other_norm = np.sum((X_other - np.mean(X_other, axis=0))**2) / X.shape[0]
    encoder_norm[encoder.__class__.__name__] = other_norm

# %%
from matplotlib import pyplot as plt
m_dist = np.mean(year_dist)
plt.hist(year_dist,
         label=(f"year column, standardized\n"
                f"mean distance={m_dist:.2f}\n"
                f"norm={year_norm:.2f}"))

for name, values in encoder_distance.items():
    m_dist = np.mean(values)
    plt.hist(values,
             histtype='step',
             label=(f"{name} on employee_position_title column\n"
                    f"mean={m_dist:.2f}\n"
                    f"norm={encoder_norm[name]:.2f}"))
plt.xlabel("Pairwise distance")
plt.legend()

It shows the problem, and that it is particularly marked for the GapEncoder:

[Figure: overlaid histograms of the pairwise-distance distributions, for the standardized year column and for each encoder's output on the employee_position_title column]

Solution: I think it would be good to address this by computing, during fit, a constant factor (the square root of the norm above) by which to divide the output at transform time.
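
As a rough sketch of what I mean (the BlockNormalizedEncoder wrapper below is only an illustration, not skrub API; in skrub the factor would presumably live inside each encoder rather than in a wrapper):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class BlockNormalizedEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, encoder):
        self.encoder = encoder

    def fit_transform(self, X, y=None):
        self.encoder_ = clone(self.encoder)
        Xt = np.asarray(self.encoder_.fit_transform(X, y), dtype=float)
        # Constant factor learned at fit time: sqrt of the average squared
        # deviation from the column-wise mean, over the whole encoded block
        self.scale_ = np.sqrt(np.sum((Xt - Xt.mean(axis=0)) ** 2) / Xt.shape[0])
        return Xt / self.scale_

    def fit(self, X, y=None):
        self.fit_transform(X, y)
        return self

    def transform(self, X):
        return np.asarray(self.encoder_.transform(X), dtype=float) / self.scale_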

GaelVaroquaux avatar Mar 15 '25 13:03 GaelVaroquaux

Hey @GaelVaroquaux, just to better understand your point, why couldn't we perform row-wise l2 normalization during prediction, like sklearn's TfidfVectorizer?

Vincent-Maladiere avatar Apr 02 '25 07:04 Vincent-Maladiere

No, this should be block-wise; row-wise doesn't make sense here. Think, for instance, of a null value represented as a bunch of zeros: rescaling that row does not make sense.
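
As a made-up illustration of the kind of problem this causes: a row with tiny (but not exactly zero) activations is inflated to unit norm by row-wise scaling, whereas dividing the whole block by a single factor preserves the relative magnitudes:

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1e-6, 1e-6, 1e-6],   # (almost) null entry
              [3.0, 0.0, 4.0]])

# Row-wise l2 normalization: the near-zero row is blown up to unit norm
print(normalize(X))

# Block-wise scaling: one constant for the whole block, so relative
# magnitudes between rows are preserved
scale = np.sqrt(np.sum((X - X.mean(axis=0)) ** 2) / X.shape[0])
print(X / scale)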


GaelVaroquaux avatar Apr 02 '25 07:04 GaelVaroquaux

Row-wise doesn't make sense here

I confess I don't fully understand 😅. Why is it done in TfidfVectorizer instead of block-wise? Regarding zeros, using normalize returns a zero vector, which is what I would expect:

import numpy as np
from sklearn.preprocessing import normalize

normalize(np.array([[0, 0, 0]]))
# array([[0., 0., 0.]])

Vincent-Maladiere avatar Apr 02 '25 08:04 Vincent-Maladiere

I confess I don't fully understand 😅. Why is it done in TfidfVectorizer instead of block-wise?

The whole goal of the TfidfVectorizer is pretty much to do this normalization; it is part of the logic of that model.

Another reason I don't want to do this is that it would move quite markedly away from the GapEncoder model as published. We would probably have to rename it to avoid confusion, and we would have to revalidate it (which is a huge amount of work), whereas a global normalization is a minor change.

GaelVaroquaux avatar Apr 02 '25 09:04 GaelVaroquaux

Ok, got it! Thanks for giving your thoughts :)

Vincent-Maladiere avatar Apr 02 '25 10:04 Vincent-Maladiere

Closed by #1274

rcap107 avatar Aug 18 '25 19:08 rcap107