FEA Add `TextEncoder`
Reference https://github.com/skrub-data/skrub/issues/1047
What does this PR propose?
This PR wraps the SentenceTransformers library, in the same fashion as ragger-duck and embetter do.
Following the good empirical baselines of arxiv.org/abs/2312.09634, this PR also proposes to:
- Add a PCA within the estimator, controlled by an `n_components` parameter. This also allows the class to stay consistent with the other skrub encoders and makes HP tuning easier. Following the paper, we set `n_components=30` by default.
- Add a normalization option, with a default of "l2". This simplifies cosine distance to dot products (see the small check below) and makes the embeddings comparable features for distance-based estimators (e.g., NearestNeighbors). This is useful when other normalized features are used (fuzzy joining).
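To make the normalization point concrete, here is a tiny self-contained check (purely illustrative, not part of the PR) that the dot product of two L2-normalized vectors equals their cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

# cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# after L2 normalization, a plain dot product gives the same value
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(cosine, a_n @ b_n)
```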
What the SentenceEncoder does not do:
- perform weight or embedding quantization (kept for later research)
- set torch threading, which doesn't interact well with joblib and raises an error if called twice.
- choose a different embedding Pooling scheme (only mean pooling is implemented in SentenceTransformer)
See the implementation of the paper for more details.
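For intuition, here is a rough sketch of the processing described above, written with off-the-shelf components rather than skrub's actual implementation (the model name and the toy input are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

texts = [f"some short text number {i}" for i in range(100)]

# embed each string with a pre-trained model from the Hugging Face hub
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# reduce to 30 dimensions (the proposed n_components default), then L2-normalize
reduced = PCA(n_components=30).fit_transform(embeddings)
features = normalize(reduced, norm="l2")
```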
Todo in another PR: enrich an existing example from skrub with the SentenceEncoder.
Pixi is extremely annoying. It has negative returns on productivity as far as I am concerned.
I have an intriguing segfault due to PyTorch that I can't reproduce locally with pixi or mamba. I'd be happy to get feedback on this. I suspect pixi doesn't download torch along with sentence-transformers from PyPI.
OK, so it seems that we would be able to make the CI work with sentence-transformers=2.X. Basically, the latest release has not been created on conda-forge, but I'm trying to work something out: https://github.com/conda-forge/sentence-transformers-feedstock
This CI effort is worth a couple of 🍻 (I still don't get how people deal with the DL stack without being an expert).
Woohoo! I'm buying the first round 🍻
There are a lot of codecov warnings on the _sentence_encoder.py file, but this is a display bug; the coverage is close to 100%.
We will be able to remove torchvision from the optional dependencies once @glemaitre's PR on conda-forge is merged. Having a dep on torchvision is not detrimental for this POC, though.
I also wonder about pickling. Maybe we should not store the model in an attribute but only store the name, and reload the model as necessary? Or, if we are worried about loading it many times, we can cache it in an attribute that does not get pickled and repopulate the cache when necessary.
cache it in an attribute that does not get pickled
This can be done by defining `__getstate__`, as shown here.
Here is a toy example with `functools.cached_property`:
encoder.py
import functools


class Encoder:
    def __init__(self, name):
        self.name = name

    def transform(self, x):
        return self._estimator(x)

    @functools.cached_property
    def _estimator(self):
        print(f'loading {self.name}')
        return lambda x: f"{self.name}({x})"

    def __getstate__(self):
        # drop the cached estimator so it is not included in the pickle
        state = self.__dict__.copy()
        if '_estimator' in state:
            del state['_estimator']
        return state
usage
>>> import pickle
>>> import encoder
>>> e = encoder.Encoder('supermodel')
>>> print(e.transform(7))
loading supermodel
supermodel(7)
the model is only loaded once
>>> print(e.transform(8))
supermodel(8)
pickling works
>>> s = pickle.dumps(e)
>>> e1 = pickle.loads(s)
after unpickling we reload the model
>>> print(e1.transform(9))
loading supermodel
supermodel(9)
>>> print(e1.transform(10))
supermodel(10)
the `_estimator` itself could not be pickled
>>> pickle.dumps(e._estimator)
Traceback (most recent call last):
...
AttributeError: Can't pickle local object 'Encoder._estimator.<locals>.<lambda>'
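Purely as an illustration (not the PR's implementation), the same pattern could look roughly like this for a sentence-transformers model; the class, parameter, and attribute names below are assumptions:

```python
import functools


class LazySentenceEncoder:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name

    @functools.cached_property
    def _model(self):
        # loaded lazily on first use, downloaded from the hub if needed
        from sentence_transformers import SentenceTransformer

        return SentenceTransformer(self.model_name)

    def transform(self, texts):
        # returns an (n_texts, embedding_dim) numpy array
        return self._model.encode(list(texts))

    def __getstate__(self):
        # drop the cached model so pickles stay small;
        # it is reloaded after unpickling
        state = self.__dict__.copy()
        state.pop("_model", None)
        return state
```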
TextEmbedding is nice; the only issue is that vectors and embeddings are close concepts, so people might not understand how it differs from other string-based encoders like MinHash. What about LLMEncoder?
Maybe DeepEncoder? I think if we put "LLM" in the name some users will think they can only use it if they have a big GPU
HuggingFaceEncoder? to be more explicit
I also wonder about pickling. Maybe we should not store the model in an attribute but only store the name, and reload the model as necessary? Or, if we are worried about loading it many times, we can cache it in an attribute that does not get pickled and repopulate the cache when necessary.
Why are you concerned about pickling? Note that loading using names wouldn't work for local models in another context. For example, if a user fine-tunes a sentence-transformer and gives the local path to the SentenceEncoder, pickling the SentenceEncoder and unpickling on, e.g., a remote server would fail because the model weights aren't there anymore. Users would have to ship the weights along with the pickle object and make sure the versions are correct, which is a bit burdensome.
I wonder if we could showcase it in an example? Or is it too computationally expensive to run on CircleCI?
Good question, I was thinking of introducing an example in another PR, but we can definitely do it here as well. I guess we should be fine, as the GitHub workers have 16GB of RAM. Downloading the weights might be a bit slow though.
I typically have a rule to never introduce a major feature without a corresponding example. I find that writing the example helps a lot with finding usability improvements.
Downloading the weights might be a bit slow though.
We can use the CircleCI cache so that downloading models does not happen often (if/when that turns out to be a problem).
Note that loading using names wouldn't work for local models in another context.
Yes, that's true; I was assuming the main use case would be to use a pre-trained model directly from the Hugging Face hub. The reason I wonder about pickling is (i) model size: I imagine we might end up with quite a few variants of a skrub pipeline, or fitted models for different splits, etc., all of them using the same Hugging Face model, so including a copy of it in each pickle might not be optimal. But most importantly, (ii) if the model has its weights on a GPU, what happens when we pickle it and unpickle it on a machine that does not have the same device? If we get an error, that could be a problem. The problem is we don't have a way to check this in the CI because the runners don't have a GPU. I'm also OK with saying we will investigate this further in another issue. Still, I would at least add a test to check that pickling and unpickling in the simple case (same machine, CPU, the only setup we have in the CI) works.
The problem is we don't have a way to check this in the CI because the runners don't have a GPU.
We actually have the tiny Jetson Nano at Inria, which comes with a GPU! Additionally, I can try using "mps" on my macOS arm64 machine.
Still, I would at least add a test to check that pickling and unpickling in the simple case (same machine, CPU, the only setup we have in the CI) works.
Right, let's add this.
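A minimal sketch of such a test (fit, pickle round-trip, then check the transforms agree); the class name `TextEncoder`, the import path, and the exact input type are assumptions about the final API:

```python
import pickle

import numpy as np
import pandas as pd


def test_pickle_roundtrip():
    from skrub import TextEncoder  # assumed import path and class name

    X = pd.Series(["one", "two", "three", "four"], name="text")
    encoder = TextEncoder(n_components=2).fit(X)

    # round-trip through pickle on the same (CPU-only) machine
    restored = pickle.loads(pickle.dumps(encoder))

    np.testing.assert_allclose(
        np.asarray(encoder.transform(X)), np.asarray(restored.transform(X))
    )
```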
Regarding loading, pickling, and memory usage, have you considered whether this is a case for the Borg pattern (refs: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch05s23.html https://stackoverflow.com/questions/1318406/why-is-the-borg-pattern-better-than-the-singleton-pattern-in-python )?
Thanks for the reference! If I understand correctly, this pattern could help maintain the same state across different instances, but I wonder how that could help in terms of memory. Say I have a model shared across different instances: if I run them in parallel, I guess the RAM consumption will still explode, right? I mean, I have to bring the matrices somewhere in memory to perform my parallel forward passes. I don't see how this could help with pickling, however. @jeromedockes WDYT?
I don't see how this could help with pickling, however. @jeromedockes WDYT?
Indeed, my question was not really about memory usage or sharing state between objects, but rather whether the model weights should be included in the pickle file. (i) If yes, then it cannot be unpickled on a machine that does not have the same device (without taking extra steps, AFAIK; I may very well be mistaken on this one), and the pickle may be much bigger than it needs to be; for example, I can see several skrub/scikit-learn pipelines re-using the same sentence encoder model. (ii) If no, then it cannot be unpickled on a machine that does not have a copy of the SentenceEncoder or a way to download it from the Hugging Face hub.
So neither option is perfect; we have to guess which is least likely to bother many users (and possibly add a parameter to let the user choose, in this PR or a later one).
@GaelVaroquaux, I'd like to merge this PR now and, instead of adding a separate example, include the example in the current GapEncoder one in a separate PR.
The goal would be to compare the TextEncoder, MinHashEncoder, and perhaps some LSA as well, using the toxicity dataset that Jovan shared yesterday. It's a small text dataset (about 1000 entries), but I think it's a good choice to illustrate the tradeoff between the TextEncoder's computation time and performance.
I'd like to merge this PR now and, instead of adding a separate example, include the example in the current GapEncoder one in a separate PR.
I would like to have a hard rule that no major feature gets merged without being in at least one example. In my experience this is good practice.
Okay, but I have to raise awareness that adding the dataset plus revamping the example might create a big PR and delay its merging.
Fair enough, but my experience is that often going through the process of the example forces us to rethink the functionality and thus is really useful.
As long as we don't have the example, we're not sure that it's ready for the user.
Thanks!!
going through the process of the example forces us to rethink the functionality and thus is really useful.
Got it, this sounds healthy indeed!
screenshot from the rendered example :sweat_smile: :
Ping @jovan-stojanovic :)
@GaelVaroquaux can we merge this one?
I'm having a look right now. I played with the example on my computer. It's really cool!
I've created a small PR to your branch here: https://github.com/Vincent-Maladiere/skrub/pull/1 If you're happy with it, can you merge it please?
Also, there is a conflict with main, which means that we cannot merge.