FEA Add `TextEncoder`
Reference https://github.com/skrub-data/skrub/issues/1047
What does this PR propose?
This PR wraps the SentenceTransformers library, in the same fashion as ragger-duck and embetter do.
Following the good empirical baselines of arxiv.org/abs/2312.09634, this PR also proposes to:
- Add a PCA within the estimator, controlled by an `n_components` parameter. This also allows the class to stay consistent with the other skrub encoders and makes HP tuning easier. Following the paper, we set `n_components=30` by default.
- Add a normalization option, with a default of "l2". This simplifies cosine distance to dot products (see the small check below) and makes the embeddings comparable features for distance-based estimators (e.g., NearestNeighbors). This is useful when other normalized features are used (fuzzy joining).
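To make the normalization point concrete, here is a tiny self-contained check (purely illustrative, not part of the PR) that the dot product of two L2-normalized vectors equals their cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

# cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# after L2 normalization, a plain dot product gives the same value
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(cosine, a_n @ b_n)
```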
What the SentenceEncoder does not do:
- perform weight or embedding quantization (kept for later research)
- set torch threading, which doesn't interact well with joblib and raises an error if called twice.
- choose a different embedding Pooling scheme (only mean pooling is implemented in SentenceTransformer)
See the implementation of the paper for more details.
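For intuition, here is a rough sketch of the processing described above, written with off-the-shelf components rather than skrub's actual implementation (the model name and the toy input are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

texts = [f"some short text number {i}" for i in range(100)]

# embed each string with a pre-trained model from the Hugging Face hub
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# reduce to 30 dimensions (the proposed n_components default), then L2-normalize
reduced = PCA(n_components=30).fit_transform(embeddings)
features = normalize(reduced, norm="l2")
```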
Todo in another PR: enrich an existing example from skrub with the SentenceEncoder.
Pixi is extremely annoying. It has negative returns on productivity as far as I am concerned.
I have an intriguing segfault due to PyTorch that I can't reproduce locally with pixi or mamba. I'd be happy to get feedback on this. I suspect pixi doesn't download torch along with sentence-transformers from PyPI.
OK, so it seems that we would be able to make the CI work with sentence-transformers=2.X. Basically, the latest release has not been created on conda-forge, but I'm trying to work something out: https://github.com/conda-forge/sentence-transformers-feedstock
This CI effort is worth a couple of 🍻 (I still don't get how people deal with the DL stack without being an expert).
Woohoo! I'm buying the first round 🍻
There are a lot of codecov warnings on the _sentence_encoder.py file, but this is a display bug; the coverage is close to 100%.
We will be able to remove torchvision from the optional dependencies once @glemaitre's PR on conda-forge is merged. Having a dep on torchvision is not detrimental for this POC, though.
I also wonder about pickling. Maybe we should not store the model in an attribute but only store the name, and reload the model as necessary? Or, if we are worried about loading it many times, we can cache it in an attribute that does not get pickled and repopulate the cache when necessary.
cache it in an attribute that does not get pickled
This can be done by defining `__getstate__`, as shown here.
Here is a toy example with `functools.cached_property`:
encoder.py
import functools


class Encoder:
    def __init__(self, name):
        self.name = name

    def transform(self, x):
        return self._estimator(x)

    @functools.cached_property
    def _estimator(self):
        print(f'loading {self.name}')
        return lambda x: f"{self.name}({x})"

    def __getstate__(self):
        # drop the cached estimator so it is not included in the pickle
        state = self.__dict__.copy()
        if '_estimator' in state:
            del state['_estimator']
        return state
usage
>>> import pickle
>>> import encoder
>>> e = encoder.Encoder('supermodel')
>>> print(e.transform(7))
loading supermodel
supermodel(7)
the model is only loaded once
>>> print(e.transform(8))
supermodel(8)
pickling works
>>> s = pickle.dumps(e)
>>> e1 = pickle.loads(s)
after unpickling we reload the model
>>> print(e1.transform(9))
loading supermodel
supermodel(9)
>>> print(e1.transform(10))
supermodel(10)
the `_estimator` itself could not be pickled
>>> pickle.dumps(e._estimator)
Traceback (most recent call last):
...
AttributeError: Can't pickle local object 'Encoder._estimator.<locals>.<lambda>'
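Purely as an illustration (not the PR's implementation), the same pattern could look roughly like this for a sentence-transformers model; the class, parameter, and attribute names below are assumptions:

```python
import functools


class LazySentenceEncoder:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name

    @functools.cached_property
    def _model(self):
        # loaded lazily on first use, downloaded from the hub if needed
        from sentence_transformers import SentenceTransformer

        return SentenceTransformer(self.model_name)

    def transform(self, texts):
        # returns an (n_texts, embedding_dim) numpy array
        return self._model.encode(list(texts))

    def __getstate__(self):
        # drop the cached model so pickles stay small;
        # it is reloaded after unpickling
        state = self.__dict__.copy()
        state.pop("_model", None)
        return state
```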
TextEmbedding is nice; the only issue is that vectors and embeddings are close concepts, so people might not understand how it differs from other string-based encoders like MinHash. What about LLMEncoder?
Maybe DeepEncoder? I think if we put "LLM" in the name some users will think they can only use it if they have a big GPU
HuggingFaceEncoder? to be more explicit
I also wonder about pickling. Maybe we should not store the model in an attribute but only store the name, and reload the model as necessary? Or, if we are worried about loading it many times, we can cache it in an attribute that does not get pickled and repopulate the cache when necessary.
Why are you concerned about pickling? Note that loading using names wouldn't work for local models in another context. For example, if a user fine-tunes a sentence-transformer and gives the local path to the SentenceEncoder, pickling the SentenceEncoder and unpickling on, e.g., a remote server would fail because the model weights aren't there anymore. Users would have to ship the weights along with the pickle object and make sure the versions are correct, which is a bit burdensome.
I wonder if we could showcase it in an example? Or is it too computationally expensive to run on CircleCI?
Good question, I was thinking of introducing an example in another PR, but we can definitely do it here as well. I guess we should be fine, as the GitHub workers have 16GB of RAM. Downloading the weights might be a bit slow though.
I typically have a rule to never introduce a major feature without a corresponding example. I find that writing the example helps a lot with finding usability improvements.
Downloading the weights might be a bit slow though.
We can use the CircleCI cache so that downloading models does not happen often (if/when that turns out to be a problem).
Note that loading using names wouldn't work for local models in another context.
Yes, that's true; I was assuming the main use case would be to use a pre-trained model directly from the Hugging Face hub. The reason I wonder about pickling is (i) model size: I imagine we might end up with quite a few variants of a skrub pipeline, or fitted models for different splits, etc., all of them using the same Hugging Face model, so including a copy of it in each pickle might not be optimal. But most importantly, (ii) if the model has its weights on a GPU, what happens when we pickle it and unpickle it on a machine that does not have the same device? If we get an error, that could be a problem. The problem is we don't have a way to check this in the CI because the runners don't have a GPU. I'm also OK with saying we will investigate this further in another issue. Still, I would at least add a test to check that pickling and unpickling in the simple case (same machine, CPU, the only setup we have in the CI) works.
The problem is we don't have a way to check this in the CI because the runners don't have a GPU.
We actually have the tiny Jetson Nano at Inria, which comes with a GPU! Additionally, I can try using "mps" on my macOS arm64 machine.
Still, I would at least add a test to check that pickling and unpickling in the simple case (same machine, CPU, the only setup we have in the CI) works.
Right, let's add this.
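A minimal sketch of such a test (fit, pickle round-trip, then check the transforms agree); the class name `TextEncoder`, the import path, and the exact input type are assumptions about the final API:

```python
import pickle

import numpy as np
import pandas as pd


def test_pickle_roundtrip():
    from skrub import TextEncoder  # assumed import path and class name

    X = pd.Series(["one", "two", "three", "four"], name="text")
    encoder = TextEncoder(n_components=2).fit(X)

    # round-trip through pickle on the same (CPU-only) machine
    restored = pickle.loads(pickle.dumps(encoder))

    np.testing.assert_allclose(
        np.asarray(encoder.transform(X)), np.asarray(restored.transform(X))
    )
```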
Regarding loading, pickling, and memory usage, have you considered whether this is a case for the Borg pattern (refs: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch05s23.html https://stackoverflow.com/questions/1318406/why-is-the-borg-pattern-better-than-the-singleton-pattern-in-python )?
Thanks for the reference! If I understand correctly, this pattern could help maintain the same state across different instances, but I wonder how that could help in terms of memory. Say I have a model shared across different instances: if I run them in parallel, I guess the RAM consumption will still explode, right? I mean, I have to bring the matrices somewhere in memory to perform my parallel forward passes. I don't see how this could help with pickling, however. @jeromedockes WDYT?
I don't see how this could help with pickling, however. @jeromedockes WDYT?
Indeed, my question was not really about memory usage or sharing state between objects, but rather whether the model weights should be included in the pickle file. (i) If yes, then it cannot be unpickled on a machine that does not have the same device (without taking extra steps, AFAIK; I may very well be mistaken on this one), and the pickle may be much bigger than it needs to be; for example, I can see several skrub/scikit-learn pipelines re-using the same sentence encoder model. (ii) If no, then it cannot be unpickled on a machine that does not have a copy of the SentenceEncoder or a way to download it from the Hugging Face hub.
So neither option is perfect; we have to guess which is least likely to bother many users (and possibly add a parameter to let the user choose, in this PR or a later one).
@GaelVaroquaux, I'd like to merge this PR now and, instead of adding a separate example, include the example in the current GapEncoder one in a separate PR.
The goal would be to compare the TextEncoder, MinHashEncoder, and perhaps some LSA as well, using the toxicity dataset that Jovan shared yesterday. It's a small text dataset (about 1000 entries), but I think it's a good choice to illustrate the tradeoff between the TextEncoder's computation time and performance.
I'd like to merge this PR now and, instead of adding a separate example, include the example in the current GapEncoder one in a separate PR.
I would like to have a hard rule that no major feature gets merged without being in at least one example. In my experience this is good practice.
Okay, but I have to raise awareness that adding the dataset plus revamping the example might create a big PR and delay its merging.
Fair enough, but my experience is that often going through the process of the example forces us to rethink the functionality and thus is really useful.
As long as we don't have the example, we're not sure that it's ready for the user.
Thanks!!
going through the process of the example forces us to rethink the functionality and thus is really useful.
Got it, this sounds healthy indeed!
screenshot from the rendered example :sweat_smile: :
Ping @jovan-stojanovic :)
@GaelVaroquaux can we merge this one?
I'm having a look right now. I played with the example on my computer. It's really cool!
I've created a small PR to your branch here: https://github.com/Vincent-Maladiere/skrub/pull/1 If you're happy with it, can you merge it please?
Also, there is a conflict with main, which means that we cannot merge.