Normalize before using LogisticRegression
Hi,
As far as I can see, SetFit applies LogisticRegression on top of the output of the sentence transformer model. See here:
https://github.com/huggingface/setfit/blob/7735e8e3b208edb8dfb549beb16e585453c5f44e/src/setfit/modeling.py#L49-L55
The problem I see is that this output is not normalized by default. Since we use cosine similarity to compare embeddings, the length of a vector does not matter. That is fine for cosine similarity, but IMO it is not fine when you apply LogisticRegression to the raw embeddings.
IMO the embeddings should be normalized to unit length before LogisticRegression is applied. That could be done by passing `normalize_embeddings=True` to the `encode` function. See here:
https://github.com/UKPLab/sentence-transformers/blob/0422a5e07a5a998948721dea435235b342a9f610/sentence_transformers/SentenceTransformer.py#L111-L118
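To illustrate what I mean, here is a minimal sketch (the checkpoint and toy data are just placeholders, not what SetFit uses internally):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder checkpoint and toy data, for illustration only.
body = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
train_texts = ["great product", "terrible support"]
train_labels = [1, 0]

# Normalize the embeddings to unit length before fitting the head.
X_train = body.encode(train_texts, normalize_embeddings=True)
head = LogisticRegression()
head.fit(X_train, train_labels)

X_test = body.encode(["works as advertised"], normalize_embeddings=True)
print(head.predict(X_test))
```

This way the head sees the same unit-length vectors that cosine similarity effectively compares during contrastive fine-tuning.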
What do you think? I can provide a PR if wanted.
That explains the messages I'm getting when using fit and then trying to predict:
```
_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```
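For reference, the two workarounds the warning suggests look roughly like this on the sklearn side (dummy embeddings stand in for the sentence-transformer output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Dummy embeddings standing in for the sentence-transformer output.
X = np.random.randn(8, 768)
y = [0, 1] * 4

# Option 1: raise the iteration limit.
LogisticRegression(max_iter=1000).fit(X, y)

# Option 2: scale/normalize the features before the head.
make_pipeline(Normalizer(), LogisticRegression()).fit(X, y)
```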
When running

```python
df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())
```

this is the message:

```
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7856/2981536964.py in <module>
----> 1 df_ri_manclassif['predicted']= model(df_ri_manclassif['global_text'].to_list())

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\setfit\modeling.py in __call__(self, inputs)
     60     def __call__(self, inputs):
     61         embeddings = self.model_body.encode(inputs)
---> 62         return self.model_head.predict(embeddings)
     63
     64     def _save_pretrained(self, save_directory):

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
    445             Vector containing the class labels for each sample.
    446         """
--> 447         scores = self.decision_function(X)
    448         if len(scores.shape) == 1:
    449             indices = (scores > 0).astype(int)

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
    425         this class would be predicted.
    426         """
--> 427         check_is_fitted(self)
    428
    429         X = self._validate_data(X, accept_sparse="csr", reset=False)
...
-> 1345             raise NotFittedError(msg % {"name": type(estimator).__name__})
   1346
   1347
```
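For what it's worth, that traceback just says the LogisticRegression head was never fitted before `predict` was called; a tiny standalone reproduction of the same error:

```python
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

head = LogisticRegression()
try:
    # Calling predict() before fit() raises the same NotFittedError.
    head.predict([[0.1, 0.2, 0.3]])
except NotFittedError as e:
    print(e)  # e.g. "This LogisticRegression instance is not fitted yet. ..."
```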
Well, I made a branch where I can enable normalization.
I did some tests with default settings, a plain BERT model (not a pretrained sentence embedding model), and Optuna.
Letting Optuna also optimize the normalize parameter (True or False) shows that it is definitely better NOT to normalize.
This is strange since it means that the length of the vector seems to encode important information...
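A sketch of how such a search could be set up (not my actual branch; the dataset and checkpoint are placeholders):

```python
import optuna
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholders: any labelled text dataset and any plain BERT checkpoint.
texts = ["great product", "terrible support", "love it", "waste of money"] * 5
labels = [1, 0, 1, 0] * 5
body = SentenceTransformer("bert-base-uncased")

def objective(trial):
    # Let Optuna decide whether to normalize the embeddings to unit length.
    normalize = trial.suggest_categorical("normalize", [True, False])
    X = body.encode(texts, normalize_embeddings=normalize)
    return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params)
```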
@lewtun what is your opinion on this? Do you have more insights?
It seems SetFit already has something like this, but it is hidden in a script rather than being part of the main package...
https://github.com/huggingface/setfit/blob/43dbaf1a914a08ff8ef8ec836ddd51586d7881bb/scripts/setfit/run_fewshot.py#L118
This is implemented via #177 - closing this.