Normalize before using LogisticRegression
Hi,
As far as I can see, SetFit applies LogisticRegression on top of the output of the sentence transformer model. See here:
https://github.com/huggingface/setfit/blob/7735e8e3b208edb8dfb549beb16e585453c5f44e/src/setfit/modeling.py#L49-L55
The problem I see is that this output is not normalized by default. Since we use cosine similarity to compare embeddings, the length of a vector does not matter. That is fine for cosine similarity, but IMO it is not fine when you apply LogisticRegression to the raw embeddings.
IMO the embeddings should be normalized to unit length before LogisticRegression is applied. That could be done by passing `normalize_embeddings=True` to the `encode` function. See here:
https://github.com/UKPLab/sentence-transformers/blob/0422a5e07a5a998948721dea435235b342a9f610/sentence_transformers/SentenceTransformer.py#L111-L118
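To illustrate what I mean, here is a minimal sketch (the checkpoint and toy data are just placeholders, not what SetFit uses internally):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder checkpoint and toy data, for illustration only.
body = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
train_texts = ["great product", "terrible support"]
train_labels = [1, 0]

# Normalize the embeddings to unit length before fitting the head.
X_train = body.encode(train_texts, normalize_embeddings=True)
head = LogisticRegression()
head.fit(X_train, train_labels)

X_test = body.encode(["works as advertised"], normalize_embeddings=True)
print(head.predict(X_test))
```

This way the head sees the same unit-length vectors that cosine similarity effectively compares during contrastive fine-tuning.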
What do you think? I can provide a PR if wanted.
That explains the messages I'm getting when using fit and then trying to predict:
```
_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```
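For reference, the two workarounds the warning suggests look roughly like this on the sklearn side (dummy embeddings stand in for the sentence-transformer output):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Dummy embeddings standing in for the sentence-transformer output.
X = np.random.randn(8, 768)
y = [0, 1] * 4

# Option 1: raise the iteration limit.
LogisticRegression(max_iter=1000).fit(X, y)

# Option 2: scale/normalize the features before the head.
make_pipeline(Normalizer(), LogisticRegression()).fit(X, y)
```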
When running

```python
df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())
```

this is the message:

```
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7856/2981536964.py in <module>
----> 1 df_ri_manclassif['predicted']= model(df_ri_manclassif['global_text'].to_list())

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\setfit\modeling.py in __call__(self, inputs)
     60     def __call__(self, inputs):
     61         embeddings = self.model_body.encode(inputs)
---> 62         return self.model_head.predict(embeddings)
     63
     64     def _save_pretrained(self, save_directory):

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
    445             Vector containing the class labels for each sample.
    446         """
--> 447         scores = self.decision_function(X)
    448         if len(scores.shape) == 1:
    449             indices = (scores > 0).astype(int)

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
    425         this class would be predicted.
    426         """
--> 427         check_is_fitted(self)
    428
    429         X = self._validate_data(X, accept_sparse="csr", reset=False)
...
-> 1345             raise NotFittedError(msg % {"name": type(estimator).__name__})
   1346
   1347
```
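For what it's worth, that traceback just says the LogisticRegression head was never fitted before `predict` was called; a tiny standalone reproduction of the same error:

```python
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

head = LogisticRegression()
try:
    # Calling predict() before fit() raises the same NotFittedError.
    head.predict([[0.1, 0.2, 0.3]])
except NotFittedError as e:
    print(e)  # e.g. "This LogisticRegression instance is not fitted yet. ..."
```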
Well, I made a branch where I can enable normalization.
I did some tests with default settings, a plain BERT model (not a pretrained sentence embedding model), and Optuna.
Letting Optuna also optimize the normalize parameter (True or False) shows that it is definitely better NOT to normalize.
This is strange since it means that the length of the vector seems to encode important information...
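A sketch of how such a search could be set up (not my actual branch; the dataset and checkpoint are placeholders):

```python
import optuna
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholders: any labelled text dataset and any plain BERT checkpoint.
texts = ["great product", "terrible support", "love it", "waste of money"] * 5
labels = [1, 0, 1, 0] * 5
body = SentenceTransformer("bert-base-uncased")

def objective(trial):
    # Let Optuna decide whether to normalize the embeddings to unit length.
    normalize = trial.suggest_categorical("normalize", [True, False])
    X = body.encode(texts, normalize_embeddings=normalize)
    return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params)
```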
@lewtun what is your opinion on this? Do you have more insights?
It seems SetFit already has something like this, but it is hidden in a script rather than being part of the main package...
https://github.com/huggingface/setfit/blob/43dbaf1a914a08ff8ef8ec836ddd51586d7881bb/scripts/setfit/run_fewshot.py#L118
This is implemented via #177 - closing this.