Using SetFit Embeddings for Semantic Search?
Hi,
I was wondering whether semantic search would improve if one trained a multi-label classification model and used its embeddings.
After training a binary classification model, I have seen that the embeddings of similar topics are much closer with all-MiniLM-L12-v2-setfit (the fitted model) than with the base all-MiniLM-L12-v2, which makes sense to me.
# Cosine similarity between two embedding vectors
from scipy import spatial

def get_cosine_similarity(vector1, vector2):
    return 1 - spatial.distance.cosine(vector1, vector2)

word_1 = "acne"
word_2 = "red skin"

# Fitted SetFit body vs. the base SentenceTransformer
# (encode a single string so the result is a 1-D vector)
emb_fit_1 = model.model_body.encode(word_1)
emb_fit_2 = model.model_body.encode(word_2)
emb_base_1 = model_sbert.encode(word_1)
emb_base_2 = model_sbert.encode(word_2)

print(f"{word_1} vs {word_2} (base)", get_cosine_similarity(emb_base_1, emb_base_2))
print(f"{word_1} vs {word_2} (fit)", get_cosine_similarity(emb_fit_1, emb_fit_2))
acne vs pimple (base) 0.5959747433662415
acne vs pimple (fit) 0.9996786117553711
acne vs red skin (base) 0.36421263217926025
acne vs red skin (fit) 0.9994498491287231
acne vs red car (base) 0.17558744549751282
acne vs red car (fit) 0.0051751588471233845
I would assume that if the model is trained on a multi-label classification task, the embeddings would somehow be clustered based on the labels provided during training. Would that improve semantic search if enough labels are provided during training?
Of course I could train a model and test it myself, but maybe you have done similar tests and already know whether it works :-)
Thanks!
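For context, a minimal sketch of how such embeddings could drive a semantic search: rank a corpus by cosine similarity to the query vector (pure NumPy; the `encode` calls from the snippet above would supply the real vectors, toy 2-D vectors stand in here):

```python
import numpy as np

def rank_by_cosine(query_vec, corpus_vecs):
    """Return corpus indices sorted by cosine similarity to the query (best first)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims), sims

# Toy stand-ins for model.model_body.encode(...) outputs
corpus = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([0.9, 0.1])

order, sims = rank_by_cosine(query, corpus)
print(order)  # nearest corpus row first
```

Normalizing once and using a matrix product keeps this fast enough to score a whole corpus per query.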
I am very interested in this topic too - planning to use only the fine-tuning part and use the embeddings for semantic search. Any thoughts?
I have reduced the dimensions with UMAP and visualized the embeddings of the training set with all-MiniLM-L12-v2 vs all-MiniLM-L12-v2-setfit (the fitted model). Then I highlighted every text which includes "acne" or "pimple". The green points are those which include neither "acne" nor "pimple". The actual task was a binary classification of whether a text is related to skincare or not.
It looks like the model "learned" that "acne" and "pimple" are very close: their embeddings are closer on average after fitting the model on the training data. I did not calculate the average distance of those embeddings, but visually they appear closer together.
That tells me that even after binary classification the embeddings could be used to improve semantic search. I'll do another test with a multi-label classification, but creating the training set needs some data wrangling. When I've found some time to run the test, I'll post the results here.
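To quantify the "closer on average" impression instead of eyeballing the UMAP plot, one could compute the mean pairwise cosine similarity within a group of texts, for the base and the fitted embeddings separately. A small sketch (hypothetical helper, toy vectors in place of the real embeddings):

```python
import numpy as np

def mean_pairwise_cosine(embs):
    """Average cosine similarity over all distinct pairs of row vectors."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = e @ e.T
    iu = np.triu_indices(len(embs), k=1)  # upper triangle, excluding the diagonal
    return sims[iu].mean()

# Toy example: a tight cluster scores higher than spread-out vectors
tight = np.array([[1.0, 0.0], [0.99, 0.14]])
spread = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mean_pairwise_cosine(tight) > mean_pairwise_cosine(spread))  # True
```

Running this on the embeddings of all "acne"/"pimple" texts before and after fitting would put a number on the visual impression.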
This is super neat! Thanks for sharing the UMAP comparison @Raidus!
Tangential question, are you uploading your model to the HF hub or you storing the fine-tuned model locally and then calling it to get the embeddings?
Very interesting experimental results. Out of curiosity, the model_sbert (all-MiniLM-L12-v2) SentenceTransformer is not fine-tuned on the data, right?
Hi, how can I train a SetFit model for semantic search assuming I don't have labeled data (say, only product descriptions)? How can I use the SetFit trainer to create positive and negative samples? As per the Hugging Face blog it needs a few labels to train, right? (Correct me if I am wrong.) Please help me understand how I can use just the product descriptions to train a SetFit model and then use it on my queries for semantic search.
Thanks
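For reference, SetFit's contrastive stage does rely on labels: examples sharing a label become positive pairs and examples with different labels become negative pairs, roughly like this simplified sketch (not SetFit's actual implementation):

```python
from itertools import combinations

def make_pairs(texts, labels):
    """Pair examples: same label -> positive (1.0), different labels -> negative (0.0)."""
    pairs = []
    for (t1, l1), (t2, l2) in combinations(zip(texts, labels), 2):
        pairs.append((t1, t2, 1.0 if l1 == l2 else 0.0))
    return pairs

texts = ["acne cream", "pimple gel", "red car wax"]
labels = [1, 1, 0]
for t1, t2, y in make_pairs(texts, labels):
    print(t1, "|", t2, "->", y)
```

Without any labels, this pairing scheme has nothing to go on, so people typically fall back on naturally paired text (e.g. product title + description) with a sentence-transformers loss such as MultipleNegativesRankingLoss, rather than SetFit itself.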
I have the same question. How can I fine-tune the embedding model for my RAG pipeline? I need a fast way to do it on my custom dataset.