setfit
Support for the differentiable head
This pull request adds support for the differentiable head, as mentioned in issue #8.
- Added `SetFitHead`, which inherits from Sentence Transformers' `Dense` to keep the APIs consistent
- Integrated `SetFitHead` into `SetFitModel`
- Integrated `SetFitHead` into `SetFitTrainer` (the sklearn-based head still works and its usage remains the same)
- Added new tests for `SetFitHead` (tested initialization for single/multiple targets, forward, and backward)
- Added new APIs to `SetFitTrainer`: `trainer.freeze()` and `trainer.unfreeze(keep_body_frozen)`
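To make the idea concrete, here is a rough, hypothetical sketch of what a differentiable head boils down to: a single linear layer mapping pooled sentence embeddings to class logits. The class name, dimensions, and defaults below are illustrative placeholders, not the actual `SetFitHead` implementation.

```python
import torch
from torch import nn


class ToyDifferentiableHead(nn.Module):
    """Illustrative stand-in: maps sentence embeddings to class logits."""

    def __init__(self, in_features: int = 768, out_features: int = 2) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Embeddings come from the frozen or fine-tuned sentence-transformer body
        return self.linear(embeddings)


head = ToyDifferentiableHead(in_features=768, out_features=2)
logits = head(torch.randn(8, 768))  # batch of 8 pooled sentence embeddings
print(logits.shape)  # torch.Size([8, 2])
```

Because the head is a plain `nn.Module` rather than an sklearn estimator, its gradients can flow back into the body, which is what enables the end-to-end training mode.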
Absolutely incredible work on implementing a pure `torch` model @blakechi 🔥! I've left a few comments / questions, but overall this is looking really good.
Would you mind sharing a code snippet in the PR description so that others can understand how the new API should work?
I'm also curious if you tested this implementation with some of the test datasets in our paper? It would be cool to know if (a) this implementation does better than our original one and (b) there are no subtle regressions with the `sklearn` version.
Thanks! Glad you like it!
Sure, will provide a snippet in the next comment. :)
Sorry, I might have rushed a bit. I wanted to share the implementation with you to make sure the APIs are correct, so I only tested the head by running one step (forward and backward) to check its gradients. I will test this implementation on the test datasets. If you have suggested scripts for me to run, I'm happy to hear about them!
Here is the snippet for using the differentiable head (partially copied from `README.md`):
```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
num_classes = 2
train_dataset = dataset["train"].shuffle(seed=42).select(range(8 * num_classes))
eval_dataset = dataset["validation"]

# Initialize `SetFitModel` with the differentiable head
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-albert-small-v2",
    use_differentiable_head=True,
    head_params={"out_features": num_classes},
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)

# Freeze head
trainer.freeze()

# Do contrastive training
trainer.train(num_epochs=1)

# Unfreeze head
trainer.unfreeze()
# Or unfreeze head and keep body frozen:
# trainer.unfreeze(keep_body_frozen=True)

# Train end-to-end
trainer.train(num_epochs=1)
```
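For readers curious what `trainer.freeze()` and `trainer.unfreeze(keep_body_frozen=...)` do conceptually, here is a rough, hypothetical sketch of the semantics in plain torch: toggling `requires_grad` on the head and body parameters. The stand-in modules below are illustrative, not the real SetFit internals.

```python
import torch
from torch import nn

body = nn.Linear(768, 768)  # stand-in for the sentence-transformer body
head = nn.Linear(768, 2)    # stand-in for the differentiable head


def set_requires_grad(module: nn.Module, value: bool) -> None:
    for param in module.parameters():
        param.requires_grad = value


# trainer.freeze(): freeze the head while the body is trained contrastively
set_requires_grad(head, False)

# trainer.unfreeze(keep_body_frozen=True): afterwards, train only the head
set_requires_grad(head, True)
set_requires_grad(body, False)

print(all(p.requires_grad for p in head.parameters()))  # True
print(any(p.requires_grad for p in body.parameters()))  # False
```

Parameters with `requires_grad=False` receive no gradients, so the optimizer leaves them untouched; this is the standard way to restrict training to one part of a model.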
Hey @blakechi, sorry for the delay in reviewing your latest changes! They're looking really good and there are just a few things we need to do before merging:

- Resolve the merge conflicts (from some unrelated PRs)
- Verify that running `scripts/setfit/run_fewshot.py` reproduces the results from our paper when using the `sklearn` head (I can do this)
- [Optional] Check if this PR can produce similar results as the `sklearn` head. We could just run a few experiments in Colab for e.g. the `emotion` dataset to see if we're in the same ballpark

How does that sound?
Hi @lewtun,

Sorry for the late reply, I was packed with other stuff. Yeah, that sounds great to me. :)

- Sure, just resolved the merge conflicts!
- Thanks! I also ran experiments using the `sklearn` head so that we can double-check yours against mine. :)
- I ran some experiments on the test set using the `pytorch` head with different numbers of epochs, and the results are similar to the performance reported in the paper. Please see the table below.
- Updated `run_fewshot.py` to support the differentiable head via `--classifier=pytorch`

Results that mimic Table 2 (N=8) in the paper (all pytorch heads use batch size = 16, optimizer = AdamW, L2 weight (weight decay) = 0, head learning rate = 1e-2, body learning rate = 1e-5):
| Head | SST-5 | Amazon-CF | CR | Emotion | EnronSpam | AGNews |
|---|---|---|---|---|---|---|
| sklearn | 43.9 (2.8) | 40.2 (9.3) | 88.2 (2.5) | 48.4 (4.6) | 89.6 (4.0) | 82.8 (3.0) |
| pytorch (freeze body, 25 epochs) | 43.9 (3.0) | 40.7 (12.6) | 88.8 (1.2) | 48.6 (4.0) | 88.6 (4.7) | 82.7 (2.8) |
| pytorch (freeze body, 50 epochs) | 44.4 (2.9) | 39.9 (11.7) | 89.0 (1.0) | 48.4 (5.2) | 89.1 (4.1) | 83.3 (2.9) |
| pytorch (end to end, 25 epochs) | 43.6 (2.2) | 40.6 (12.2) | 88.6 (1.4) | 46.3 (4.7) | 89.9 (3.6) | 83.0 (2.9) |
| pytorch (end to end, 50 epochs) | 43.0 (2.8) | 39.1 (11.6) | 88.8 (1.3) | 47.0 (3.1) | 89.3 (3.8) | 83.3 (2.7) |
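The optimizer setup used for these runs (AdamW with weight decay 0, head learning rate 1e-2, body learning rate 1e-5) can be expressed in plain torch with per-parameter-group options. A minimal sketch, with stand-in modules in place of the real body and head:

```python
import torch
from torch import nn

body = nn.Linear(768, 768)  # stand-in for the transformer body
head = nn.Linear(768, 2)    # stand-in for the differentiable head

# One optimizer, two parameter groups with different learning rates,
# matching the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    [
        {"params": body.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    weight_decay=0.0,
)
print([group["lr"] for group in optimizer.param_groups])  # [1e-05, 0.01]
```

Using a much larger learning rate on the freshly initialized head than on the pretrained body is a common fine-tuning pattern, since the body's weights only need small adjustments.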
I also tried `SGD` and `Adam`, keeping the other parameters the same as above. For `SGD`, the performance dropped significantly (e.g. accuracy dropped to 1X.XX for `emotion`), so I excluded it here. Below are the results using `Adam`:
| Head | SST-5 | Amazon-CF | CR | Emotion | EnronSpam | AGNews |
|---|---|---|---|---|---|---|
| pytorch (freeze body, 25 epochs) | 42.9 (3.0) | 41.9 (12.4) | 88.5 (1.2) | 48.8 (5.4) | 89.9 (3.7) | 83.2 (2.8) |
| pytorch (end to end, 25 epochs) | 43.1 (3.1) | 41.6 (11.8) | 88.6 (1.4) | 48.7 (4.3) | 89.4 (4.5) | 83.4 (2.4) |
Good job @blakechi ! Many thanks.
Thanks for the final iteration - this looks great so I'm going to merge it now 🔥 !
Amazing contribution and thank you for working on it @blakechi 🤗
I want to thank you for your advice and reviews as well; they made this PR wonderful! I really like this collaboration! 🤗 @lewtun