setfit
Support for the differentiable head
This pull request adds support for the differentiable head, as mentioned in issue #8.
- Added `SetFitHead`, which inherits from Sentence Transformers' `Dense` to keep the APIs consistent
- Integrated `SetFitHead` into `SetFitModel`
- Integrated `SetFitHead` into `SetFitTrainer` (the sklearn-based head still works and its usage remains the same)
- Added new tests for `SetFitHead` (tested initialization for single/multiple targets, forward, and backward)
- Added new APIs to `SetFitTrainer`: `trainer.freeze()` and `trainer.unfreeze(keep_body_frozen)`
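To make the idea concrete, here is a rough, hypothetical sketch of what a differentiable head boils down to: a single linear layer mapping pooled sentence embeddings to class logits. The class name, dimensions, and defaults below are illustrative placeholders, not the actual `SetFitHead` implementation.

```python
import torch
from torch import nn


class ToyDifferentiableHead(nn.Module):
    """Illustrative stand-in: maps sentence embeddings to class logits."""

    def __init__(self, in_features: int = 768, out_features: int = 2) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Embeddings come from the frozen or fine-tuned sentence-transformer body
        return self.linear(embeddings)


head = ToyDifferentiableHead(in_features=768, out_features=2)
logits = head(torch.randn(8, 768))  # batch of 8 pooled sentence embeddings
print(logits.shape)  # torch.Size([8, 2])
```

Because the head is a plain `nn.Module` rather than an sklearn estimator, its gradients can flow back into the body, which is what enables the end-to-end training mode.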
Absolutely incredible work on implementing a pure `torch` model @blakechi 🔥! I've left a few comments / questions, but overall this is looking really good.
Would you mind sharing a code snippet in the PR description so that others can understand how the new API should work?
I'm also curious if you tested this implementation with some of the test datasets in our paper? It would be cool to know if (a) this implementation does better than our original one and (b) there are no subtle regressions with the `sklearn` version.
Thanks! Glad you like it!
Sure, will provide a snippet in the next comment. :)
Sorry, I might have rushed a bit. I wanted to share the implementation with you to make sure the APIs are correct, so I only tested the head by running one step (forward and backward) to check its gradients. I will test this implementation on the test datasets. If you have suggested scripts for me to run, I'm happy to hear about them!
Here is the snippet for using the differentiable head (partially copied from `README.md`):
```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
num_classes = 2
train_dataset = dataset["train"].shuffle(seed=42).select(range(8 * num_classes))
eval_dataset = dataset["validation"]

# Initialize `SetFitModel` with the differentiable head
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-albert-small-v2",
    use_differentiable_head=True,
    head_params={"out_features": num_classes},
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)

# Freeze head
trainer.freeze()

# Do contrastive training
trainer.train(num_epochs=1)

# Unfreeze head
trainer.unfreeze()
# Or unfreeze head and keep body frozen:
# trainer.unfreeze(keep_body_frozen=True)

# Train end-to-end
trainer.train(num_epochs=1)
```
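For readers curious what `trainer.freeze()` and `trainer.unfreeze(keep_body_frozen=...)` do conceptually, here is a rough, hypothetical sketch of the semantics in plain torch: toggling `requires_grad` on the head and body parameters. The stand-in modules below are illustrative, not the real SetFit internals.

```python
import torch
from torch import nn

body = nn.Linear(768, 768)  # stand-in for the sentence-transformer body
head = nn.Linear(768, 2)    # stand-in for the differentiable head


def set_requires_grad(module: nn.Module, value: bool) -> None:
    for param in module.parameters():
        param.requires_grad = value


# trainer.freeze(): freeze the head while the body is trained contrastively
set_requires_grad(head, False)

# trainer.unfreeze(keep_body_frozen=True): afterwards, train only the head
set_requires_grad(head, True)
set_requires_grad(body, False)

print(all(p.requires_grad for p in head.parameters()))  # True
print(any(p.requires_grad for p in body.parameters()))  # False
```

Parameters with `requires_grad=False` receive no gradients, so the optimizer leaves them untouched; this is the standard way to restrict training to one part of a model.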
Hey @blakechi, sorry for the delay in reviewing your latest changes! They're looking really good and there are just a few things we need to do before merging:

- Resolve the merge conflicts (from some unrelated PRs)
- Verify that running `scripts/setfit/run_fewshot.py` reproduces the results from our paper when using the `sklearn` head (I can do this)
- [Optional] Check if this PR can produce similar results as the `sklearn` head. We could just run a few experiments in Colab for e.g. the `emotion` dataset to see if we're in the same ballpark

How does that sound?
Hi @lewtun,

Sorry for the late reply, I was packed with other stuff. Yeah, that sounds great to me. :)

- Sure, just resolved the merge conflicts!
- Thanks! I also ran experiments using the `sklearn` head so that we can double-check yours against mine. :)
- I ran some experiments on the test set using the `pytorch` head with different numbers of epochs, and the results are similar to the performance reported in the paper. Please see the table below.
- Updated `run_fewshot.py` to support the differentiable head via `--classifier=pytorch`

Results that mimic Table 2 (N=8) in the paper (all pytorch heads use batch size = 16, optimizer = AdamW, L2 weight (weight decay) = 0, head learning rate = 1e-2, body learning rate = 1e-5):
| Head | SST-5 | Amazon-CF | CR | Emotion | EnronSpam | AGNews |
|---|---|---|---|---|---|---|
| sklearn | 43.9 (2.8) | 40.2 (9.3) | 88.2 (2.5) | 48.4 (4.6) | 89.6 (4.0) | 82.8 (3.0) |
| pytorch (freeze body, 25 epochs) | 43.9 (3.0) | 40.7 (12.6) | 88.8 (1.2) | 48.6 (4.0) | 88.6 (4.7) | 82.7 (2.8) |
| pytorch (freeze body, 50 epochs) | 44.4 (2.9) | 39.9 (11.7) | 89.0 (1.0) | 48.4 (5.2) | 89.1 (4.1) | 83.3 (2.9) |
| pytorch (end to end, 25 epochs) | 43.6 (2.2) | 40.6 (12.2) | 88.6 (1.4) | 46.3 (4.7) | 89.9 (3.6) | 83.0 (2.9) |
| pytorch (end to end, 50 epochs) | 43.0 (2.8) | 39.1 (11.6) | 88.8 (1.3) | 47.0 (3.1) | 89.3 (3.8) | 83.3 (2.7) |
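The optimizer setup used for these runs (AdamW with weight decay 0, head learning rate 1e-2, body learning rate 1e-5) can be expressed in plain torch with per-parameter-group options. A minimal sketch, with stand-in modules in place of the real body and head:

```python
import torch
from torch import nn

body = nn.Linear(768, 768)  # stand-in for the transformer body
head = nn.Linear(768, 2)    # stand-in for the differentiable head

# One optimizer, two parameter groups with different learning rates,
# matching the hyperparameters listed above.
optimizer = torch.optim.AdamW(
    [
        {"params": body.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-2},
    ],
    weight_decay=0.0,
)
print([group["lr"] for group in optimizer.param_groups])  # [1e-05, 0.01]
```

Using a much larger learning rate on the freshly initialized head than on the pretrained body is a common fine-tuning pattern, since the body's weights only need small adjustments.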
I also tried `SGD` and `Adam`, keeping the other parameters the same as above. For `SGD`, the performance dropped significantly (e.g. accuracy dropped to 1X.XX for `emotion`), so I excluded it here. Below are the results using `Adam`:
| Head | SST-5 | Amazon-CF | CR | Emotion | EnronSpam | AGNews |
|---|---|---|---|---|---|---|
| pytorch (freeze body, 25 epochs) | 42.9 (3.0) | 41.9 (12.4) | 88.5 (1.2) | 48.8 (5.4) | 89.9 (3.7) | 83.2 (2.8) |
| pytorch (end to end, 25 epochs) | 43.1 (3.1) | 41.6 (11.8) | 88.6 (1.4) | 48.7 (4.3) | 89.4 (4.5) | 83.4 (2.4) |
Good job @blakechi ! Many thanks.
Thanks for the final iteration - this looks great so I'm going to merge it now 🔥 !
Amazing contribution and thank you for working on it @blakechi 🤗
I want to thank you for your advice and reviews as well; they made this PR wonderful! I really like this collaboration! 🤗 @lewtun