
[feature] Support SetFit: few-shot fine-tuning of Sentence Transformers, works with about 8 samples per class.

Open stargazer33 opened this issue 4 months ago • 0 comments

It would be great to add SetFit to PostgresML.

See https://pypi.org/project/setfit/ https://huggingface.co/docs/setfit

SetFit differs from other text classification libraries. Usually, to train or fine-tune a model you need thousands of samples per class. In this example, https://postgresml.org/docs/open-source/pgml/guides/llms/fine-tuning, the "train" split of the IMDB dataset contains 25K rows. There are 2 classes, so 12,500 samples per class.

Quoting the SetFit documentation:

It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset

The code where they train a classifier (again on film reviews, so nothing really new here) is at https://huggingface.co/docs/setfit/main/quickstart#training

The sample_dataset function there samples only 8 examples per class.
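To make the idea concrete, here is a rough stdlib-only sketch of per-class sampling. It is not SetFit's implementation; the helper name `sample_per_class`, the toy `reviews` list, and the fixed seed are all made up for illustration.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Return up to `num_samples` randomly chosen rows for each distinct label.

    Hypothetical helper: it only mimics the idea of few-shot sampling.
    """
    rng = random.Random(seed)

    # Group rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Draw the same number of rows from every class (fewer if a class is small).
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        sampled.extend(rng.sample(group, min(num_samples, len(group))))

    # Shuffle so the classes are interleaved in the result.
    rng.shuffle(sampled)
    return sampled


# Toy stand-in for a labeled dataset: 1000 rows, 2 balanced classes.
reviews = [{"text": f"review {i}", "label": i % 2} for i in range(1000)]

few_shot = sample_per_class(reviews, num_samples=8)
print(len(few_shot))  # 16 rows: 8 per class
```

With 2 classes this keeps 16 rows out of 1000, which is the size of training set the SetFit quickstart works with.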

Compare: 12,500 samples per class versus 8 samples per class with SetFit.

In real life, in many cases you can collect maybe 50 samples per class and use SetFit to train a model. Situations where you have tens of thousands of samples are quite rare. Let's support SetFit.

stargazer33, Oct 08 '24 22:10