postgresml
[feature] Support SetFit: few-shot fine-tuning of Sentence Transformers, works with about 8 samples per class.
It would be great to add SetFit to postgresml.
See https://pypi.org/project/setfit/ and https://huggingface.co/docs/setfit
SetFit differs from other text classification libraries. Usually, to train or fine-tune a model you need thousands of samples per class. In this example, https://postgresml.org/docs/open-source/pgml/guides/llms/fine-tuning , the "train" split of the IMDB dataset contains 25K rows. There are 2 classes, so that is 12,500 samples per class.
Quoting the SetFit documentation:
It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset
The code where they train a classifier (again, they are classifying film reviews, so nothing really new here) is at https://huggingface.co/docs/setfit/main/quickstart#training
The sample_dataset function there draws only 8 samples for each class.
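To make the per-class sampling concrete, here is a rough, standard-library-only sketch of the idea. It is not SetFit's implementation: sample_per_class, its parameters, and the toy rows are all hypothetical names I made up for illustration. The real helper is sample_dataset from the setfit package.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Hypothetical helper: return up to num_samples rows for each distinct label."""
    # Group rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Never ask for more rows than the class actually has.
        k = min(num_samples, len(group))
        sampled.extend(rng.sample(group, k))
    return sampled


# Toy stand-in for a balanced two-class training split of 25K rows.
rows = [{"text": f"review {i}", "label": i % 2} for i in range(25_000)]

few_shot = sample_per_class(rows, num_samples=8)
print(len(few_shot))  # 16 rows in total: 8 per class
```

With 2 classes this keeps 16 of the 25K rows, which is the size of training set the SetFit quickstart works with.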
Compare this: 12,500 samples per class vs. 8 samples per class with SetFit.
In real life, in many cases, you can collect around 50 samples per class and use SetFit to train a model. Situations where you have tens of thousands of samples are quite rare. Let's support SetFit.