Choosing the datapoints that need to be annotated?
Hello,
I have a large set of unlabelled data on which I need to do text classification. Since few-shot text classification uses only a handful of datapoints per class, is there a systematic way to choose which datapoints should be annotated?
Thank you!
Hello!
I'm not aware of any systematic approaches for this offered by the data labeling tools. I'm not quite sure how that would work, considering the tool would likely require some prior knowledge about the texts in order to show you a varied distribution of texts to label. I would recommend using a data labeling tool, labeling as much as you feel like, and then experimenting with these two approaches:
- Using e.g. 80% of all labeled data as the `train_dataset` in the `SetFitTrainer`, with the remaining 20% as data in the `eval_dataset`. Note that the datasets will be unbalanced, i.e. some classes have more texts.
- Using the 80-20 split again, but now on a dataset that is preprocessed such that all classes have the same number of labeled samples (e.g. with 3 classes with 9, 12 and 14 labeled samples respectively, sample 9 labeled samples from each class).
I'm unsure which of the two approaches would lead to better results, as I've only experimented with the second approach, i.e. balanced datasets.
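To make the two options concrete, here is a minimal sketch, assuming a Hugging Face `datasets.Dataset` with `text` and `label` columns and the `SetFitTrainer` API; the CSV file, the checkpoint and the `sample_dataset` balancing call are just illustrative and may need adapting to your setup:

```python
# Rough sketch only -- adjust column names, checkpoint and sampling to your data.
from datasets import load_dataset
from setfit import SetFitModel, SetFitTrainer, sample_dataset

# Hypothetical file with "text" and "label" columns.
dataset = load_dataset("csv", data_files="labeled.csv")["train"]
splits = dataset.train_test_split(test_size=0.2, seed=42)

# Approach 1: use the (possibly unbalanced) 80% split as-is.
train_dataset = splits["train"]

# Approach 2: balance the training split first, e.g. 9 samples per class
# (sample_dataset is a SetFit helper; check its signature for your version).
# train_dataset = sample_dataset(splits["train"], label_column="label", num_samples=9)

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=splits["test"],
)
trainer.train()
print(trainer.evaluate())
```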
I did not mean to close this! I hit the "Close with comment" button accidentally while typing out my response.
I'd love to hear about your findings about the two approaches.
- Tom Aarsen
Thank you Tom! I will give it a try. Meanwhile, I came across small-text, an active learning library, and they have shown an example of how to use a querying strategy for text classification. Have you tried anything like that with success?
Although I am familiar with small-text, I have not tried their querying strategy. I'm looking at what I presume is the example now, and it looks quite interesting.
Hi, please also see in the following notebook how Argilla implements active learning based on SetFit + small-text. small-text allows you to activate different strategies for selecting new queries (i.e. samples to label) from the unlabeled data based on SetFit predictions:
https://colab.research.google.com/github/webis-de/small-text/blob/main/examples/notebooks/03-active-learning-with-setfit.ipynb#scrollTo=2184e4b7
Argilla has a small-text + SetFit tutorial on their docs site, too: https://docs.argilla.io/en/latest/tutorials/notebooks/training-textclassification-smalltext-activelearning.html
I would go with a method that maximizes the diversity of samples, start annotating, and see how performance develops from there. I've had good results with "discriminative active learning", and it is simple to implement and use with SetFit.
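If it helps, this is roughly how I would sketch the idea on top of sentence embeddings (the helper name and checkpoint are mine, not from any library): train a classifier to tell labeled from unlabeled points, then query the points that look least like what you have already labeled.

```python
# Hedged sketch of discriminative active learning on sentence embeddings;
# `discriminative_query` is an illustrative helper, not a library function.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

def discriminative_query(texts, labeled_idx, n_queries=20,
                         model_name="sentence-transformers/paraphrase-mpnet-base-v2"):
    embeddings = SentenceTransformer(model_name).encode(texts)
    is_labeled = np.zeros(len(texts), dtype=int)
    is_labeled[list(labeled_idx)] = 1
    # Train a discriminator between labeled (1) and unlabeled (0) points.
    clf = LogisticRegression(max_iter=1000).fit(embeddings, is_labeled)
    p_labeled = clf.predict_proba(embeddings)[:, 1]
    # Query the unlabeled points that look least like the labeled set.
    labeled_set = set(labeled_idx)
    ranked = np.argsort(p_labeled)
    return [int(i) for i in ranked if i not in labeled_set][:n_queries]
```

The original formulation selects the batch in several sub-steps and retrains the discriminator in between, but this shows the core idea.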
@chschroeder is quite an authority on the topic of active learning (in context of SetFit, among others). Perhaps he has a moment of time to recommend a querying strategy.
Thanks for the ping, Tom!
Depends on the problem. Do you know how many classes your dataset will have @vahuja4? How many samples does your dataset contain?
Without that information: the advice to use diversity-based strategies sounds reasonable here. While I have evaluated and even documented uncertainty-based and coreset strategies in combination with SetFit, I have not tried discriminative active learning, so I cannot add anything here. I have used discriminative active learning in other contexts, though, and it is also implemented in small-text.
@kgourgou - thank you! I will give it a shot. @chschroeder - thank you for your reply! The number of classes is 74 and the size of the corpus is around a million. The corpus isn't labelled yet. Any advice would be much appreciated!
@kgourgou @chschroeder - while the topic of AL is very fascinating, there seem to be a lot of unknowns. For example, do we keep changing the querying strategy for every AL loop? And how do we decide the number of datapoints to be labelled per loop - is it fixed or variable? It seems that the selection step itself could be framed as an ML problem, where various querying strategies act like features and an ML model is used to optimize the weights of those features. What do you think?
small-text looks amazing and fits a use-case I have!
@vahuja4 I expect that @chschroeder can give more precise answers than me, but in general it depends on how easy it is to annotate your data, what the label quality is, and how many labels you already have.
Typically you would start with a diversity strategy to get some initial labels; then, if you know what kind of model you want to use, it usually doesn't hurt to switch to an uncertainty-based method to pick the next samples, as they are competitive on most benchmarks.
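As a sketch, a least-confidence query on top of a trained SetFit model can be as small as this (the helper is illustrative; the exact output type of `predict_proba` may vary between versions):

```python
# Minimal least-confidence (uncertainty) sampling sketch on top of a trained
# SetFit model; `model` is any object exposing predict_proba over raw texts.
import numpy as np

def least_confidence_query(model, unlabeled_texts, n_queries=20):
    probs = np.asarray(model.predict_proba(unlabeled_texts))  # (n_texts, n_classes)
    confidence = probs.max(axis=1)  # probability of the predicted class per text
    return np.argsort(confidence)[:n_queries].tolist()  # least confident first
```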
Something that I have found useful is to combine active learning with weak labelling, e.g., getting some heuristics for the labels down as functions and using Snorkel to construct initial weak labels that you can refine once you get a few more strong labels.
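To make the weak-labelling part concrete, a toy Snorkel sketch could look like this (the classes and keyword heuristics are made up):

```python
# Toy Snorkel sketch: keyword heuristics as labeling functions produce weak
# labels that can seed a first training set (classes and keywords are made up).
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, REFUND, SHIPPING = -1, 0, 1  # hypothetical label scheme

@labeling_function()
def lf_refund(x):
    return REFUND if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_shipping(x):
    return SHIPPING if "delivery" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": ["I want a refund", "Where is my delivery?", "Hello"]})
L = PandasLFApplier([lf_refund, lf_shipping]).apply(df)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L)
df["weak_label"] = label_model.predict(L)  # -1 where no heuristic fires
print(df)
```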
> @kgourgou - thank you! I will give it a shot. @chschroeder - thank you for your reply! The number of classes is 74 and the size of the corpus is around a million. The corpus isn't labelled yet. Any advice would be much appreciated!
With 74 classes it might still be possible to provide 2-3 samples per class as a starting point (depending on how long your documents are / how costly the annotation is). Are the classes mutually exclusive? Do some classes occur much more frequently than others?
> @kgourgou @chschroeder - while the topic of AL is very fascinating, there seem to be a lot of unknowns. For example, do we keep changing the querying strategy for every AL loop?
You are completely right here, and since this spans a really large hyperparameter space (even before considering your specific dataset), there is currently no one-size-fits-all approach, but there are things you can do wrong (e.g. not accounting for skewed class distributions). Changing query strategies between loops is possible, but I don't think I have seen any evaluations of this recently.
What is your goal with this? 1) Maximum performance, i.e. even the last 0.5% in accuracy is important, or 2) maximum efficiency, i.e. as long as the resulting model is "good enough" you just want to minimize the labeling effort. Unless it is 1), you can opt for any strategy that is expected to do well.
A practical thing to keep in mind is turnaround time. If you choose a rather expensive query strategy the waiting time in between two iterations might nullify any gains in reduced labeling costs.
> And how do we decide the number of datapoints to be labelled per loop - is it fixed or variable? It seems that the selection step itself could be framed as an ML problem, where various querying strategies act like features and an ML model is used to optimize the weights of those features. What do you think?
The query size is variable. In theory it would be best to update after each sample. Since model training can be time-consuming, this is most often done in batches instead, and the number of samples you choose is a tradeoff between classification performance (or rather the maximum information obtained from the model) and runtime. In your case, I would pick something between 20 and 74. If you pick a computationally cheap model and query strategy, or if you don't mind the waiting times, you might also go lower than that.
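As a rough skeleton, not tied to small-text, just to show where the query size enters the loop (`annotate` stands in for your labeling tool and `query_fn` for whichever query strategy you pick; both are placeholders):

```python
# Skeleton of a batched active-learning loop with a fixed query size of 30;
# `annotate` (human labeling) and `query_fn` (query strategy) are placeholders.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

def run_active_learning(unlabeled_texts, annotate, query_fn,
                        n_iterations=10, query_size=30):
    labeled = {"text": [], "label": []}
    model = None
    for _ in range(n_iterations):
        if model is None:
            # No model yet: take the first texts (replace with a random or
            # diversity-based pick for the initial batch).
            query_idx = list(range(min(query_size, len(unlabeled_texts))))
        else:
            query_idx = query_fn(model, unlabeled_texts, query_size)
        # Move the queried texts from the pool into the labeled set.
        for i in sorted(query_idx, reverse=True):
            labeled["text"].append(unlabeled_texts[i])
            labeled["label"].append(annotate(unlabeled_texts[i]))
            unlabeled_texts.pop(i)
        # Retrain SetFit from scratch on everything labeled so far.
        model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
        trainer = SetFitTrainer(model=model, train_dataset=Dataset.from_dict(labeled))
        trainer.train()
    return model, labeled
```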
> small-text looks amazing and fits a use-case I have!

> @vahuja4 I expect that @chschroeder can give more precise answers than me, but in general it depends on how easy it is to annotate your data, what the label quality is, and how many labels you already have.
> Typically you would start with a diversity strategy to get some initial labels; then, if you know what kind of model you want to use, it usually doesn't hurt to switch to an uncertainty-based method to pick the next samples, as they are competitive on most benchmarks.
> Something that I have found useful is to combine active learning with weak labelling, e.g., getting some heuristics for the labels down as functions and using Snorkel to construct initial weak labels that you can refine once you get a few more strong labels.
Thank you! Happy to hear that.
Totally agree with your advice, except that I can give more precise answers ;). In the end, these pointwise evaluations from the benchmarks are all we have right now, and there is no "best" strategy.
Starting with diversity and switching to uncertainty can be a good way to control exploration versus exploitation. If there is any scientific interest behind it, I would be wary of that, since you would need to argue at which point you switched strategies, but if the goal is just a viable dataset in the end, this is a good solution.
@chschroeder congrats on the paper, it is excellent work and a significant contribution. AUC is a good idea when comparing few-shot models. Still, in practice (as a data scientist) I would be more interested in: what is the minimum number of instances I have to annotate to get x% of the maximum performance (full training)? For example, x can be 99%. Is it possible to add such a metric?
@chschroeder Are there a few simple rules of thumb for choosing the best strategy for a given dataset, or should I try all of them? For example, I would like to apply it to an emotion dataset (6 classes).
Thank you, @MosheWasserb! I am honored you already noticed. SetFit has been working great for me, thanks for that as well. Such a metric seems reasonable, but then how do you set x? Moreover, there is always the decision of maximum performance vs. maximum efficiency for active learning. This is an interesting thought, though; the active learning evaluation procedure is far from perfect (but also difficult to improve).
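Just to pin down the metric you propose, it could be computed from a learning curve like this (all numbers are made up):

```python
# Toy illustration of the proposed metric: the smallest annotation budget whose
# accuracy reaches x% of the full-training accuracy (all numbers are made up).
def min_budget_for_fraction(learning_curve, full_accuracy, x=0.95):
    # learning_curve: (num_labeled, accuracy) pairs, sorted by num_labeled.
    for num_labeled, accuracy in learning_curve:
        if accuracy >= x * full_accuracy:
            return num_labeled
    return None  # target never reached within the measured budgets

curve = [(8, 0.71), (16, 0.80), (32, 0.86), (64, 0.88)]
print(min_budget_for_fraction(curve, full_accuracy=0.89, x=0.95))  # -> 32
```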
> @chschroeder Are there a few simple rules of thumb for choosing the best strategy for a given dataset, or should I try all of them? For example, I would like to apply it to an emotion dataset (6 classes).
For active learning? I am somewhat biased towards keeping the runtime manageable for the experimental scenario, since I have many queries and repetitions that quickly increase the runtime. For this, uncertainty-based strategies have worked remarkably well. This was before SetFit, however, and in my SetFit experiments embedding-based strategies seemed viable again. I suspect that the trained embedding space is more favorable.
If you are thinking about few-shot learning here though, I would probably also go for an embedding-based strategy and try to cover the vector space intelligently (i.e., with a few representative points over all the classes that lie not too close to each other).
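One simple way to pick such representative points, as a sketch with an illustrative helper and checkpoint: cluster the embeddings with k-means and take the text closest to each cluster centre.

```python
# Sketch: cover the embedding space with k-means and pick the text closest to
# each cluster centre as a representative, well-spread starting set.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def representative_sample(texts, n_samples=12,
                          model_name="sentence-transformers/paraphrase-mpnet-base-v2"):
    embeddings = SentenceTransformer(model_name).encode(texts)
    kmeans = KMeans(n_clusters=n_samples, n_init=10).fit(embeddings)
    chosen = []
    for centre in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centre, axis=1)
        chosen.append(int(np.argmin(distances)))
    return sorted(set(chosen))
```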
Also, @MosheWasserb, in case the mention of the 6-class emotion dataset is not just a toy example: I have recently been informed that this dataset is not ideal for various reasons. See for example this article. It's a bit sensationalistic, but it gets the point across.