
Choosing the datapoints that need to be annotated?

vahuja4 opened this issue 2 years ago • 19 comments

Hello,

I have a large set of unlabelled data on which I need to do text classification. Since few-shot text classification uses only a handful of datapoints per class, is there a systematic way to select which datapoints should be annotated?

Thank you!

vahuja4 avatar Feb 16 '23 05:02 vahuja4

Hello!

I'm not aware of any systematic approaches offered by the data labeling tools. I'm not quite sure how that would work, considering the tool would likely require some prior knowledge about the text in order to show you a varied distribution of texts to label. I would recommend using a data labelling tool, labeling as much as you feel like, and then experimenting with these two approaches:

  1. Using e.g. 80% of all labeled data as the train_dataset in the SetFitTrainer, with the remaining 20% as data in the eval_dataset. Note that the datasets will be unbalanced, i.e. some classes have more texts.
  2. Using the 80-20 split again, but now on a dataset that is preprocessed such that all classes have the same number of labeled samples (e.g. with 3 classes with 9, 12 and 14 labeled samples respectively, sample 9 labeled samples from each class).

I'm unsure which of the two approaches would lead to better results, as I've only experimented with the second approach, i.e. balanced datasets.
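
For the second approach, here is a minimal sketch of balancing the classes before the 80/20 split, assuming a Hugging Face `datasets.Dataset` with "text" and "label" columns; the function and variable names are just illustrative:

```python
from collections import defaultdict
from datasets import Dataset

def balance_and_split(dataset: Dataset, test_size: float = 0.2, seed: int = 42):
    """Downsample every class to the size of the rarest class, then split 80/20."""
    # Group example indices by label
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(dataset["label"]):
        indices_by_label[label].append(idx)

    # Keep only as many samples per class as the rarest class has
    n_per_class = min(len(idxs) for idxs in indices_by_label.values())
    balanced_indices = [i for idxs in indices_by_label.values() for i in idxs[:n_per_class]]

    balanced = dataset.select(balanced_indices).shuffle(seed=seed)
    return balanced.train_test_split(test_size=test_size, seed=seed)

# splits = balance_and_split(labeled_dataset)
# splits["train"] -> train_dataset for SetFitTrainer, splits["test"] -> eval_dataset
```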

tomaarsen avatar Feb 16 '23 07:02 tomaarsen

I did not mean to close this! I hit the "Close with comment" button accidentally while typing out my response.

I'd love to hear about your findings about the two approaches.

  • Tom Aarsen

tomaarsen avatar Feb 16 '23 07:02 tomaarsen

Thank you Tom! I will give it a try. Meanwhile, I came across small-text, an active learning library, and they have shown an example of how to use a querying strategy for text classification. Have you tried anything like that with success?

vahuja4 avatar Feb 16 '23 09:02 vahuja4

Although I am familiar with small-text, I have not tried their querying strategy. I'm looking at what I presume is the example now, and it looks quite interesting.

tomaarsen avatar Feb 16 '23 09:02 tomaarsen

Hi, please also see the following notebook for how Argilla implements active learning based on SetFit + small-text. small-text lets you choose different query strategies for selecting new samples from the unlabeled data based on SetFit predictions:

https://colab.research.google.com/github/webis-de/small-text/blob/main/examples/notebooks/03-active-learning-with-setfit.ipynb#scrollTo=2184e4b7

MosheWasserb avatar Feb 16 '23 18:02 MosheWasserb

Argilla has a small-text + SetFit tutorial on their docs site, too: https://docs.argilla.io/en/latest/tutorials/notebooks/training-textclassification-smalltext-activelearning.html

tomaarsen avatar Feb 16 '23 18:02 tomaarsen

I would go with a method that maximizes the diversity of samples, start annotating, and see how performance develops from there. I've had good results with "discriminative active learning", and it is simple to implement and use with SetFit.
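
For illustration, here is a rough sketch of one round of discriminative active learning on top of SetFit embeddings; it is not small-text's implementation, and `labeled_texts` / `unlabeled_texts` are hypothetical lists of strings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminative_query(model, labeled_texts, unlabeled_texts, n_to_label=20):
    # Embed both pools with the SetFit sentence-transformer body
    emb_labeled = model.model_body.encode(labeled_texts)
    emb_unlabeled = model.model_body.encode(unlabeled_texts)

    # Train a binary classifier: 0 = already labeled, 1 = still unlabeled
    X = np.vstack([emb_labeled, emb_unlabeled])
    y = np.concatenate([np.zeros(len(emb_labeled)), np.ones(len(emb_unlabeled))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Pick the unlabeled points that look *least* like the labeled pool
    scores = clf.predict_proba(emb_unlabeled)[:, 1]
    return np.argsort(-scores)[:n_to_label]
```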

kgourgou avatar Mar 04 '23 19:03 kgourgou

@chschroeder is quite an authority on the topic of active learning (in the context of SetFit, among others). Perhaps he has a moment to recommend a querying strategy.

tomaarsen avatar Mar 04 '23 21:03 tomaarsen

Thanks for the ping, Tom!

Depends on the problem. Do you know how many classes your dataset will have @vahuja4? How many samples does your dataset contain?

Without having that information: the advice to use diversity-based strategies sounds reasonable here. While I have evaluated and even documented uncertainty-based and coreset strategies in combination with SetFit, I have not tried discriminative active learning, so I cannot add anything here. I have used discriminative active learning in other contexts, though, and it is also implemented in small-text.

chschroeder avatar Mar 04 '23 22:03 chschroeder

@kgourgou - thank you! I will give it a shot. @chschroeder - thank you for your reply! The number of classes is 74 and the size of the corpus is around a million. The corpus isn't labelled yet. Any advice would be much appreciated!

vahuja4 avatar Mar 05 '23 08:03 vahuja4

@kgourgou @chschroeder - while the topic of AL is very fascinating, there seem to be a lot of unknowns. For example, do we keep changing the querying strategy for every AL loop? And how do we decide the number of datapoints to be labelled per loop - is it fixed or variable? It seems that the selection step itself could be framed as an ML problem, where the various querying strategies act like features and an ML model is used to optimize their weights. What do you think?

vahuja4 avatar Mar 05 '23 08:03 vahuja4

small-text looks amazing and fits a use-case I have!

@vahuja4 I expect that @chschroeder can give more precise answers than me, but in general it depends on how easy it is to annotate your data, what is the label quality, and how many labels you already have.

Typically you would start with a diversity strategy, get some labels to start with, then, if you know what kind of model you want to use, it usually doesn't hurt to go with an uncertainty-based method to pick the next samples; they are competitive on most benchmarks.
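
As an illustration of the uncertainty-based idea, here is a minimal least-confidence query sketch; it assumes a trained `SetFitModel` and a hypothetical `unlabeled_texts` list, and is not tied to any particular library:

```python
import numpy as np

def least_confidence_query(model, unlabeled_texts, n_to_label=20):
    probs = np.asarray(model.predict_proba(unlabeled_texts))  # (n_samples, n_classes)
    confidence = probs.max(axis=1)                            # confidence in the top prediction
    return np.argsort(confidence)[:n_to_label]                # least confident first
```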

Something that I have found useful is to combine active learning with weak labelling, e.g., getting some heuristics for the labels down as functions and using Snorkel to construct initial weak labels that you can refine once you get a few more strong labels.
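
To make the weak-labelling idea concrete, here is a tiny sketch with Snorkel; the class ids and keyword heuristics are purely illustrative, and `unlabeled_texts` is again a hypothetical list of strings:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, BILLING, SHIPPING = -1, 0, 1  # illustrative classes

@labeling_function()
def lf_billing(x):
    return BILLING if "invoice" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_shipping(x):
    return SHIPPING if "delivery" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": unlabeled_texts})
L_train = PandasLFApplier([lf_billing, lf_shipping]).apply(df)

# Combine the noisy votes into weak labels, to be refined with strong labels later
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train)
weak_labels = label_model.predict(L_train)
```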

kgourgou avatar Mar 05 '23 19:03 kgourgou

> @kgourgou - thank you! I will give it a shot. @chschroeder - thank you for your reply! The number of classes is 74 and the size of the corpus is around a million. The corpus isn't labelled yet. Any advice would be much appreciated!

With 74 classes it might still be possible to provide 2-3 samples per class as a starting point (depending on how long your documents are / how costly the annotation is). Are the classes mutually exclusive? Do some classes occur much more frequently than others?

> @kgourgou @chschroeder - while the topic of AL is very fascinating, there seem to be a lot of unknowns. For example, do we keep changing the querying strategy for every AL loop?

You are completely right here, and since this spans a really large hyperparameter space (even without considering your specific dataset), there is currently no one-size-fits-all approach, but there are things you can do wrong (e.g. not accounting for skewed class distributions). Changing query strategies in between loops is possible, but I don't think I have seen any evaluations of this recently.

What is your goal with this? 1) Maximum performance, i.e. even the last 0.5% in accuracy is important, or 2) maximum efficiency, i.e. as long as the resulting model is "good enough" you just want to minimize the labeling effort. Unless it is 1), you can opt for any strategy that is expected to do well.

A practical thing to keep in mind is turnaround time. If you choose a rather expensive query strategy the waiting time in between two iterations might nullify any gains in reduced labeling costs.

> And how do we decide the number of datapoints to be labelled per loop - is it fixed or variable? It seems that the selection step itself could be framed as an ML problem, where the various querying strategies act like features and an ML model is used to optimize their weights. What do you think?

The query size is variable. In theory it would be best to update after each sample. Since model training can be time-consuming, this is most often done in batches instead, and the number of samples you choose is a tradeoff between classification performance (or rather, the maximum information obtained from the model) and runtime. In your case, I would pick something between 20 and 74. If you pick a computationally cheap model and query strategy, or if you don't mind the waiting times, you might also go lower than that.
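
To make the batched loop concrete, here is a schematic sketch of such an iteration; `query_strategy(model, pool, n)` (e.g. one of the sketches above), the manual `get_labels_from_annotator(...)` step, the seed set, and `unlabeled_texts` are all hypothetical placeholders:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

QUERY_SIZE = 20        # tradeoff: information gained per round vs. turnaround time
NUM_ITERATIONS = 10

labeled = {"text": list(seed_texts), "label": list(seed_labels)}  # small seed set
pool = list(unlabeled_texts)                                      # large unlabeled corpus

for _ in range(NUM_ITERATIONS):
    # Retrain SetFit from scratch on everything labeled so far
    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    trainer = SetFitTrainer(model=model, train_dataset=Dataset.from_dict(labeled))
    trainer.train()

    # Query a batch, send it to the annotator, and fold the new labels back in
    indices = set(query_strategy(model, pool, n=QUERY_SIZE))
    new_texts = [pool[i] for i in indices]
    labeled["text"].extend(new_texts)
    labeled["label"].extend(get_labels_from_annotator(new_texts))
    pool = [text for i, text in enumerate(pool) if i not in indices]
```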

chschroeder avatar Mar 05 '23 21:03 chschroeder

> small-text looks amazing and fits a use-case I have!
>
> @vahuja4 I expect that @chschroeder can give more precise answers than me, but in general it depends on how easy it is to annotate your data, what is the label quality, and how many labels you already have.
>
> Typically you would start with a diversity strategy, get some labels to start with, then, if you know what kind of model you want to use, it usually doesn't hurt to go with an uncertainty-based method to pick the next samples; they are competitive on most benchmarks.
>
> Something that I have found useful is to combine active learning with weak labelling, e.g., getting some heuristics for the labels down as functions and using Snorkel to construct initial weak labels that you can refine once you get a few more strong labels.

Thank you! Happy to hear that.

Totally agree with your advice, except that I can give more precise answers ;). In the end, these pointwise evaluations from the benchmarks are all we have right now, and there is no "best" strategy.

Starting with diversity and switching to uncertainty can be a good way to control exploration versus exploitation. If there is any scientific interest behind it, I would be hesitant to do that, since you would need to argue at which point you switched strategies, but if the goal is just a viable dataset in the end, this is a good solution.

chschroeder avatar Mar 05 '23 21:03 chschroeder

@chschroeder congrats on the paper, it is excellent work and a significant contribution. AUC is a good idea when comparing few-shot models. Still, in practice (as a data scientist), I would be more interested in: what is the minimum number of instances I have to annotate to get x% of the maximum performance (full training)? For example, x% can be 99%. Is it possible to add such a metric?
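
For illustration, such a metric is straightforward to compute once you have a learning curve; the budgets and accuracies below are made-up numbers, purely to show the calculation:

```python
import numpy as np

# Learning curve: accuracy after annotating each budget of instances (made-up values)
budgets  = np.array([50, 100, 200, 400, 800, 1600])
accuracy = np.array([0.61, 0.70, 0.78, 0.83, 0.86, 0.87])
full_training_accuracy = 0.87   # accuracy with the full training set (made up)
x = 0.99                        # target fraction of the full-training performance

reached = accuracy >= x * full_training_accuracy
min_budget = int(budgets[reached][0]) if reached.any() else None
print(min_budget)               # -> 1600 for these made-up numbers
```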

MosheWasserb avatar Mar 06 '23 04:03 MosheWasserb

@chschroeder Are there a few simple rules of thumb to choose the best strategy for a given dataset, or should I try all of them? For example, I would like to apply this to an emotion dataset (6 classes).

MosheWasserb avatar Mar 06 '23 04:03 MosheWasserb

Thank you, @MosheWasserb! I am honored you already noticed. SetFit has been working great for me, thanks for that as well. Such a metric seems reasonable, but then how do you set x? Moreover, there is always the decision of maximum performance vs. maximum efficiency in active learning. This is an interesting thought though; the active learning evaluation procedure is far from perfect (but also difficult to improve).

chschroeder avatar Mar 06 '23 17:03 chschroeder

> @chschroeder Are there a few simple rules of thumb to choose the best strategy for a given dataset, or should I try all of them? For example, I would like to apply this to an emotion dataset (6 classes).

For active learning? I am somewhat biased towards keeping the runtime manageable for the experimental scenario, since I have many queries and repetitions that quickly increase the runtime. For this, uncertainty-based strategies have worked remarkably well. This was before SetFit, however, and in my SetFit experiments embedding-based strategies seemed viable again. I suspect that the trained embedding space is more favorable.

If you are thinking about few-shot learning here though, I would probably also go for an embedding-based strategy and try to cover the vector space intelligently (i.e., with a few representative points over all the classes that lie not too close to each other).
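
A simple way to approximate that coverage is to cluster the SetFit embeddings and take one representative per cluster; a minimal sketch, assuming a `SetFitModel` called `model` and a hypothetical `unlabeled_texts` list:

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_seed_selection(model, unlabeled_texts, n_to_label=74):
    embeddings = model.model_body.encode(unlabeled_texts)
    kmeans = KMeans(n_clusters=n_to_label, n_init=10).fit(embeddings)

    # For each cluster, take the point nearest to its centroid
    selected = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        selected.append(int(np.argmin(distances)))
    return sorted(set(selected))
```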

chschroeder avatar Mar 06 '23 18:03 chschroeder

Also, @MosheWasserb, in case the mention of the 6-class emotion dataset is not just a toy example: I have recently been informed that this dataset is not ideal for various reasons. See for example this article. It's a bit sensationalistic, but it gets the point across.

tomaarsen avatar Mar 06 '23 20:03 tomaarsen