
Multilabel classification training data

Open josh-yang92 opened this issue 2 years ago • 5 comments

Hi guys, just have a question regarding the training data for multilabel classification.

So, for multiclass classification, you can play around with the number of samples (K) per label; the higher K is, the better the performance. This is straightforward, since there is only one label per sample.

However, for multilabel classification, where a sample can have more than one label (and in many different combinations), how should we construct the training data? For example, would it be best to give an equal number of samples for every combination of labels? That would exponentially increase the amount of training data required, which would defeat the purpose of few-shot learning.

I am asking because I trained a model without considering the above and got not-so-great results (around 65% accuracy with 12 labels).

Thank you!

josh-yang92 avatar Aug 29 '23 09:08 josh-yang92

I have used SetFit when only a limited amount of labeled training data is available; if you have ample data, other frameworks are also an option. I have used the following code to balance unbalanced classes in the training data.

from setfit import SetFitModel
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

model = SetFitModel(
    model_body=SentenceTransformer("all-MiniLM-L6-v2"),
    model_head=OneVsRestClassifier(LogisticRegression(class_weight="balanced")),
    multi_target_strategy="one-vs-rest",
)

MattiL avatar Aug 29 '23 16:08 MattiL

@MattiL I am not sure you understood my question. I am talking about a multilabel problem, where the input can carry multiple labels at the same time, unlike a multiclass problem, where each sample has exactly one label out of many.

Balancing the training data for a multiclass problem is easy: you just balance the classes, or use a method like yours. However, a multilabel problem with n labels has sum(nCr) = 2^n - 1 possible label combinations. So, to my understanding, to achieve something like the proposed result you would need an equal number of examples for each and every combination, which would defeat the purpose of few-shot learning.
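To make the combinatorial blow-up concrete: summing nCr over r = 1..n gives 2^n - 1 non-empty label combinations, so for the 12 labels mentioned above that is already 4095. A quick check:

```python
import math

n = 12  # number of labels from the example above
# Sum of nCr over r = 1..n counts every non-empty label combination.
total = sum(math.comb(n, r) for r in range(1, n + 1))
print(total)       # 4095
print(2**n - 1)    # closed form, same value
```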

Hopefully I have explained myself better...

josh-yang92 avatar Aug 29 '23 18:08 josh-yang92

I have tried to use the balancing code for multilabel classification; I guess it might improve the accuracy. Multilabel has had little support in SetFit.

MattiL avatar Sep 03 '23 09:09 MattiL

You could use scikit-multilearn to create a balanced training dataset; its iterative_train_test_split performs a stratified split for multilabel data.

alejandrodumas avatar Sep 24 '23 22:09 alejandrodumas

For anyone wondering what the data should look like, here is a sample format:

'text', 'label1', 'label2', 'label3'

'this is a sentence', 0, 0, 1
'this is a sentence2', 1, 0, 1

singularity014 avatar Oct 16 '23 19:10 singularity014