setfit
No tutorial or guideline for Few-shot learning on multiclass text classification
I just want to use SBERT for few-shot multiclass text classification, but I couldn't find any tutorial or explanation for it. Can you tell me which multi_target_strategy and loss function I should use for multi-class text classification?
Hello! I'm afraid the documentation is a bit lacking in that department indeed. You can experiment with the different multi_target_strategy options from the README, but I think "multi-output" should be a good start. Beyond that, you don't have to override the default loss function; you can just leave it. The default is the recommended one.
- Tom Aarsen
I tried every option in the README, but in every case I ran into IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed. It seems that those multi_target_strategy options don't work for multiclass. Interestingly, if I don't use the multi_target_strategy parameter, training runs, but the accuracy is terrible.
It's probably due to your input data dimensions (I would presume your label dimension). I have it working with the multi_target_strategy parameter, though my accuracy is not very good. That said, I am working on a multilabel problem, not a multiclass one.
As far as I understand, multi-class refers to the setting where you predict one class out of multiple classes, whereas multi-label refers to the setting where you predict multiple labels (out of multiple classes). So if you have a multi-class setting in this sense, you would not want to enable the multi-target options.
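The distinction is easiest to see in the shape of the labels themselves. A minimal sketch (the class names and values are illustrative):

```python
# Multi-class: each example gets exactly one integer class id.
multiclass_labels = [0, 2, 1, 0]  # e.g. 0=negative, 1=neutral, 2=positive

# Multi-label: each example gets a binary vector with one slot per class,
# and several slots may be 1 at once.
multilabel_labels = [
    [1, 0, 0],  # only class 0 applies
    [1, 0, 1],  # classes 0 and 2 both apply
]
```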
Hi All,
I have a question on this same topic - I am working on a multi-class text classification problem. I have a couple of questions on the expected format for labels in the data -
- Do labels need to be integers?
- I understand that for binary classification they can be 0 and 1, but what about more than 2 classes? I am working on a sentiment analysis problem with 3 classes: positive, negative, and neutral. How should I format the labels in the dataset? I tried -1, 0, and 1 for negative, neutral, and positive respectively, but training failed with the error: "setfit IndexError: Target -1 is out of bounds."
I could really use some help here. Thanks!
Just to add to my last question, my problem is just a multi-class text classification problem and not a multi-label problem. One sentence/example will have only one label out of positive, negative or neutral. Thanks!
@utility-aagrawal You should one-hot encode the labels so that they are [0 or 1, 0 or 1, 0 or 1] where [negative, neutral, positive]
E.g., if a sentence is neutral, then your label should be [0, 1, 0].
This should answer both of your questions.
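A minimal sketch of that encoding (the class order is illustrative, and — as the rest of this thread shows — this vector format is what the multilabel strategies expect, while plain integers work for pure multi-class):

```python
CLASSES = ["negative", "neutral", "positive"]  # illustrative ordering

def one_hot(label: str) -> list[int]:
    """Return a binary vector with a 1 in the slot for `label`."""
    return [1 if c == label else 0 for c in CLASSES]

print(one_hot("neutral"))  # -> [0, 1, 0]
```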
@josh-yang92 Thanks a lot! And I don't need to use multi_target_strategy, right? That's for multilabel classification problems?
I am getting the following error after encoding my target variables as [1, 0, 0] for negative, [0, 1, 0] for neutral, and [0, 0, 1] for positive. I am not using multi_target_strategy since I don't have multiple target variables.
Suggestions are welcome!
@ByUnal Were you able to make it work for a multiclass text classification problem? I would love to hear your experience with this. Thanks!
@tomaarsen Do you have any recommendations as to how to handle labels in case of multiclass text-classification? Thanks!
You could compare your data format to the format used in this example: https://github.com/huggingface/setfit/blob/main/notebooks/text-classification.ipynb
@utility-aagrawal Hello there, and sorry for the late answer. I think your problem would be solved if you use [0, 1, 2] as target values for [neutral, positive, negative] instead of combinations of 1s and 0s. Your way works more like binary classification: you try to estimate whether a sample is positive or not (and likewise for the other classes).
Besides, do not define any multi_target_strategy for multi-class classification, since it didn't work in my case. I've managed to train the model this way. Hope it works; let me know if you need further help.
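Putting that advice together, a minimal multi-class training sketch might look like the following. This is a sketch assuming setfit >= 1.0 (older versions use SetFitTrainer instead of Trainer); the checkpoint name and example texts are illustrative, and running it downloads the model:

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer

# Plain integer labels, one per example -- no one-hot vectors,
# and no multi_target_strategy for a pure multi-class setup.
label2id = {"neutral": 0, "positive": 1, "negative": 2}
train_dataset = Dataset.from_dict({
    "text": [
        "It arrived on time.",
        "Absolutely loved it!",
        "Broke after one day.",
    ],
    "label": [label2id["neutral"], label2id["positive"], label2id["negative"]],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2"  # any SBERT checkpoint
)
trainer = Trainer(model=model, train_dataset=train_dataset)
trainer.train()

print(model.predict(["I really enjoyed this."]))
```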
@ByUnal Thanks a lot for your response! Using (0,1,2) without specifying any multi_target_strategy worked for me! I was able to train using that.
I have a couple of follow-up questions -
- I trained with 8 and 16 examples per class, and accuracy is in the 60s, which is not bad but unfortunately not good enough for my use case. Should I experiment with more examples per class, or, if I can gather more training data, should I go for training from scratch / fine-tuning a bigger model? Do you have experience with this?
- Do you have any tips for choosing training examples for setfit? Currently, I am randomly choosing n examples per class from my training data.
I appreciate your help!
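For reference, the random per-class selection described above can be sketched in plain Python like this (setfit also ships a sample_dataset helper that does the same for Hugging Face datasets); the function and variable names are illustrative:

```python
import random
from collections import defaultdict

def sample_per_class(texts, labels, n, seed=0):
    """Randomly pick up to n examples per class (stratified sampling)."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    rng = random.Random(seed)  # fixed seed for reproducible runs
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        for text in rng.sample(items, min(n, len(items))):
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels
```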
@utility-aagrawal The more the better in terms of data samples for each class. However, you need to run a bunch of experiments, because it depends on your data quality, data samples, classification model, and so forth. I think you should observe your results and decide how to proceed. For example, you can use a confusion matrix to understand which classes the model confuses.
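A tiny hand-rolled confusion matrix for integer labels looks like this (scikit-learn's sklearn.metrics.confusion_matrix does the same thing; this sketch just makes the layout explicit):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows = true class, columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in range(num_classes)]
            for t in range(num_classes)]

# Off-diagonal entries show which classes the model mixes up.
print(confusion_matrix([0, 0, 1, 2], [0, 1, 1, 2], 3))
```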
In my case, I had really imbalanced, low-quality data, so it didn't work as I expected. Anyway, you can try the following to increase your success rate:
- Preprocessing (removing stopwords, extra whitespaces, punctuation etc.)
- Fine-tuning (and Hyper-parameter optimization, of course)
- Using a bigger model (in fine-tuning)
- Increasing the number of samples per class (the more the better)
- If you can get much more data, maybe you won't need to deal with few-shot learning at all, and you can build your own network or just try pre-trained models like BERT.
This is all I can say from this point of view.
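The preprocessing item in the list above (stopwords, extra whitespace, punctuation) can be sketched with the standard library alone; the stopword list here is a deliberately tiny illustration, not a complete one:

```python
import re
import string

STOPWORDS = {"the", "a", "an", "is", "and", "or"}  # illustrative; use a full list

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(preprocess("The cat,  and the DOG!"))  # -> cat dog
```

Note that aggressive preprocessing is not always a win with sentence-transformer embeddings, so it is worth A/B-testing this step against raw text.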
Thanks @ByUnal ! I'll give that a try.