uda
Number of augmented data in text classification tasks
Can you provide some more details about the augmented data used in the text classification tasks? E.g., for each labeled text, how many texts do you augment using back translation, and what is the total number of augmented examples used in each text classification task?
Thank you!
Hi, it depends on whether you use BERT or not. When BERT is used, the model converges to a good accuracy fairly quickly, so we only generate one paraphrase for each unlabeled example. When we use a random initialization, we generate 4 paraphrases per unlabeled example for the Amazon- and Yelp-based datasets and 64 paraphrases per unlabeled example for IMDB. You can get more information about the unlabeled set here.
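To illustrate the setup described above, here is a minimal sketch of the per-example paraphrase generation loop. The `translate_en_fr` and `translate_fr_en` functions are hypothetical stand-ins; in the real UDA pipeline these are trained NMT models decoded with sampling to produce diverse round-trip translations.

```python
# Sketch of back-translation augmentation: each unlabeled example is
# round-tripped en -> fr -> en several times with different sampling
# seeds, yielding one paraphrase per round trip.
# translate_en_fr / translate_fr_en below are placeholder stubs, NOT
# the actual UDA translation models.

def translate_en_fr(text, seed):
    # Placeholder: a real model would sample a French translation here.
    return f"[fr/{seed}] {text}"

def translate_fr_en(text, seed):
    # Placeholder: a real model would sample an English translation here.
    return f"{text} (paraphrase {seed})"

def back_translate(example, num_paraphrases):
    """Generate num_paraphrases paraphrases of one unlabeled example."""
    paraphrases = []
    for seed in range(num_paraphrases):
        french = translate_en_fr(example, seed)
        paraphrases.append(translate_fr_en(french, seed))
    return paraphrases

# One paraphrase per example when fine-tuning BERT; four per example
# for Amazon/Yelp with random initialization, as described above.
bert_aug = back_translate("the movie was great", 1)
scratch_aug = back_translate("the movie was great", 4)
print(len(bert_aug), len(scratch_aug))
```

The only per-dataset knob in this sketch is `num_paraphrases` (1 with BERT, 4 for Amazon/Yelp, 64 for IMDB with random initialization).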
Hi qizhex, I want to classify certain news articles into categories like regulatory news or agreement news, but my labeled dataset for training is very small. Can I use the same data augmentation approach for training? Also, will I be able to use the same code for my dataset?
You can use the same code for your dataset. But I am not sure if back translation would work well there.
Thank you.