
Number of augmented data in text classification tasks

Open haozheji opened this issue 6 years ago • 4 comments

Can you provide some more details about the augmented data used in the text classification tasks? E.g., for each labeled text, how many texts do you generate using back translation, and what is the total number of augmented examples used in each text classification task?

Thank you!

haozheji avatar Sep 24 '19 04:09 haozheji

Hi, it depends on whether you use BERT or not. When BERT is used, the model converges to a good accuracy fairly quickly, so we only generate one paraphrase for each unlabeled example. When we use a random initialization, we generate 4 paraphrases for each unlabeled example for the Amazon- and Yelp-based datasets and 64 paraphrases for each unlabeled example for IMDB. You can get more information about the unlabeled set here.

michaelpulsewidth avatar Sep 26 '19 04:09 michaelpulsewidth
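The counts above can be sketched as a simple augmentation loop. This is a minimal illustration, not the repository's actual pipeline: the `translate_en_to_pivot` / `translate_pivot_to_en` functions below are hypothetical stubs standing in for a real NMT model with sampled decoding, which is what produces diverse paraphrases in practice.

```python
import random

def translate_en_to_pivot(text, temperature=0.9):
    # Placeholder for a real English->pivot NMT model.
    # A higher sampling temperature yields more diverse translations.
    return f"pivot({text})"

def translate_pivot_to_en(text, temperature=0.9):
    # Placeholder for the reverse pivot->English model; the random tag
    # mimics the variation that sampled decoding would introduce.
    return f"en[{text}]#{random.random():.3f}"

def back_translate(examples, num_paraphrases, temperature=0.9):
    """Generate `num_paraphrases` paraphrases per unlabeled example
    via round-trip translation through a pivot language."""
    augmented = []
    for text in examples:
        for _ in range(num_paraphrases):
            pivot = translate_en_to_pivot(text, temperature)
            augmented.append(translate_pivot_to_en(pivot, temperature))
    return augmented

# Counts quoted in this thread: 1 paraphrase when training with BERT,
# 4 for the Amazon/Yelp datasets, 64 for IMDB.
unlabeled = ["the movie was great", "terrible service"]
augmented = back_translate(unlabeled, num_paraphrases=4)
print(len(augmented))  # 4 paraphrases x 2 examples = 8
```

The total augmented set size is simply `num_paraphrases * len(unlabeled)`, which is why the IMDB setting (64 per example) is far larger than the BERT setting (1 per example).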

Hi qizhex, I want to classify news articles into categories like regulatory news or agreement news, but my labeled dataset for training is very small. Can I use the same data augmentation approach for training? Also, will I be able to use the same code for my dataset?

tuner007 avatar Nov 01 '19 12:11 tuner007

You can use the same code for your dataset. But I am not sure if back translation would work well there.

michaelpulsewidth avatar Nov 01 '19 23:11 michaelpulsewidth

> Hi, it depends on whether you use BERT or not. When BERT is used, the model converges to a good accuracy fairly quickly, so we only generate one paraphrase for each unlabeled example. [...]

Thank you.

guotong1988 avatar Jun 05 '20 09:06 guotong1988