
Poor results on Squad 1.0

Open rishabhjoshi opened this issue 3 years ago • 6 comments

Hi, I wanted to augment the SQuAD 1.0 dataset (i.e., without unanswerable questions). I trained a standard RoBERTa MRC model using the transformers library, which achieved 86.16 Exact Match and 92.31 F1 on the validation data. I also trained an autoencoder for 100 epochs as described, and the loss came down to about 0.04 with perfect regeneration.

I then tried running CRQDA to augment 30,000 samples. I removed the "NEG" parameter and added "SPAN = True" and "para", using the same hyperparameters (epsilon) that you used. Out of the 30,000 samples, only about 1,800 produced generated questions that passed selection (Jaccard similarity >= 0.3).

After manual inspection, I see that most of the generated questions are gibberish (especially when the Jaccard similarity is <= 0.8). Can you share some insight into what might be going wrong, and why the results are so poor given that both the MRC model and the autoencoder train perfectly?
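For reference, here is a minimal sketch of the token-level Jaccard filter described above. The exact tokenization CRQDA's selection step uses is an assumption here, and `jaccard` / `keep` are illustrative names, not functions from the repo:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two questions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def keep(original: str, generated: str, threshold: float = 0.3) -> bool:
    """Selection rule as described: keep a generated question only if
    its Jaccard similarity with the original question meets the threshold."""
    return jaccard(original, generated) >= threshold
```

Under this rule, a gibberish generation sharing almost no tokens with the source question would score near 0.0 and be dropped, which matches the low acceptance rate reported above.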

Any help would be greatly appreciated! Thanks!

rishabhjoshi avatar May 09 '21 23:05 rishabhjoshi


Here are some suggestions that might help you:

  1. "the generated questions are gibberish" -> Your autoencoder may have overfit. You could try pretraining the autoencoder on a large-scale Wikipedia dataset.
  2. "Out of 30000 samples, only 1800 samples had questions generated which were selected (the jaccard was >= 0.3)." -> The modification step size used at inference seems too large; try lowering this hyperparameter.
  3. You can also try setting "SPAN = False" and only adding "para".
  4. When fine-tuning the MRC model on the augmented dataset, you may need to adjust the "warmup steps" hyperparameter, since the amount of training data increases.
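On point 2, the effect of the step size can be pictured with a generic gradient-edit loop. This is a hedged sketch, not CRQDA's actual inference code; `mrc_loss_fn` and `revise_latent` are stand-in names for whatever scalar objective and revision routine drive the edit:

```python
import torch

def revise_latent(z0, mrc_loss_fn, step_size=0.01, n_steps=50):
    """Gradient-based revision of a question's latent code z0.

    mrc_loss_fn maps a latent code to a scalar loss. Each iteration
    takes a small gradient step on z; if step_size is too large, z is
    pushed far off the autoencoder's manifold, and the decoded question
    tends to come out as gibberish or fail the Jaccard filter.
    """
    z = z0.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        loss = mrc_loss_fn(z)
        (grad,) = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= step_size * grad  # smaller step_size => more conservative edits
    return z.detach()
```

With this picture, lowering the step size trades fewer accepted-but-gibberish outputs for slower, more conservative revisions.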

dayihengliu avatar May 10 '21 01:05 dayihengliu

Hi,

  1. We trained the autoencoder on the 2M corpus that you provide, replacing "train_file" in this script: https://github.com/dayihengliu/CRQDA/blob/master/crqda/run_train.sh.
  2. I will experiment with smaller step sizes. Personally, I don't see why the step-size hyperparameter would differ for the SQuAD 1.0 dataset, considering SQuAD 2.0 is just SQuAD 1.0 with some unanswerable questions added. However, I will still experiment with more step sizes.
  3. I tried SPAN = False as well, keeping only "para", and still got poor results.
  4. I have not reached that step, since the number of augmented data points is too low (1000 out of a possible 30000 attempts).

Would it be possible for you to release your trained autoencoder? Also, would it be possible to release the augmented dataset including other samples (not just unanswerable)?

Thanks!

rishabhjoshi avatar May 13 '21 03:05 rishabhjoshi


This work was done during my internship at Microsoft, but I have since left Microsoft. So far, I can only find the augmented unanswerable questions and the well-trained RoBERTa SQuAD 2.0 MRC model. Regarding the autoencoder, you can refer to https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT#quick-start-guide to download and preprocess the Wikipedia dataset.

dayihengliu avatar May 15 '21 04:05 dayihengliu

@rishabhjoshi Hi, did you solve this problem? I want to build on CRQDA, but judging from your description, the code may be hard to run. If you solved it, could you share the augmented SQuAD 1.1 (answerable) data with me? Thanks!

TingFree avatar May 31 '21 01:05 TingFree

@TingFree I was not able to reproduce the results on the SQuAD 1.0 dataset. I was hoping to get the authors' MRC model and autoencoder (although the MRC model and autoencoder I trained myself are quite good). I did try multiple hyperparameters, but could never get results as good as the authors reported for SQuAD 2.0.

rishabhjoshi avatar Jul 22 '21 22:07 rishabhjoshi

@rishabhjoshi Hi, have you reproduced CRQDA on SQuAD 2.0? I mean the same results as in the paper.

TingFree avatar Apr 16 '22 14:04 TingFree