
Data Augmentation

Open andra-pumnea opened this issue 4 years ago • 4 comments

Experiment with different methods for data augmentation, report results and compare to baseline.

andra-pumnea avatar Mar 21 '20 10:03 andra-pumnea

I will look into back translation later

borhenryk avatar Mar 21 '20 10:03 borhenryk

There is a possibility to use PPDB to generate additional paraphrased questions: http://paraphrase.org/#/download
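A minimal sketch of how PPDB could be used for this: the `|||`-separated row layout is assumed from the PPDB release format (only the first three fields are used), and the toy rows below are made up for illustration, not taken from a real PPDB file.

```python
import re
from collections import defaultdict

def load_ppdb(lines):
    """Build a phrase -> paraphrases map from PPDB-style lines.

    PPDB rows are '|||'-separated: LHS ||| phrase ||| paraphrase ||| ...
    (format assumed; only the first three fields are used here).
    """
    table = defaultdict(list)
    for line in lines:
        parts = [p.strip() for p in line.split("|||")]
        if len(parts) >= 3:
            table[parts[1]].append(parts[2])
    return table

def paraphrase_question(question, table):
    """Generate question variants by swapping one known phrase at a time."""
    variants = []
    for phrase, paraphrases in table.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, question):
            for alt in paraphrases:
                variants.append(re.sub(pattern, alt, question, count=1))
    return variants

# Toy excerpt; real files come from the paraphrase.org download page.
sample = [
    "[VP] ||| spread ||| transmitted ||| score=4.2",
    "[NP] ||| symptoms ||| signs of illness ||| score=3.9",
]
table = load_ppdb(sample)
print(paraphrase_question("How is the virus spread?", table))
```

In practice one would filter the generated variants by the paraphrase scores PPDB ships with, since low-scoring pairs can change the question's meaning.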

stedomedo avatar Mar 22 '20 08:03 stedomedo

Any updates on creating more questions?

Maybe @HenrykBorzymowski can use the MS Azure translator here for backtranslation? They offer 2M free characters per month, I heard :)

Timoeller avatar Mar 23 '20 18:03 Timoeller

I have tried the google/uda project (https://github.com/google-research/uda). It has a back-translation component that takes existing sentences, translates them into French and then back into English with different temperature parameters, which increases the sample size of the existing dataset.
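The round-trip idea can be sketched like this. The `en_to_fr`/`fr_to_en` callables below are hypothetical stand-ins for real translation models (the UDA code uses a trained WMT en-fr model; with sampling temperature > 0 the round trip would produce varied paraphrases), and the sentence splitter is deliberately naive.

```python
import re

def split_sentences(paragraph):
    """Naive sentence splitter (the UDA pipeline also works per sentence)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def back_translate(paragraph, en_to_fr, fr_to_en):
    """Round-trip each sentence en -> fr -> en and rejoin the paragraph."""
    round_trip = [fr_to_en(en_to_fr(s)) for s in split_sentences(paragraph)]
    return " ".join(round_trip)

# Hypothetical lookup-table "models", just to show the data flow.
en_fr = {"How does the virus spread?": "Comment le virus se propage-t-il ?"}
fr_en = {"Comment le virus se propage-t-il ?": "How is the virus transmitted?"}

augmented = back_translate(
    "How does the virus spread?",
    en_to_fr=lambda s: en_fr.get(s, s),
    fr_to_en=lambda s: fr_en.get(s, s),
)
print(augmented)
```

Swapping the lambdas for real model calls gives the actual augmentation pipeline; everything else stays the same.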

Unfortunately the repository is quite outdated and the packages with the given versions do not work anymore.

Please install these packages (with python==2.7) and then follow the instructions in the UDA readme file to make it work:

pip install tensorflow-gpu==1.15.2
pip install tensor2tensor==1.15.2
pip install tensorflow-probability==0.7.0

The following commands translate the provided sample file in the back_translate directory (google/uda). The pipeline automatically divides paragraphs into sentences, translates the English sentences into French, and then translates them back into English. Go to the back_translate directory and execute:

bash download.sh
bash run.sh
  • download.sh downloads the translation model
  • run.sh performs the back-translation with a certain temperature (default 0.9)

I tried several temperature settings (0.3, 0.5, 0.7, 0.9) on the eval_question_similarity_en.csv table and found that lower temperatures (0.3 or 0.5) work better for our case. With 0.7 and 0.9 we get quite a lot of random translations :D
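This matches how temperature works in sampling-based decoding: the decoder's logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the top token (faithful but repetitive output) while high temperatures flatten the distribution (diverse but noisier output). A self-contained sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn decoder logits into sampling probabilities at a given temperature.

    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it, making random picks more likely.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for four candidate next tokens in the decoder.
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.3, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token probability = {probs[0]:.3f}")
```

At T=0.3 nearly all probability mass sits on the top token, while at T=0.9 the tail tokens get a real chance of being sampled, which is consistent with the "random translations" seen at the higher settings.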

Attached are the results, in case anyone is interested :) This could help us get more variance in our sentences and be less dependent on the exact words that appear in our training set.

eval_question_similarity_back_trans.xlsx

borhenryk avatar Mar 25 '20 10:03 borhenryk