pytorch-openai-transformer-lm

How is the file "cloze_test_test__spring2016 - cloze_test_ALL_test.csv" created?

Open luffycodes opened this issue 6 years ago • 5 comments

The files you get when downloading the dataset from the website have different filenames, none of which matches this particular one. Can you please explain how this file is created, for example by merging the train, test, and val files? Preferably with the filenames of those source files. Thanks!

luffycodes avatar Sep 12 '18 15:09 luffycodes

You need to export the Google Sheets to CSV files (from https://docs.google.com/spreadsheets/d/1FkdPMd7ZEw_Z38AsFSTzgXeiJoLdLyXY_0B_0JIJIbw/edit#gid=81257118 and https://docs.google.com/spreadsheets/d/11tfmMQeifqP-Elh74gi2NELp0rx9JMMjnQ_oyGKqCEg/edit#gid=410941117).
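If you would rather script the export than use File > Download in the browser, here is a rough sketch. It assumes the sheets are publicly readable and relies on the commonly used `/export?format=csv&gid=...` URL form, which is an assumption on my part and not something this repository documents. The helper names `sheet_csv_export_url` and `download_sheet` are made up for illustration.

```python
import re
import urllib.request
from urllib.parse import parse_qs, urlsplit


def sheet_csv_export_url(edit_url: str) -> str:
    """Turn a Google Sheets edit URL into a CSV export URL (assumed URL form)."""
    parts = urlsplit(edit_url)
    match = re.search(r"/spreadsheets/d/([^/]+)", parts.path)
    if match is None:
        raise ValueError(f"not a Google Sheets URL: {edit_url}")
    sheet_id = match.group(1)
    # The tab id lives in the URL fragment, e.g. "#gid=81257118".
    gid = parse_qs(parts.fragment).get("gid", ["0"])[0]
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


def download_sheet(edit_url: str, out_path: str) -> None:
    """Download one sheet tab as CSV; only works if the sheet is public."""
    urllib.request.urlretrieve(sheet_csv_export_url(edit_url), out_path)
```

You would then call `download_sheet(url, filename)` once per sheet, saving each one under the filename the loading code expects.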

artemisart avatar Sep 13 '18 09:09 artemisart

Thanks! So the model is not trained on the entire dataset "ROCStories__spring2016 - ROCStories_spring2016.csv"?

luffycodes avatar Sep 13 '18 15:09 luffycodes

According to the datasets.py file, it's trained on 1497 examples from 'cloze_test_val__spring2016 - cloze_test_ALL_val.csv', validated on 374 examples from the same file, and tested on 'cloze_test_test__spring2016 - cloze_test_ALL_test.csv'.
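To make the split concrete, here is a minimal sketch of carving 1497 training and 374 validation examples (1871 rows in total) out of a single CSV. It is not necessarily how datasets.py does it: the function names, the shuffle, and the seed value are my own assumptions for illustration.

```python
import csv
import random


def load_rows(path):
    """Read every data row of a CSV file, skipping the header line."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return list(reader)


def split_train_valid(rows, n_train=1497, n_valid=374, seed=0):
    """Shuffle rows reproducibly, then take n_train and n_valid disjoint rows.

    The seed is a placeholder, not the value used by the repository.
    """
    if len(rows) < n_train + n_valid:
        raise ValueError("not enough rows for the requested split")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    return train, valid
```

Usage would look like `train, valid = split_train_valid(load_rows('cloze_test_val__spring2016 - cloze_test_ALL_val.csv'))`, with the test file loaded separately and left unsplit.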

artemisart avatar Sep 14 '18 09:09 artemisart

That looks like a catastrophically small dataset for a deep learning model, doesn't it? I have heard that around 1 GB of text data is a good starting point for getting an adequate model. How does this work?

Belerafon avatar Oct 09 '18 20:10 Belerafon

The idea of the OpenAI paper is to take a network that has already been pretrained as a language model on a large text corpus and transfer what it knows about language to another task by fine-tuning it. By doing this, you can obtain really good results even with a small labeled dataset.

rodgzilla avatar Oct 10 '18 07:10 rodgzilla