
Correct way for data split

Open hsujh073 opened this issue 4 years ago • 4 comments

Hi. I want to run some code on the FriendsQA dataset, and I found that in the JSON files episodes 21-22 are the test set and episodes 23 and later are the development set, which differs from what is written in the README.

So which one is the correct way to split the dataset? Thanks.

hsujh073 avatar Aug 21 '20 08:08 hsujh073

@hsujh073 I believe 21-22 should be the development set and 23+ the test set, as written in the paper: https://www.aclweb.org/anthology/2020.acl-main.505.pdf

@FrankLicm could you please verify this and fix the typos if any? Thanks.

jdchoi77 avatar Aug 21 '20 16:08 jdchoi77

Hi, I am sorry. I think I made a mistake when naming the generated split files, so I have actually forgotten which set I used to get the results in the paper. However, the split I originally proposed is indeed that 21-22 should be the development set and 23+ the test set. Also, the split in this repo was regenerated from the full version 1.0 data at upload time, to keep it consistent with version 1.0. Because of an issue with my previous laptop, I lost the original split files from my experiments, in which I had deleted some invalid questions. The development environment for this is also lost, so I am afraid I cannot do any further work on this repo. The only typo here, I think, is that the names of the dev and test files should be exchanged, for both versions 1.0 and 2.0. Thanks.
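
To make the intended split concrete, here is a minimal sketch, not code from this repo. It assumes the episode number is already available as an integer and that every episode before 21 belongs to the training set, which is not stated in this thread. The names `split_for_episode` and `swap_dev_test` are hypothetical, and nothing here relies on the actual JSON layout of the dataset files.

```python
def split_for_episode(episode: int) -> str:
    """Return the intended split name for an episode number."""
    if episode >= 23:
        return "test"   # 23+ is the test set
    if episode >= 21:
        return "dev"    # 21-22 is the development set
    # Assumption: everything before episode 21 is training data.
    return "train"


def swap_dev_test(splits: dict) -> dict:
    """Return a copy of the loaded splits with dev and test exchanged.

    `splits` is a hypothetical mapping such as
    {"train": ..., "dev": ..., "test": ...}, built from the
    files as they are currently named.
    """
    fixed = dict(splits)
    fixed["dev"], fixed["test"] = splits["test"], splits["dev"]
    return fixed
```

Until the files are renamed, loading them as they are and passing the result through something like `swap_dev_test` should give the split described above.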

arianakc avatar Aug 22 '20 08:08 arianakc

OK. Thank you.

hsujh073 avatar Aug 23 '20 09:08 hsujh073

@FrankLicm you still have access to this repo, so please fix the names when you have time. Thanks!

jdchoi77 avatar Aug 24 '20 15:08 jdchoi77