Add GLUE datasets

Open PetrochukM opened this issue 6 years ago • 7 comments

GLUE datasets are standard for evaluating NLU tasks.

In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks.

PetrochukM Apr 27 '18 18:04

Hi, I am a Belgian student in computer engineering, and I am taking an introductory course on open source. One of my goals this semester is to contribute to a project. My master's thesis will be related to NLP, which is why this project interests me. Is there a way I could help fix this issue (or maybe another issue related to this project)?

PattynR Nov 09 '18 21:11

Hi There!

Yeah, please fix this issue! GLUE datasets are a popular suite of datasets for evaluating NLP models. It'd be nice if there was support for those datasets. This issue should be an easy one to get started with.

Recently, I was in Belgium for EMNLP 2018, one of the best NLP conferences in the world.

PetrochukM Nov 10 '18 00:11

Hey, too bad I missed EMNLP! This is my first year working on NLP, and I had never heard of those conferences. I hope I'll be able to go next year. About the issue, could you please confirm that my job is to add a new file, named "glue.py", to the torchnlp/datasets folder? I guess this is what I have to do, but I would prefer to be completely sure!

PattynR Nov 18 '18 10:11

Yeah that'd work!

PetrochukM Nov 18 '18 16:11

Hi, I'm almost done. For the moment it works for all the GLUE datasets except QQP and SNLI. There is an issue with those files that I don't know how to handle: when I load the QQP and SNLI datasets, some lines in the files themselves don't have the right number of fields. Here is an example to illustrate what I mean.

The first line of each downloaded file lists the names of the features (columns) of the TSV file. In the 'train.tsv' file of SNLI, for example, there should be 11 features per line. However, a lot of lines (38,656 in total) contain more than 10 tabs, and so more than 11 features.

For the moment I decided not to add those lines to the Dataset object, but I know this is not what should be done. I've searched the internet for an explanation of those lines, but there is not a lot of documentation about QQP and SNLI.
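
The check described above (compare each line's field count with the header's) can be sketched roughly as follows. This is only an illustration, not the code from the actual contribution: `split_rows` is a hypothetical helper, the sample data is made up, and turning off quote handling with `csv.QUOTE_NONE` is an assumption about how the files should be read, not a confirmed fix for QQP or SNLI.

```python
import csv
import io


def split_rows(lines, expected_fields=None):
    """Separate tab-separated rows into well-formed and malformed ones.

    `lines` is any iterable of text lines; the first line is the header.
    If `expected_fields` is None, the header's field count is used.
    (Hypothetical helper for illustration only.)
    """
    # QUOTE_NONE treats quote characters as ordinary text (an assumption
    # about these files; adjust if the data really uses quoted fields).
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if expected_fields is None:
        expected_fields = len(header)

    good, bad = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if len(row) == expected_fields:
            good.append(dict(zip(header, row)))
        else:
            # Keep the line number so the raw line can be inspected later.
            bad.append((line_number, row))
    return header, good, bad


if __name__ == "__main__":
    # Made-up sample: the second data row contains one tab too many.
    sample = io.StringIO(
        "id\tsentence1\tlabel\n"
        "1\tA man sleeps.\tneutral\n"
        "2\tA dog\truns.\tentailment\n"
    )
    header, good, bad = split_rows(sample)
    print(f"{len(good)} well-formed rows, {len(bad)} malformed rows")
    for line_number, row in bad:
        print(f"line {line_number}: {len(row)} fields instead of {len(header)}")
```

Recording the malformed rows with their line numbers, rather than silently dropping them, makes it easier to look at the raw lines and decide whether they can be repaired. For a real file, opening it with `open(path, newline="", encoding="utf-8")` is the usual way to feed the `csv` module.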

So do you maybe know what I should do? Or should I add my file to the project and create a new issue? Someone who has already worked with those datasets should be able to fix it easily.

Thanks.

PattynR Dec 08 '18 11:12

Thanks for your attempt at contributing this function: https://github.com/PetrochukM/PyTorch-NLP/pull/60 :)

PetrochukM Jul 04 '20 03:07

Hey! I want to give this a try. Is there any way I can still do it? It seems like it may be too late to contribute to this project.

karish-grover Aug 29 '21 13:08