
New BERT example using GoEmotions

Open pmg1991 opened this issue 5 years ago • 7 comments

Can we have a new BERT example using GoEmotions? https://github.com/google-research/google-research/tree/master/goemotions

A Colab or Binder notebook for a multi-label classification example, including training and inference, would be awesome.

pmg1991 avatar Oct 05 '20 12:10 pmg1991

@pmg1991 This is indeed a nice example for BERT. We are trying to add more NLP examples and will prioritize this request.

In the meantime, are you interested in contributing by adding this dataset to DJL?

frankfliu avatar Oct 06 '20 21:10 frankfliu

@frankfliu Sure, I'd like to contribute.

pmg1991 avatar Oct 07 '20 10:10 pmg1991

@pmg1991 I created a CsvDataset in https://github.com/awslabs/djl/pull/208

You should be able to extend CsvDataset and create a GoEmotions dataset; you can use https://github.com/awslabs/djl/blob/master/basicdataset/src/main/java/ai/djl/basicdataset/AmesRandomAccess.java as an example.
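To illustrate the data-loading part, here is a minimal sketch of just the parsing step. The class and method names are hypothetical, and the column layout (text, comma-separated emotion ids, then an id column) is an assumption based on the processed GoEmotions files; the actual DJL wiring should extend CsvDataset as in the PR above and the AmesRandomAccess example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper sketching the parsing step for the processed GoEmotions
 * .tsv files. Assumed column layout: text, comma-separated emotion ids, example id.
 */
public final class GoEmotionsParser {

    /** 27 emotion categories plus neutral in the processed GoEmotions data. */
    public static final int NUM_LABELS = 28;

    private GoEmotionsParser() {}

    /** One parsed example: raw text plus a multi-hot label vector. */
    public static final class Example {
        public final String text;
        public final float[] labels;

        Example(String text, float[] labels) {
            this.text = text;
            this.labels = labels;
        }
    }

    /** Reads train.tsv/dev.tsv/test.tsv and turns each row into an Example. */
    public static List<Example> parse(Path tsvFile) throws IOException {
        List<Example> examples = new ArrayList<>();
        for (String line : Files.readAllLines(tsvFile)) {
            String[] columns = line.split("\t");
            if (columns.length < 2) {
                continue; // skip malformed rows
            }
            // Convert the comma-separated label ids into a multi-hot vector.
            float[] multiHot = new float[NUM_LABELS];
            for (String id : columns[1].split(",")) {
                multiHot[Integer.parseInt(id.trim())] = 1f;
            }
            examples.add(new Example(columns[0], multiHot));
        }
        return examples;
    }
}
```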

frankfliu avatar Oct 14 '20 02:10 frankfliu

Hi @zachgk, I'm interested in this issue and want to work on it, so could you assign it to me? Thanks!

Konata-CG avatar Apr 17 '22 12:04 Konata-CG

Yeah, here you go @Konata-CG

zachgk avatar Apr 17 '22 16:04 zachgk

I found that this dataset contains several raw and processed datasets. They describe the processed datasets as follows: "The data we used for training the models includes examples where there is an agreement between at least 2 raters. Our data includes 43,410 training examples (train.tsv), 5426 dev examples (dev.tsv) and 5427 test examples (test.tsv)." Which datasets should I use, raw or processed?

Konata-CG avatar Apr 19 '22 05:04 Konata-CG

I would recommend the processed data. One of the big problems when working with datasets is that the data is often very noisy. In this example, one source of noise would be that some examples can't be clearly assigned to an emotion. So, the processed data, where they remove the examples that aren't suitable for the task, saves everyone who uses your Dataset class from having to do the same processing themselves.

Then, you would have the three train/validate/test .tsv files correspond to the different DJL dataset Usage values.
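For instance, the mapping could look like this (a minimal sketch; the class and method name are hypothetical, but Dataset.Usage is the DJL enum):

```java
import ai.djl.training.dataset.Dataset;

// Sketch of mapping DJL's Dataset.Usage values to the processed GoEmotions files.
final class GoEmotionsFiles {

    private GoEmotionsFiles() {}

    static String fileName(Dataset.Usage usage) {
        switch (usage) {
            case TRAIN:
                return "train.tsv";
            case VALIDATION:
                return "dev.tsv";
            case TEST:
                return "test.tsv";
            default:
                throw new IllegalArgumentException("Unsupported usage: " + usage);
        }
    }
}
```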

zachgk avatar Apr 20 '22 00:04 zachgk