New BERT example using GoEmotions
Can we have a new BERT example using GoEmotions? https://github.com/google-research/google-research/tree/master/goemotions
A Colab or Binder notebook would be awesome for a multi-label classification example that includes training and inference.
@pmg1991 This is indeed a nice example for BERT. We are trying to add more NLP examples, and we will prioritize this request.
In the meantime, are you interested in contributing by adding this dataset to DJL?
@frankfliu Sure, I'd like to contribute.
@pmg1991 I created a CsvDataset in https://github.com/awslabs/djl/pull/208
You should be able to extend CsvDataset to create a GoEmotions dataset class. You can use https://github.com/awslabs/djl/blob/master/basicdataset/src/main/java/ai/djl/basicdataset/AmesRandomAccess.java as an example.
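As a starting point for the row-handling part of such a class, here is a minimal sketch of turning one row of a processed file into a text plus a multi-hot label vector, which is the shape a multi-label classifier usually needs. It deliberately does not use any DJL API. `GoEmotionsRow` is a hypothetical helper name, and the sketch assumes each row is `text<TAB>comma-separated label ids<TAB>comment id` with 28 labels (27 emotions plus neutral); please check both against the actual files. The sample row in `main` is made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of DJL): parses one row of a processed GoEmotions .tsv file. */
final class GoEmotionsRow {
    /** Assumed label count: 27 emotions plus neutral. Verify against the dataset's label list. */
    static final int NUM_LABELS = 28;

    final String text;
    final float[] labels; // multi-hot vector: labels[i] == 1 if label id i applies

    private GoEmotionsRow(String text, float[] labels) {
        this.text = text;
        this.labels = labels;
    }

    /** Assumes the layout: text TAB comma-separated label ids TAB comment id. */
    static GoEmotionsRow parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 tab-separated columns: " + line);
        }
        float[] labels = new float[NUM_LABELS];
        for (String id : cols[1].split(",")) {
            String trimmed = id.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate an empty label column
            }
            int index = Integer.parseInt(trimmed);
            if (index < 0 || index >= NUM_LABELS) {
                throw new IllegalArgumentException("Label id out of range: " + index);
            }
            labels[index] = 1f;
        }
        return new GoEmotionsRow(cols[0], labels);
    }

    public static void main(String[] args) {
        // Made-up row for illustration only; not taken from the dataset.
        GoEmotionsRow row = parse("Thanks, that made my day!\t15,17\tabc123");
        System.out.println(row.text + " -> " + Arrays.toString(row.labels));
    }
}
```

A multi-hot vector is used rather than a single class index because a comment can carry more than one emotion label.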
Hi @zachgk, I'm interested in this issue and I want to work on it, so I wonder if you can assign it to me? Thanks!
Yeah, here you go @Konata-CG
I found that this dataset contains both raw and processed datasets. They describe the processed datasets as follows: "The data we used for training the models includes examples where there is an agreement between at least 2 raters. Our data includes 43,410 training examples (train.tsv), 5426 dev examples (dev.tsv) and 5427 test examples (test.tsv)." Which should I use, raw or processed?
I would recommend the processed data. One of the big problems when working with datasets is that the data is often very noisy. In this example, one source of noise would be that some examples can't be clearly classified to an emotion. So, the processed one where they remove the data that isn't suitable for the task saves everyone who uses your Dataset class from having to do the same processing themselves.
Then the three .tsv files (train, dev, and test) would correspond to the different DJL dataset Usages.
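One way to picture that correspondence is a small lookup from usage to file name. This is only a sketch: `GoEmotionsFiles` is a hypothetical name, and the nested `Usage` enum is a local stand-in for DJL's `Dataset.Usage` so the snippet compiles without DJL on the classpath. Treating dev.tsv as the validation split is my assumption based on the file names quoted above.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper (not part of DJL): maps a dataset usage to a processed GoEmotions file. */
final class GoEmotionsFiles {
    /** Local stand-in for DJL's Dataset.Usage, so this sketch has no DJL dependency. */
    enum Usage { TRAIN, VALIDATION, TEST }

    private static final Map<Usage, String> FILES = new EnumMap<>(Usage.class);

    static {
        FILES.put(Usage.TRAIN, "train.tsv");
        FILES.put(Usage.VALIDATION, "dev.tsv"); // assumption: dev split serves as validation
        FILES.put(Usage.TEST, "test.tsv");
    }

    private GoEmotionsFiles() {}

    /** Returns the file name for the given usage. */
    static String fileFor(Usage usage) {
        return FILES.get(usage);
    }

    public static void main(String[] args) {
        for (Usage usage : Usage.values()) {
            System.out.println(usage + " -> " + fileFor(usage));
        }
    }
}
```

In a real Dataset class the same lookup would be keyed on the usage passed to the dataset's builder, so callers pick a split without knowing the file names.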