
Example for Custom Dataset Usage in NLP

Open Rasmitha23 opened this issue 2 years ago • 3 comments

🚀 Feature

How can I use a custom NLP dataset to try these algorithms? I only saw an example for a custom CV dataset. The main part I am interested in is the train_transform step for a custom NLP dataset.

Rasmitha23 · Sep 20 '22 05:09

Hi, we will add a demonstration for custom NLP data. Currently, only custom CV datasets are supported.

For now, the easiest way is to write your own custom Dataset for the NLP data and make its __getitem__ function return a dict in this form:

{'idx': idx, 'text': <some raw text>, 'text_s': <some raw text>}
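A minimal sketch of such a dataset, for illustration only. The class name `CustomTextDataset` and its constructor arguments are hypothetical, and the key names simply follow the dict above; check them against the library's own NLP dataset before relying on them. The class is written as a plain map-style dataset (`__len__` plus `__getitem__`), which PyTorch's `DataLoader` accepts; in practice you would likely subclass `torch.utils.data.Dataset`.

```python
class CustomTextDataset:
    """Map-style dataset that yields dicts in the format described above.

    Hypothetical example: the name and arguments are not part of the library.
    """

    def __init__(self, texts, augmented_texts):
        # One augmented (e.g. back-translated) sentence is expected per raw sentence.
        if len(texts) != len(augmented_texts):
            raise ValueError("texts and augmented_texts must have the same length")
        self.texts = list(texts)
        self.augmented_texts = list(augmented_texts)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Key names follow the dict shown in this thread (assumption:
        # the training code looks the fields up by exactly these names).
        return {
            "idx": idx,
            "text": self.texts[idx],
            "text_s": self.augmented_texts[idx],
        }


dataset = CustomTextDataset(
    texts=["the movie was great", "the plot made no sense"],
    augmented_texts=["the film was excellent", "the story did not make sense"],
)
sample = dataset[0]
```

Labeled samples would presumably also need a label field; that is not covered in this thread, so it is left out of the sketch.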

Note that text_s is obtained by back-translation: we use the WMT-19 translation models in fairseq to translate the text into another language and then translate it back.
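The back-translation step can be sketched roughly as below. This is an illustration under assumptions, not the maintainers' script: `back_translate` and `build_text_s` are hypothetical helpers that take any two translation callables, and the toy translators at the bottom are stand-ins so the snippet runs without downloading models. The comment shows one way the fairseq WMT-19 models could be plugged in via `torch.hub`, following the fairseq README; the pivot language (German) is an assumption.

```python
from typing import Callable, List

Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate text into a pivot language and back to the source language."""
    return from_pivot(to_pivot(text))


def build_text_s(texts: List[str], to_pivot: Translator, from_pivot: Translator) -> List[str]:
    """Produce one back-translated sentence per input sentence."""
    return [back_translate(t, to_pivot, from_pivot) for t in texts]


# With fairseq installed, real translators could be obtained like this
# (assumption: English <-> German as the pivot pair; models are large downloads):
#   import torch
#   en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
#                          tokenizer="moses", bpe="fastbpe")
#   de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
#                          tokenizer="moses", bpe="fastbpe")
#   text_s = build_text_s(texts, en2de.translate, de2en.translate)

# Toy stand-in translators, used only so the sketch is runnable offline.
toy_to_pivot = lambda s: s.upper()
toy_from_pivot = lambda s: s.lower()

text_s = build_text_s(["Hello World", "Semi-Supervised"], toy_to_pivot, toy_from_pivot)
```

Computing text_s once offline and storing it next to the raw text keeps the translation models out of the training loop.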

Hhhhhhao · Sep 21 '22 14:09

Hi, thanks for replying.

OK, I can get the data into this format. How do I run an algorithm on data in this format?

Rasmitha23 · Sep 21 '22 14:09

You can use the dataset class we use for NLP (https://github.com/microsoft/Semi-supervised-learning/blob/main/semilearn/datasets/nlp_datasets/datasetbase.py) as a reference for your own dataset.

To run the algorithms on a custom dataset, you can refer to this notebook (https://github.com/microsoft/Semi-supervised-learning/blob/main/notebooks/Custom_Dataset.ipynb). You only need to change the data creation part and set the net argument in the config to one of the NLP models we support. I think everything else would stay the same.

Let me know if you have further questions.

Hhhhhhao · Sep 21 '22 14:09