pytorch-seq2seq

torchtext recent version (0.12.0) doesn't support Field, BucketIterator

Open manik2304 opened this issue 2 years ago • 1 comments

The recent torchtext release, 0.12.0, doesn't support Field, BucketIterator, etc. What are the equivalent modules for pre-processing datasets like Multi30k, IWSLT2016, IWSLT2017, etc.? Thanks.

manik2304 avatar Apr 23 '22 12:04 manik2304

Using torchtext 0.11 solves the problem: conda install pytorch torchtext=0.11 cudatoolkit=11.3 -c pytorch

johnnyhwu avatar May 01 '22 04:05 johnnyhwu

torchtext >= 0.12 removed Field and the legacy modules. You can try this:

from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

from collections import Counter
from torchtext.datasets import Multi30k
from torchtext.vocab import vocab
from torchtext.data import get_tokenizer
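
As a rough sketch of how a Field/BucketIterator pipeline can be replaced with a vocabulary, a collate function, and a plain DataLoader: the example below is deliberately library-light and makes several assumptions. The toy `pairs` list stands in for Multi30k, the whitespace `tokenize` stands in for `get_tokenizer`, and a plain dict built from `Counter` stands in for `torchtext.vocab.vocab`. The helper names (`build_vocab`, `encode`, `collate`) are hypothetical, not part of torchtext.

```python
from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy (German, English) pairs; an assumption standing in for Multi30k.
pairs = [
    ("ein mann läuft", "a man runs"),
    ("eine frau liest ein buch", "a woman reads a book"),
    ("ein hund schläft", "a dog sleeps"),
]

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]


def tokenize(text):
    # Whitespace tokenizer; a stand-in for torchtext's get_tokenizer.
    return text.lower().split()


def build_vocab(sentences, min_freq=1):
    # Count tokens, then map specials first and remaining tokens alphabetically.
    counter = Counter(tok for s in sentences for tok in tokenize(s))
    itos = SPECIALS + sorted(t for t, c in counter.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}


src_vocab = build_vocab(src for src, _ in pairs)
trg_vocab = build_vocab(trg for _, trg in pairs)


def encode(text, vocab):
    # Unknown tokens fall back to <unk>; <bos>/<eos> wrap the sentence.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    return torch.tensor([vocab["<bos>"]] + ids + [vocab["<eos>"]], dtype=torch.long)


def collate(batch):
    # Numericalize each pair, then pad to the longest sentence in the batch.
    src = [encode(s, src_vocab) for s, _ in batch]
    trg = [encode(t, trg_vocab) for _, t in batch]
    # pad_sequence returns [max_len, batch_size] by default.
    src = pad_sequence(src, padding_value=src_vocab["<pad>"])
    trg = pad_sequence(trg, padding_value=trg_vocab["<pad>"])
    return src, trg


loader = DataLoader(pairs, batch_size=2, shuffle=False, collate_fn=collate)

for src, trg in loader:
    print(src.shape, trg.shape)
```

Note that a plain DataLoader does not group sentences of similar length the way BucketIterator did; it only pads each batch to its own longest sentence.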

Jiazxu avatar Apr 01 '23 12:04 Jiazxu

@Jiazxu What should I do for a custom dataset stored as a CSV file? How do I load it and then perform a train/validation split?

saqib-sarwar avatar Apr 01 '23 19:04 saqib-sarwar

It can be done with the pandas library. First, wrap the .csv file in a torch.utils.data.Dataset class. The code looks like this (details depend on your data content):

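import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class DataSet_csv(Dataset):
    """Wraps a numeric .csv file as a torch Dataset."""

    def __init__(self, data_dir):
        # data_dir is the path to the .csv file; all columns must be numeric.
        data = pd.read_csv(data_dir)
        self.data = torch.tensor(data.values, dtype=torch.float32)
        # Placeholder: the label is just a copy of the data here.
        # Replace this with your real label column(s).
        self.label = self.data.clone()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.label[idx]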

Then you can put the DataSet_csv into the DataLoader.
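
For the train/validation split, one option is `torch.utils.data.random_split`. The sketch below is only an illustration under assumptions: the class name `CsvDataset`, the column names `f1`, `f2`, and `label`, the 80/20 ratio, and the in-memory CSV text (used in place of a real file path) are all hypothetical and should be adapted to your data.

```python
import io

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset, random_split


class CsvDataset(Dataset):
    """Hypothetical dataset: numeric feature columns plus one label column."""

    def __init__(self, csv_source, label_column="label"):
        frame = pd.read_csv(csv_source)
        features = frame.drop(columns=[label_column])
        self.x = torch.tensor(features.values, dtype=torch.float32)
        self.y = torch.tensor(frame[label_column].values, dtype=torch.long)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# Toy in-memory CSV standing in for a real file path (assumption).
csv_text = "f1,f2,label\n" + "\n".join(f"{i},{i * 2},{i % 2}" for i in range(10))
dataset = CsvDataset(io.StringIO(csv_text))

# 80/20 split; the seeded generator makes the split reproducible.
n_val = len(dataset) // 5
n_train = len(dataset) - n_val
train_set, val_set = random_split(
    dataset, [n_train, n_val], generator=torch.Generator().manual_seed(0)
)

# Shuffle only the training data; keep validation order fixed.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = DataLoader(val_set, batch_size=4, shuffle=False)
```

With a real file, you would pass its path to `CsvDataset` instead of the `io.StringIO` object, since `pd.read_csv` accepts both.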

Jiazxu avatar Apr 02 '23 14:04 Jiazxu