
Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at ...> can not be used with use_vocab=False because we do not know how to numericalize it.

Open MSiba opened this issue 3 years ago • 3 comments

❓ Questions and Help

Description

I am trying to implement a sequence (multi-output) regression task using torchtext, but I am getting the error in the title.

torch version: 1.10.1
torchtext version: 0.11.1

Here's how I proceed:

Given: sequential data (my own) of the form:

   text    label
    'w1'    '[0.1, 0.3, 0.1]' 
    'w2'    '[0.74, 0.4, 0.65]'  
    'w3'    '[0.21, 0.56, 0.23]' 
<empty line denoting the beginning of a new sentence>
    ...       ...
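
For readers following along, here is a minimal sketch of reading a file in this layout without torchtext. It rests on assumptions the post does not confirm: that the two columns are tab-separated, that the quotes shown above are not part of the file, and that each label is a bracketed, comma-separated list. The function name `read_sentences` is hypothetical.

```python
import ast


def read_sentences(lines, sep="\t"):
    """Group 'token<sep>label' lines into sentences; a blank line ends a sentence."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.split(sep, 1)
        tokens.append(token)
        # Assumption: the label is a Python-style list literal such as "[0.1, 0.3, 0.1]".
        labels.append([float(v) for v in ast.literal_eval(label)])
    if tokens:
        sentences.append((tokens, labels))
    return sentences


sample = [
    "w1\t[0.1, 0.3, 0.1]",
    "w2\t[0.74, 0.4, 0.65]",
    "",
    "w3\t[0.21, 0.56, 0.23]",
]
sentences = read_sentences(sample)
print(len(sentences))  # 2
```

Each sentence comes back as a pair `(tokens, labels)`, where `labels` holds one list of floats per token.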

I define TorchText Fields to read this data (this works perfectly):

import torch
import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets

# use torchtext.vocab and, later on, numericalization based on pre-trained vectors
TEXT = data.Field(use_vocab=True, lower=True)

LABEL = data.Field(
    is_target=True,
    use_vocab=False,  # I don't think I need a vocab for my task, because the output is a list of doubles
    unk_token=None,
    # This Pipeline transforms labels from string(list(doubles)) to torch.Tensor(doubles).
    # removeBracets is a helper that is not shown in this post.
    preprocessing=data.Pipeline(
        lambda x: torch.tensor(list(map(float, removeBracets(x).split(' '))),
                               dtype=torch.double)),
    dtype=torch.DoubleTensor)  # the label is a tensor of doubles

fields = [("text", TEXT), ("label", LABEL)]

Since I have sequential data, I used datasets.SequenceTaggingDataset to split the data into training, validation and testing sets.

train, valid, test = datasets.SequenceTaggingDataset.splits(
    path='./data/',
    train=train_path,
    validation=validate_path,
    test=test_path,
    fields=fields)

Then, I use a pre-trained embedding to build the vocab for the TEXT Field, e.g.

TEXT.build_vocab(train, vectors="glove.840B.300d")

After that, I use BucketIterator to create batches of the training data efficiently.

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train, valid),
    device=DEVICE,
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    repeat=False,
    sort=True)  # for validation/testing, better set it to False

Everything works perfectly up to this point. However, when I try to iterate over train_iterator,

batch = next(iter(train_iterator))
print("text", batch.text)
print("label", batch.label)

I get the following error:

    229         """
    230         padded = self.pad(batch)
--> 231         tensor = self.numericalize(padded, device=device)
    232         return tensor
    233 

PATH_TO\torchtext\legacy\data\field.py in numericalize(self, arr, device)
    340                     "use_vocab=False because we do not know how to numericalize it. "
    341                     "Please raise an issue at "
--> 342                     "https://github.com/pytorch/text/issues".format(self.dtype))
    343             numericalization_func = self.dtypes[self.dtype]
    344             # It doesn't make sense to explicitly coerce to a numeric type if

ValueError: Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at 0x0XXXXXXXX> can not be used with use_vocab=False because we do not know how to numericalize it. Please raise an issue at https://github.com/pytorch/text/issues

I looked into issue #609. Unlike in that issue, I need to find a numericalization for the labels, which are of the form list(torch.DoubleTensor). Do you have any suggestions?
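
One thing that may be worth ruling out, offered as a guess rather than a confirmed diagnosis: the snippet above passes `dtype=torch.DoubleTensor`, which is a tensor type, whereas `torch.double` is the corresponding `torch.dtype`. The traceback shows the field looking up `self.dtype` in a table (`self.dtypes`), so a value that is not a key of that table would plausibly end in this error. The short check below only illustrates the difference between the two objects and does not touch torchtext.

```python
import torch

# torch.double is a dtype object (an alias of torch.float64).
print(isinstance(torch.double, torch.dtype))        # True
print(torch.double == torch.float64)                # True

# torch.DoubleTensor is a tensor type, not a dtype.
print(isinstance(torch.DoubleTensor, torch.dtype))  # False
```

The error message also prints a Pipeline object as the dtype, so printing `LABEL.dtype` and `LABEL.preprocessing` right after constructing the field may show whether each argument ended up where it was intended.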

MSiba, Feb 04 '22 16:02

Hello, I also encountered a similar problem, have you solved this problem?

zouziwei1998, Apr 17 '22 07:04

Hi @zouziwei1998, I re-implemented it in plain PyTorch, including all the steps that torchtext usually handles. In particular, I used DataLoader to iterate through the training data. It worked!

MSiba, Apr 17 '22 08:04

Thank you for your reply! I've actually been stuck on this problem for a few days. Because I'm not familiar with PyTorch, I didn't use the Dataset and DataLoader approach at first. Could I take a look at how your code solves this problem? I would like to learn how to handle this situation. Thank you very much.

zouziwei1998, Apr 17 '22 08:04
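
For readers with the same question: MSiba's code was not posted in this thread, so the following is only a minimal sketch of what a plain PyTorch Dataset/DataLoader version could look like, not the actual solution. The class name `SequenceRegressionDataset`, the `collate` function, the toy `vocab`, and the sample sentences are all hypothetical. It also assumes the labels have already been parsed into lists of floats, and it leaves out the pre-trained embeddings.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class SequenceRegressionDataset(Dataset):
    """Hypothetical dataset: one item per sentence."""

    def __init__(self, sentences, vocab):
        # sentences: list of (tokens, labels); labels holds one list of floats per token
        self.sentences = sentences
        # vocab: token -> index; index 0 is reserved for padding, 1 for unknown tokens
        self.vocab = vocab

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens, labels = self.sentences[idx]
        ids = torch.tensor([self.vocab.get(t.lower(), 1) for t in tokens],
                           dtype=torch.long)
        targets = torch.tensor(labels, dtype=torch.double)  # (seq_len, n_outputs)
        return ids, targets


def collate(batch):
    """Pad every sentence in the batch to the length of the longest one."""
    ids, targets = zip(*batch)
    lengths = torch.tensor([len(x) for x in ids])
    padded_ids = pad_sequence(list(ids), batch_first=True, padding_value=0)
    padded_targets = pad_sequence(list(targets), batch_first=True, padding_value=0.0)
    return padded_ids, padded_targets, lengths


sentences = [
    (["w1", "w2", "w3"], [[0.1, 0.3, 0.1], [0.74, 0.4, 0.65], [0.21, 0.56, 0.23]]),
    (["w4", "w5"], [[0.5, 0.5, 0.5], [0.2, 0.1, 0.9]]),
]
vocab = {"<pad>": 0, "<unk>": 1, "w1": 2, "w2": 3, "w3": 4, "w4": 5, "w5": 6}

loader = DataLoader(SequenceRegressionDataset(sentences, vocab),
                    batch_size=2, shuffle=False, collate_fn=collate)
text, label, lengths = next(iter(loader))
print(text.shape, label.shape, lengths)
```

Because the targets are padded with zeros, the returned `lengths` would be needed to mask the padded positions when computing a regression loss.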