
fix_length in data.Field does not truncate the sequence

Open JTWang2000 opened this issue 3 years ago • 2 comments

I am trying to use Field and TabularDataset to process text sequence input. To make the input a fixed length, I pass fix_length=MAX_SEQ_LEN to the Field, as in the code below.

import torch
from torchtext.data import Field, TabularDataset, BucketIterator, Iterator  # torchtext.legacy.data on torchtext 0.9.x

MAX_SEQ_LEN = 128

# Fields (tokenizer is a pretrained subword tokenizer defined earlier)
label_field = Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)
text_field = Field(use_vocab=False, tokenize=tokenizer.encode, lower=False,
                   include_lengths=False, batch_first=True, fix_length=MAX_SEQ_LEN)
fields = [('label', label_field), ('text', text_field)]

# TabularDataset
train, valid, test = TabularDataset.splits(path=path, train='train.csv', validation='valid.csv',
                                           test='test.csv', format='CSV', fields=fields, skip_header=True)

# Iterators
train_iter = BucketIterator(train, batch_size=16, sort_key=lambda x: len(x.text),
                            device=device, train=True, sort=True, sort_within_batch=True)
valid_iter = BucketIterator(valid, batch_size=16, sort_key=lambda x: len(x.text),
                            device=device, train=True, sort=True, sort_within_batch=True)
test_iter = Iterator(test, batch_size=16, device=device, train=False, shuffle=False, sort=False)
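
For context, here is roughly what I expect the iterators to produce (just a sketch; shapes assume batch_size=16 and fix_length=128 as above):

# With batch_first=True and fix_length=MAX_SEQ_LEN, I expect every batch.text
# to already be padded/truncated to exactly 128 token ids.
batch = next(iter(train_iter))
print(batch.text.shape)   # expecting torch.Size([16, 128])
print(batch.label.shape)  # expecting torch.Size([16])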

However, when I run the code, there is a warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (262 > 256). Running this sequence through the model will result in indexing errors

I checked that fix_length should truncate (or pad) the data input to the given length; however, it does not seem to work on my side.
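
For reference, this is how I understand fix_length to behave in the legacy Field API (a minimal standalone sketch, not my actual code; as far as I can tell, padding/truncation only happens when Field.pad() is called at batching time, after tokenize has already run):

MAX_LEN = 5
f = Field(use_vocab=False, batch_first=True, fix_length=MAX_LEN, pad_token=0)
minibatch = [[1, 2, 3, 4, 5, 6, 7, 8], [1, 2]]
print(f.pad(minibatch))
# expecting [[1, 2, 3, 4, 5], [1, 2, 0, 0, 0]]  -> long example truncated, short one padded

So I am not sure whether the tokenizer warning above is raised before this padding step even has a chance to apply.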

It happens both locally and on Google Colab:

Local: Mac M1, Big Sur 11.3.1, PyTorch 1.8.0, torchtext 0.6.0, Python 3.8.10

Google Colab: PyTorch 1.8.1+cu101, torchtext 0.9.1

It really bothers me!!! Looking forward to the solution! Thanks!!!

JTWang2000 avatar Jun 08 '21 03:06 JTWang2000

Hey @parmeet, can I try to fix this issue?

TejasKarkera10 avatar Aug 19 '21 07:08 TejasKarkera10

Is this even valid today? Hasn't include_lengths been removed from torchtext entirely?

Robokishan avatar Mar 15 '23 18:03 Robokishan