Data collator or tokenizer.pad has a bug when adding new features to the data
System Info
transformers 4.26.1, mac m1, python 3.9.13
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the
examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Register the extra feature so that tokenizer.pad tries to batch it
if 'new_feature' not in tokenizer.model_input_names:
    tokenizer.model_input_names.append('new_feature')
samples = [
    {'input_ids': torch.arange(3), 'new_feature': torch.arange(8)},
    {'input_ids': torch.arange(5), 'new_feature': torch.arange(11)},
]
batch = tokenizer.pad(samples)  # raises the ValueError below
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True'
'truncation=True' to have batched tensors with the same length. Perhaps your features (new_feature in this case)
have excessive nesting (inputs type list where type int is expected).
Expected behavior
No error; batch['input_ids'].shape == (2, 5) and batch['new_feature'].shape == (2, 11).
Hi @lanlanlan3 - thanks for opening this issue.
The reason this error is being thrown is that "new_feature" won't be padded, so the tensors can't be concatenated to create a batch. This can be seen if the inputs are passed as lists and the return type is not specified:
>>> samples = [
... {'input_ids': list(range(8)), 'new_feature': list(range(3))},
... {'input_ids': list(range(11)), 'new_feature': list(range(5))},
... ]
>>> batch = tokenizer.pad(samples, max_length=12, padding='max_length', return_tensors=None)
>>> batch
{'input_ids': [[0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0]], 'new_feature': [[0, 1, 2], [0, 1, 2, 3, 4]]}
This occurs for a few reasons:
- All input features are expected to be padded by the same amount per sample. The amount of padding needed is calculated based on the padding strategy and the length of the input_ids. For example, if padding='max_length', then for sample 0 the padding to be added is calculated as max_length - sample_length = 12 - 8 = 4 for all features (input_ids and new_feature). However, new_feature isn't padded at all because of the next point.
- The padding behaviour for "new_feature" is undefined, i.e. what should the sequence be padded with? You can see how this is controlled in the padding internals here, and in the sketch after this list.
This behaviour from the tokenizer is expected.
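For illustration, here is a heavily condensed sketch of the right-padding branch of PreTrainedTokenizerBase._pad. The key names and padding values follow the transformers internals, but the function itself is a simplification, not the library's code:

def _pad_sketch(tokenizer, encoded_inputs: dict, max_length: int) -> dict:
    # The amount of padding is derived from the main input (usually 'input_ids')
    required_input = encoded_inputs[tokenizer.model_input_names[0]]
    difference = max_length - len(required_input)

    # Only keys the tokenizer knows about have a defined padding value
    if 'attention_mask' in encoded_inputs:
        encoded_inputs['attention_mask'] += [0] * difference
    if 'token_type_ids' in encoded_inputs:
        encoded_inputs['token_type_ids'] += [tokenizer.pad_token_type_id] * difference
    if 'special_tokens_mask' in encoded_inputs:
        encoded_inputs['special_tokens_mask'] += [1] * difference
    encoded_inputs[tokenizer.model_input_names[0]] = required_input + [tokenizer.pad_token_id] * difference

    # There is no branch for 'new_feature', so it is returned at its original length
    return encoded_inputs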
Note: model_input_names defines the expected inputs to the model during the forward pass, so changing it means the tokenizer outputs are no longer in the expected format for a model in the transformers library. To modify it, pass it when creating the tokenizer rather than mutating the attribute directly:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_input_names=["input_ids", "new_feature"])
If the outputs of the tokenizer are being passed to a custom model that ingests input_ids and new_feature, and the model expects them to be of different length, then I would suggest defining your own tokenizer class which subclasses PreTrainedTokenizer or BertTokenizer; or a custom data collator which performs the expected padding behaviour.
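As an illustration (the collator class and the choice to pad new_feature with zeros are assumptions for this example, not transformers APIs), a minimal collator that pads each feature independently to its own per-batch maximum could look like this, reusing the tokenizer and samples from the reproduction above:

import torch
from typing import Dict, List

class PadEachFeatureCollator:
    # Illustrative only: pads every listed feature to the longest length in the batch
    def __init__(self, pad_values: Dict[str, int]):
        self.pad_values = pad_values  # e.g. {'input_ids': tokenizer.pad_token_id, 'new_feature': 0}

    def __call__(self, samples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        batch = {}
        for key, pad_value in self.pad_values.items():
            max_len = max(len(s[key]) for s in samples)
            # Right-pad each 1-D tensor to max_len, then stack into a (batch, max_len) tensor
            batch[key] = torch.stack([
                torch.nn.functional.pad(s[key], (0, max_len - len(s[key])), value=pad_value)
                for s in samples
            ])
        return batch

collator = PadEachFeatureCollator({'input_ids': tokenizer.pad_token_id, 'new_feature': 0})
batch = collator(samples)
assert batch['input_ids'].shape == (2, 5)
assert batch['new_feature'].shape == (2, 11)

This produces the shapes requested in the "Expected behavior" section, with each feature padded to its own longest length.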
