Data collator or tokenizer.pad has a bug when adding new features to the data
System Info
transformers 4.26.1, mac m1, python 3.9.13
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the
examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Register the extra feature so that tokenizer.pad tries to batch it
if 'new_feature' not in tokenizer.model_input_names:
    tokenizer.model_input_names.append('new_feature')
samples = [
    {'input_ids': torch.arange(3), 'new_feature': torch.arange(8)},
    {'input_ids': torch.arange(5), 'new_feature': torch.arange(11)},
]
batch = tokenizer.pad(samples)  # raises the ValueError below
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True'
'truncation=True' to have batched tensors with the same length. Perhaps your features (new_feature in this case)
have excessive nesting (inputs type list where type int is expected).
Expected behavior
No error; batch['input_ids'].shape == (2, 5) and batch['new_feature'].shape == (2, 11).
Hi @lanlanlan3 - thanks for opening this issue.
The reason this error is being thrown is that "new_feature" won't be padded, so the tensors can't be concatenated to create a batch. This can be seen if the inputs are passed as lists and the return type is not specified:
>>> samples = [
... {'input_ids': list(range(8)), 'new_feature': list(range(3))},
... {'input_ids': list(range(11)), 'new_feature': list(range(5))},
... ]
>>> batch = tokenizer.pad(samples, max_length=12, padding='max_length', return_tensors=None)
>>> batch
{'input_ids': [[0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0]], 'new_feature': [[0, 1, 2], [0, 1, 2, 3, 4]]}
This occurs for a few reasons:
- All input features are expected to be padded by the same amount per sample. The amount of padding needed is calculated based on the padding strategy and the length of the input_ids. For example, if padding='max_length', then for sample 0 the padding to be added is calculated as max_length - sample_length = 12 - 8 = 4 for all features (input_ids and new_feature). However, new_feature isn't padded at all because of the next point.
- The padding behaviour for "new_feature" is undefined, i.e. what should the sequence be padded with? You can see how this is controlled in the padding internals here, and in the sketch after this list.
This behaviour from the tokenizer is expected.
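For illustration, here is a heavily condensed sketch of the right-padding branch of PreTrainedTokenizerBase._pad. The key names and padding values follow the transformers internals, but the function itself is a simplification, not the library's code:

def _pad_sketch(tokenizer, encoded_inputs: dict, max_length: int) -> dict:
    # The amount of padding is derived from the main input (usually 'input_ids')
    required_input = encoded_inputs[tokenizer.model_input_names[0]]
    difference = max_length - len(required_input)

    # Only keys the tokenizer knows about have a defined padding value
    if 'attention_mask' in encoded_inputs:
        encoded_inputs['attention_mask'] += [0] * difference
    if 'token_type_ids' in encoded_inputs:
        encoded_inputs['token_type_ids'] += [tokenizer.pad_token_type_id] * difference
    if 'special_tokens_mask' in encoded_inputs:
        encoded_inputs['special_tokens_mask'] += [1] * difference
    encoded_inputs[tokenizer.model_input_names[0]] = required_input + [tokenizer.pad_token_id] * difference

    # There is no branch for 'new_feature', so it is returned at its original length
    return encoded_inputs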
Note: model_input_names defines the expected inputs to the model during the forward pass, so changing it means the tokenizer outputs are no longer in the expected format for a model in the transformers library. To modify it, pass it when creating the tokenizer rather than mutating the attribute directly:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_input_names=["input_ids", "new_feature"])
If the outputs of the tokenizer are being passed to a custom model that ingests input_ids and new_feature, and the model expects them to be of different length, then I would suggest defining your own tokenizer class which subclasses PreTrainedTokenizer or BertTokenizer; or a custom data collator which performs the expected padding behaviour.
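As an illustration (the collator class and the choice to pad new_feature with zeros are assumptions for this example, not transformers APIs), a minimal collator that pads each feature independently to its own per-batch maximum could look like this, reusing the tokenizer and samples from the reproduction above:

import torch
from typing import Dict, List

class PadEachFeatureCollator:
    # Illustrative only: pads every listed feature to the longest length in the batch
    def __init__(self, pad_values: Dict[str, int]):
        self.pad_values = pad_values  # e.g. {'input_ids': tokenizer.pad_token_id, 'new_feature': 0}

    def __call__(self, samples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        batch = {}
        for key, pad_value in self.pad_values.items():
            max_len = max(len(s[key]) for s in samples)
            # Right-pad each 1-D tensor to max_len, then stack into a (batch, max_len) tensor
            batch[key] = torch.stack([
                torch.nn.functional.pad(s[key], (0, max_len - len(s[key])), value=pad_value)
                for s in samples
            ])
        return batch

collator = PadEachFeatureCollator({'input_ids': tokenizer.pad_token_id, 'new_feature': 0})
batch = collator(samples)
assert batch['input_ids'].shape == (2, 5)
assert batch['new_feature'].shape == (2, 11)

This produces the shapes requested in the "Expected behavior" section, with each feature padded to its own longest length.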
