DataCollatorWithFlattening only accepts labels as list
System Info
- transformers version: 4.55.4
- Platform: Linux-5.14.0-284.73.1.el9_2.x86_64-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 0.36.0
- Safetensors version: 0.5.2
- Accelerate version: 1.12.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?: no
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
@ArthurZucker @Cyrilvallez
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
# test.py
import torch
from transformers import DataCollatorWithFlattening


def repro_tensor_labels():
    # Labels passed as torch.Tensor: this is the failing case.
    features = [
        {
            "input_ids": torch.tensor([1, 2, 3, 4]),
            "labels": torch.tensor([10, 11, 12, 13]),
        },
        {
            "input_ids": torch.tensor([5, 6, 7]),
            "labels": torch.tensor([14, 15, 16]),
        },
    ]
    collator = DataCollatorWithFlattening(return_tensors="pt")
    return collator(features)


def repro_list_labels():
    # Labels passed as plain Python lists: this works.
    features = [
        {
            "input_ids": torch.tensor([1, 2, 3, 4]),
            "labels": [10, 11, 12, 13],
        },
        {
            "input_ids": torch.tensor([5, 6, 7]),
            "labels": [14, 15, 16],
        },
    ]
    collator = DataCollatorWithFlattening(return_tensors="pt")
    return collator(features)


print("labels as tensors")
try:
    batch = repro_tensor_labels()
    print("batch:", batch)
except Exception as e:
    print("got error:", repr(e))

print("\nlabels as list")
batch = repro_list_labels()
for k, v in batch.items():
    print(k, v)
python test.py
This gives
labels as tensors
got error: TypeError('can only concatenate list (not "Tensor") to list')
labels as list
input_ids tensor([[1, 2, 3, 4, 5, 6, 7]])
labels tensor([[-100, 11, 12, 13, -100, 15, 16]])
position_ids tensor([[0, 1, 2, 3, 0, 1, 2]])
Expected behavior
The current implementation of DataCollatorWithFlattening only works with labels provided as lists.
Can we support labels as torch.Tensor?
I can create a PR for the fix.
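In the meantime, a possible workaround is to convert tensor-valued fields to plain Python lists before calling the collator. The error message above suggests the labels are concatenated onto a Python list, so list inputs should avoid it. The helper below is a minimal sketch: `to_python_lists` is a hypothetical name, not part of transformers, and it assumes every tensor-like value exposes a `tolist()` method (as `torch.Tensor` does).

```python
def to_python_lists(features):
    """Return a copy of `features` where tensor-like values become plain lists.

    Hypothetical helper, not part of transformers. Any value with a
    `tolist()` method (e.g. torch.Tensor) is converted; everything else
    is passed through unchanged. The input features are not modified.
    """
    converted = []
    for feature in features:
        converted.append(
            {
                key: value.tolist() if hasattr(value, "tolist") else value
                for key, value in feature.items()
            }
        )
    return converted
```

With this in place, the failing call in the reproduction would become `collator(to_python_lists(features))`. I have only reasoned about this against the behavior shown above, so treat it as a stopgap rather than a fix.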
cc @vasqu here, I don't remember, is this expected to fail or not? Do we only accept list inputs?
Kind of expected, since we don't check for a particular datatype beforehand and assume "primitive" Python types. It doesn't hurt to support those cases though; it's a nice QoL improvement (I see #42620 is already adding this, cc @Rocketknight1).
That was my logic too - I wasn't sure if it was supposed to work with tensors or not, but if we can easily support it without breaking other code then we probably should.