
DataCollatorWithFlattening only accepts labels as list

Open romitjain opened this issue 3 weeks ago • 3 comments

System Info

  • transformers version: 4.55.4
  • Platform: Linux-5.14.0-284.73.1.el9_2.x86_64-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.36.0
  • Safetensors version: 0.5.2
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?: no
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

# test.py
import torch
from transformers import DataCollatorWithFlattening

def repro_tensor_labels():
    features = [
        {
            "input_ids": torch.tensor([1, 2, 3, 4]),
            "labels": torch.tensor([10, 11, 12, 13]),
        },
        {
            "input_ids": torch.tensor([5, 6, 7]),
            "labels": torch.tensor([14, 15, 16]),
        },
    ]
    collator = DataCollatorWithFlattening(return_tensors="pt")
    return collator(features)

def repro_list_labels():
    features = [
        {
            "input_ids": torch.tensor([1, 2, 3, 4]),
            "labels": [10, 11, 12, 13],
        },
        {
            "input_ids": torch.tensor([5, 6, 7]),
            "labels": [14, 15, 16],
        },
    ]
    collator = DataCollatorWithFlattening(return_tensors="pt")
    return collator(features)

print("labels as tensors")
try:
    batch = repro_tensor_labels()
    print("batch:", batch)
except Exception as e:
    print("got error:", repr(e))

print("\nlabels as list")
batch = repro_list_labels()
for k, v in batch.items():
    print(k, v)
Running the script with python test.py gives:

labels as tensors
got error: TypeError('can only concatenate list (not "Tensor") to list')

labels as list
input_ids tensor([[1, 2, 3, 4, 5, 6, 7]])
labels tensor([[-100,   11,   12,   13, -100,   15,   16]])
position_ids tensor([[0, 1, 2, 3, 0, 1, 2]])
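For context, the working output above can be reproduced with a minimal pure-Python sketch of what the flattening collator does (this is an illustration, not the actual transformers implementation): sequences are concatenated, position ids restart at 0 for each sequence, and the first label of every packed sequence is replaced with -100 so the loss never trains the model to predict across a sequence boundary.

```python
def flatten_features(features, ignore_index=-100):
    """Sketch of DataCollatorWithFlattening's packing behavior for list inputs."""
    input_ids, labels, position_ids = [], [], []
    for f in features:
        input_ids += f["input_ids"]
        # Mask the first label of each packed sequence with the ignore index.
        labels += [ignore_index] + f["labels"][1:]
        # Position ids restart at 0 for every sequence in the pack.
        position_ids += list(range(len(f["input_ids"])))
    return {"input_ids": input_ids, "labels": labels, "position_ids": position_ids}

features = [
    {"input_ids": [1, 2, 3, 4], "labels": [10, 11, 12, 13]},
    {"input_ids": [5, 6, 7], "labels": [14, 15, 16]},
]
batch = flatten_features(features)
# batch["labels"] is [-100, 11, 12, 13, -100, 15, 16], matching the output above
```

Note that `labels += [ignore_index] + f["labels"][1:]` is exactly the kind of list concatenation that raises the TypeError when `f["labels"]` is a torch.Tensor instead of a list.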

Expected behavior

The current implementation of DataCollatorWithFlattening only works when labels are provided as Python lists. Could labels passed as torch.Tensor also be supported?
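Until tensor labels are supported, one possible workaround is to convert tensor-valued fields back to plain Python lists before calling the collator. `to_list_features` below is a hypothetical helper of our own, not a transformers API; it duck-types on `.tolist()`, so it handles torch tensors and numpy arrays alike.

```python
def to_list_features(features):
    """Convert any value exposing .tolist() (e.g. torch.Tensor) to a plain list."""
    return [
        {k: v.tolist() if hasattr(v, "tolist") else v for k, v in f.items()}
        for f in features
    ]

# Usage with the repro data above would then be:
#   collator = DataCollatorWithFlattening(return_tensors="pt")
#   batch = collator(to_list_features(features))
```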

I can create a PR for the fix.

romitjain avatar Dec 01 '25 07:12 romitjain

cc @vasqu here. I don't remember: is this expected to fail or not? Do we only accept list inputs?

Cyrilvallez avatar Dec 08 '25 10:12 Cyrilvallez

Kinda expected, as we do not assume a certain datatype beforehand, i.e. "primitive" Python types. It also does not hurt to support those cases though; it's a nice QoL improvement (#42620 is already adding this, cc @Rocketknight1).

vasqu avatar Dec 08 '25 13:12 vasqu

That was my logic too - I wasn't sure if it was supposed to work with tensors or not, but if we can easily support it without breaking other code then we probably should.

Rocketknight1 avatar Dec 08 '25 16:12 Rocketknight1