
Batched mapping does not raise an error if values for an existing column are empty

Open felix-schneider opened this issue 9 months ago • 0 comments

Describe the bug

Using Dataset.map(fn, batched=True) allows resizing the dataset by returning a dict of lists, all of which must be the same size. If they are not the same size, an error like pyarrow.lib.ArrowInvalid: Column 1 named x expected length 1 but got length 0 is raised.

No error is raised, however, if the function returns an empty list for a column that already exists in the dataset. In that case, the dataset is silently resized to 0 rows.

Steps to reproduce the bug

MWE:

import datasets
data = datasets.Dataset.from_dict({"test": [1]})

def mapping_fn(examples):
    return {"test": [], "y": [1]}

data = data.map(mapping_fn, batched=True)
print(len(data))  # no exception is raised; the dataset is silently resized to 0 rows

Note that the error is raised correctly when returning "x": [] instead, and also when returning "test": [1,2].

Expected behavior

Expected an exception: pyarrow.lib.ArrowInvalid: Column 1 named test expected length 1 but got length 0 or pyarrow.lib.ArrowInvalid: Column 2 named y expected length 0 but got length 1.

Any exception would be acceptable.
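
Until this is fixed, a possible workaround is to check the column lengths yourself before the batch reaches Dataset.map. The sketch below is an assumption, not part of datasets: check_equal_lengths is a hypothetical helper name, and raising ValueError is an arbitrary choice.

```python
from functools import wraps


def check_equal_lengths(fn):
    """Hypothetical helper (not part of datasets): wrap a batched map
    function and raise if the returned columns differ in length."""

    @wraps(fn)
    def wrapper(examples, *args, **kwargs):
        batch = fn(examples, *args, **kwargs)
        # Record the length of every returned column.
        lengths = {name: len(values) for name, values in batch.items()}
        # More than one distinct length means the batch is inconsistent.
        if len(set(lengths.values())) > 1:
            raise ValueError(f"Columns have mismatched lengths: {lengths}")
        return batch

    return wrapper


@check_equal_lengths
def mapping_fn(examples):
    # Same output as the MWE: empty existing column, non-empty new column.
    return {"test": [], "y": [1]}
```

With this wrapper, data.map(mapping_fn, batched=True) should fail with the ValueError raised inside the mapping function instead of returning an empty dataset. It only compares the returned columns with each other, so it does not catch every case the Arrow check would.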

Environment info

  • datasets version: 2.19.1
  • Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
  • Python version: 3.11.8
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.1
  • fsspec version: 2024.2.0

felix-schneider, May 07 '24 11:05