datasets Batched mapping does not raise an error if values for an existing column are empty

Open felix-schneider opened this issue 9 months ago • 0 comments

Describe the bug

Using Dataset.map(fn, batched=True) allows resizing the dataset by returning a dict of lists, all of which must be the same size. If they are not the same size, an error like pyarrow.lib.ArrowInvalid: Column 1 named x expected length 1 but got length 0 is raised.

No such error is raised, however, if the function returns an empty list for a column that already exists in the dataset, even when the other returned columns are non-empty. In that case, the dataset is silently resized to 0 rows.

Steps to reproduce the bug

MWE:

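import datasets

# Single-row dataset with one existing column, "test".
data = datasets.Dataset.from_dict({"test": [1]})

def mapping_fn(examples):
    # Mismatched lengths: existing column "test" is empty, new column "y" has one value.
    return {"test": [], "y": [1]}

# No exception is raised here.
data = data.map(mapping_fn, batched=True)
print(len(data))  # the dataset has been silently resized to 0 rows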

Note that the error is raised correctly when the function returns "x": [] (an empty list for a column that does not yet exist), and also when it returns "test": [1, 2].

Expected behavior

Expected an exception: pyarrow.lib.ArrowInvalid: Column 1 named test expected length 1 but got length 0 or pyarrow.lib.ArrowInvalid: Column 2 named y expected length 0 but got length 1.

Any exception would be acceptable.
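Until the library raises on its own, one possible user-side workaround is to validate the output of the mapping function before it reaches Dataset.map. The following is only a minimal sketch: check_equal_lengths is a hypothetical helper, not part of datasets, and it assumes the wrapped function always returns a dict of lists.

```python
import functools


def check_equal_lengths(fn):
    """Hypothetical helper: wrap a batched mapping function and reject ragged output."""

    @functools.wraps(fn)
    def wrapper(batch, *args, **kwargs):
        out = fn(batch, *args, **kwargs)
        # Assumption: `out` is a dict mapping column names to lists.
        lengths = {name: len(values) for name, values in out.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f"Columns have mismatched lengths: {lengths}")
        return out

    return wrapper


@check_equal_lengths
def mapping_fn(examples):
    # Same ragged output as in the MWE above.
    return {"test": [], "y": [1]}


try:
    mapping_fn({"test": [1]})
except ValueError as err:
    print(err)
```

The decorated function can then be passed to data.map(mapping_fn, batched=True) as before, so the mismatch surfaces as a ValueError instead of an empty dataset.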

Environment info

datasets version: 2.19.1
Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
Python version: 3.11.8
huggingface_hub version: 0.22.2
PyArrow version: 15.0.2
Pandas version: 2.2.1
fsspec version: 2024.2.0

May 07 '24 11:05 felix-schneider
