Batched mapping does not raise an error if values for an existing column are empty
Describe the bug
Dataset.map(fn, batched=True) allows resizing the dataset by returning a dict of lists, all of which must have the same length. If the lengths differ, an error such as pyarrow.lib.ArrowInvalid: Column 1 named x expected length 1 but got length 0 is raised.
This is not the case if the function returns an empty list for an existing column in the dataset. In that case, the dataset is silently resized to 0 rows.
Steps to reproduce the bug
MWE:
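import datasets

data = datasets.Dataset.from_dict({"test": [1]})

def mapping_fn(examples):
    # "test" already exists in the dataset and is returned empty; "y" is a new column with one value
    return {"test": [], "y": [1]}

data = data.map(mapping_fn, batched=True)
print(len(data))  # reported result: 0, with no exception raised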
Note that when returning "x": [] instead, the error is raised correctly. The same holds when returning "test": [1, 2].
Expected behavior
Expected an exception, for example pyarrow.lib.ArrowInvalid: Column 1 named test expected length 1 but got length 0 or pyarrow.lib.ArrowInvalid: Column 2 named y expected length 0 but got length 1.
Any exception would be acceptable.
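Until the library raises on its own, a user-side guard can surface the mismatch. The following is only a minimal sketch, not part of datasets: strict_lengths is a hypothetical helper name, and it assumes the mapping function returns a plain dict of lists (not a pyarrow table or pandas DataFrame). It compares the returned columns against each other only, not against input columns that the function leaves untouched.

```python
from functools import wraps


def strict_lengths(fn):
    """Wrap a batched mapping function so that mismatched column lengths raise.

    Hypothetical helper; assumes fn returns a dict mapping column names to lists.
    """
    @wraps(fn)
    def wrapper(examples, *args, **kwargs):
        out = fn(examples, *args, **kwargs)
        # Collect the length of every returned column.
        lengths = {name: len(values) for name, values in out.items()}
        # More than one distinct length means the batch is inconsistent.
        if len(set(lengths.values())) > 1:
            raise ValueError(f"Returned columns differ in length: {lengths}")
        return out
    return wrapper


def mapping_fn(examples):
    # Same function as in the MWE: empty existing column, non-empty new column.
    return {"test": [], "y": [1]}


checked_fn = strict_lengths(mapping_fn)

# Intended use (requires datasets to be installed):
# data = data.map(checked_fn, batched=True)
```

With this wrapper, the MWE's mapping function fails with a ValueError inside the wrapper instead of producing an empty dataset.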
Environment info
- datasets version: 2.19.1
- Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- huggingface_hub version: 0.22.2
- PyArrow version: 15.0.2
- Pandas version: 2.2.1
- fsspec version: 2024.2.0