Inconsistent "The features can't be aligned" error when combining map, multiprocessing, and variable length outputs
Describe the bug
I'm using a dataset with map and multiprocessing to run a function that returns a variable-length list of outputs. This output list may be empty. Normally this is handled fine, but there is an edge case that crops up when using multiprocessing: in some cases, an empty list result ends up in a dataset shard consisting of a single item. This produces a "The features can't be aligned"
error that is difficult to debug because it depends on the number of processes/shards used.
I've reproduced a minimal example below. My current workaround is to fill empty results with a dummy value that I filter out afterwards, but this was a weird error that took a while to track down.
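For reference, the dummy-value workaround looks roughly like this. This is only a sketch: the SENTINEL value and the helper names are hypothetical, not part of the actual code.

```python
# Hypothetical sentinel workaround: pad empty results so every shard
# infers the same schema, then filter the padding back out afterwards.
SENTINEL = {"test": -1}  # placeholder dict; -1 is an arbitrary choice

def fill_empty(output):
    # replace an empty result with a one-element sentinel list
    return output if output else [SENTINEL]

def drop_sentinels(output):
    # remove the padding again after map() has run
    return [item for item in output if item != SENTINEL]
```

With this, the mapped function returns fill_empty(result) instead of a bare empty list, and a follow-up map (or filter) strips the sentinels from each row.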
Steps to reproduce the bug
import datasets

dataset = datasets.Dataset.from_list([{'idx': i} for i in range(60)])

def test_func(row, idx):
    if idx == 58:
        return {'output': []}
    else:
        return {'output': [{'test': 1}, {'test': 2}]}

# this works fine
test1 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=4)

# this fails
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32)
>ValueError: The features can't be aligned because the key output of features {'idx': Value(dtype='int64', id=None), 'output': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='null', id=None), length=-1, id=None) (expected either [{'test': Value(dtype='int64', id=None)}] or Value("null").
The error occurs during the check
_check_if_features_can_be_aligned([dset.features for dset in dsets])
When the multiprocessing split lines up just right with the empty return value, one of the dset in dsets will have a single item with an empty list value, causing the error.
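The shard split can be checked without running map at all. The sketch below reimplements the contiguous-shard arithmetic (assumed here to mirror what Dataset.shard(..., contiguous=True) does internally) and shows that with 60 rows and num_proc=32, row 58 lands alone in its shard, while with num_proc=4 it shares a 15-row shard:

```python
def contiguous_shard_range(n_rows, num_shards, index):
    # contiguous split: the first (n_rows % num_shards) shards get one extra row
    # (assumption: matches datasets' Dataset.shard(..., contiguous=True) arithmetic)
    div, mod = divmod(n_rows, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return start, end

# which shard holds row 58 (the one that returns an empty list)?
shards_32 = [contiguous_shard_range(60, 32, i) for i in range(32)]
print([r for r in shards_32 if r[0] <= 58 < r[1]])  # -> [(58, 59)], a single-row shard

shards_4 = [contiguous_shard_range(60, 4, i) for i in range(4)]
print([r for r in shards_4 if r[0] <= 58 < r[1]])  # -> [(45, 60)], a 15-row shard
```

A shard whose only row has output == [] gives Arrow nothing to infer an element type from, so its feature becomes Sequence(Value("null")) and fails to align with the other shards' [{'test': Value('int64')}].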
Expected behavior
The expected behavior is that the result is the same regardless of the num_proc value used.
Environment info
datasets version: 2.11.0
Python version: 3.9.16
This scenario currently requires explicitly passing the target features (to avoid the error):
import datasets
...
features = dataset.features
features["output"] = [{"test": datasets.Value("int64")}]
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32, features=features)
I just encountered the same error in the same situation (multiprocessing with variable-length outputs).
The funny (or dangerous?) thing is that this error only showed up when testing with a small test dataset (16 examples, ValueError with num_proc > 1), but the same code works fine for the full dataset (~70k examples).
@mariosasko Any idea on how to do that with a nested feature with lists of variable lengths containing dicts?
EDIT: Was able to narrow it down: >200 examples: no error; <150 examples: error. No idea what to make of this, but it seems pretty obvious that this is a bug...
This error also occurs when concatenating such datasets.