
Inconsistent "The features can't be aligned" error when combining map, multiprocessing, and variable length outputs

Open kheyer opened this issue 1 year ago • 3 comments

Describe the bug

I'm using a dataset with map and multiprocessing to run a function that returns a variable-length list of outputs. This output list may be empty. Normally this is handled fine, but there is an edge case that crops up when using multiprocessing: in some cases, an empty-list result ends up in a dataset shard consisting of a single item. This causes a "The features can't be aligned" error that is difficult to debug because it depends on the number of processes/shards used.

I've reproduced a minimal example below. My current workaround is to fill empty results with a dummy value that I filter out afterwards, but this was a confusing error that took a while to track down.
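
For reference, that dummy-value workaround might look roughly like the sketch below (the -1 placeholder is arbitrary; dataset and test_func are the ones from the reproduction in the next section):

def padded_func(row, idx):
    out = test_func(row, idx)["output"]
    # Pad empty results with a placeholder row so every shard infers the same schema,
    # then drop the placeholder rows afterwards.
    return {"output": out if out else [{"test": -1}]}

padded = dataset.map(padded_func, with_indices=True, num_proc=32)
cleaned = padded.filter(lambda row: row["output"] != [{"test": -1}])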

Steps to reproduce the bug

import datasets

dataset = datasets.Dataset.from_list([{'idx':i} for i in range(60)])

def test_func(row, idx):
    if idx==58:
        return {'output': []}
    else:
        return {'output' : [{'test':1}, {'test':2}]}

# this works fine
test1 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=4)

# this fails
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32)
>ValueError: The features can't be aligned because the key output of features {'idx': Value(dtype='int64', id=None), 'output': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='null', id=None), length=-1, id=None) (expected either [{'test': Value(dtype='int64', id=None)}] or Value("null").

The error occurs during the check

_check_if_features_can_be_aligned([dset.features for dset in dsets])

When the multiprocessing split lines up just right with the empty return value, one of the datasets in dsets contains only the item with the empty list value, causing the error.
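
The root cause appears to be per-shard type inference: a shard whose only row has an empty output list gets that column inferred as a sequence of nulls, which cannot be aligned with the list-of-structs feature of the other shards. A minimal illustration of the inference, independent of map (a sketch):

import datasets

# A shard that only sees the empty list infers a null element type for "output".
empty_shard = datasets.Dataset.from_dict({"idx": [58], "output": [[]]})
print(empty_shard.features["output"])
# Sequence(feature=Value(dtype='null', id=None), length=-1, id=None)

# The other shards infer a list of structs instead.
full_shard = datasets.Dataset.from_dict({"idx": [0], "output": [[{"test": 1}, {"test": 2}]]})
print(full_shard.features["output"])
# [{'test': Value(dtype='int64', id=None)}]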

Expected behavior

The expected behavior is that the result is the same regardless of the num_proc value used.

Environment info

datasets version: 2.11.0
Python version: 3.9.16

kheyer avatar Jul 11 '23 20:07 kheyer

This scenario currently requires explicitly passing the target features (to avoid the error):

import datasets

...

features = dataset.features
features["output"] = = [{"test": datasets.Value("int64")}]
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32, features=features)

mariosasko avatar Jul 12 '23 15:07 mariosasko
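
Presumably this works because the explicit features bypass the per-shard type inference, so the shard that only sees the empty list no longer infers a null element type for "output". If you prefer not to mutate dataset.features in place, the same schema can likely also be built from scratch with the list-of-structs shorthand (a sketch):

import datasets

features = datasets.Features({
    "idx": datasets.Value("int64"),
    "output": [{"test": datasets.Value("int64")}],  # list of structs with an int64 "test" field
})
test2 = dataset.map(test_func, with_indices=True, num_proc=32, features=features)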

I just encountered the same error in the same situation (multiprocessing with variable length outputs).

The funny (or dangerous?) thing is that this error only showed up when testing with a small test dataset (16 examples, ValueError with num_proc > 1), but the same code works fine for the full dataset (~70k examples).

@mariosasko Any idea how to do that for a nested feature with variable-length lists containing dicts?

EDIT: Was able to narrow it down: >200 examples: no error, <150 examples: error. No idea what to make of this, but it seems pretty obvious that this is a bug...

jphme avatar Oct 25 '23 13:10 jphme

This error also occurs when concatenating datasets.

Ananthzeke avatar Feb 10 '24 19:02 Ananthzeke
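
For what it's worth, the alignment error mentioned above can presumably be reproduced with concatenate_datasets directly, since an empty list column again gets a null element type inferred (a sketch):

import datasets

ds_a = datasets.Dataset.from_list([{"output": [{"test": 1}]}])
ds_b = datasets.Dataset.from_list([{"output": []}])  # "output" inferred as a sequence of nulls

# Expected to raise the same "The features can't be aligned" ValueError
datasets.concatenate_datasets([ds_a, ds_b])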