
Dataset creation is broken if nesting a dict inside a dict inside a list

Open TimSchneider42 opened this issue 8 months ago • 2 comments

Describe the bug

Hey,

I noticed that the creation of datasets with Dataset.from_generator is broken if dicts and lists are nested in a certain way and a schema is being passed. See below for details.

Best, Tim

Steps to reproduce the bug

Running this code:

from datasets import Dataset, Features, Sequence, Value


def generator():
    yield {
        "a": [{"b": {"c": 0}}],
    }


features = Features(
    {
        "a": Sequence(
            feature={
                "b": {
                    "c": Value("int32"),
                },
            },
            length=1,
        )
    }
)

dataset = Dataset.from_generator(generator, features=features)

leads to

Generating train split: 1 examples [00:00, 540.85 examples/s]
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1635, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
                              ^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_writer.py", line 657, in finalize
    self.write_examples_on_file()
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_writer.py", line 510, in write_examples_on_file
    self.write_batch(batch_examples=batch_examples)
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_writer.py", line 629, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 4851, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1608, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 399, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 1004, in pyarrow.lib.Array.cast
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/pyarrow/compute.py", line 405, in cast
    return call_function("cast", [arr], options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 598, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: struct<c: int32>>[1] to struct using function cast_struct

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/test/tools/hf_test2.py", line 23, in <module>
    dataset = Dataset.from_generator(generator, features=features)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 1114, in from_generator
    ).read()
      ^^^^^^
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 925, in download_and_prepare
    self._download_and_prepare(
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1001, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1487, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Process finished with exit code 1

Expected behavior

I expected this code not to lead to an error.

I have done some digging and it seems the problem is the get_nested_type function in features.py, which, for whatever reason, flips Sequences and dicts whenever it encounters a dict inside a Sequence. The flip itself seems to be necessary, as disabling it entirely leads to another error. However, by keeping the flip enabled for the top level and disabling it for all deeper levels, I was able to work around the problem. Specifically, with get_nested_type patched as follows, the example above works (note the level parameter I added):

def get_nested_type(schema: FeatureType, level=0) -> pa.DataType:
    """
    get_nested_type() converts a datasets.FeatureType into a pyarrow.DataType, and acts as the inverse of
        generate_from_arrow_type().

    It performs double-duty as the implementation of Features.type and handles the conversion of
        datasets.Feature->pa.struct
    """
    # Nested structures: we allow dict, list/tuples, sequences
    if isinstance(schema, Features):
        return pa.struct(
            {key: get_nested_type(schema[key], level=level + 1) for key in schema}
        )  # Features is subclass of dict, and dict order is deterministic since Python 3.6
    elif isinstance(schema, dict):
        return pa.struct(
            {key: get_nested_type(schema[key], level=level + 1) for key in schema}
        )  # however don't sort on struct types since the order matters
    elif isinstance(schema, (list, tuple)):
        if len(schema) != 1:
            raise ValueError("When defining list feature, you should just provide one example of the inner type")
        value_type = get_nested_type(schema[0], level=level + 1)
        return pa.list_(value_type)
    elif isinstance(schema, LargeList):
        value_type = get_nested_type(schema.feature, level=level + 1)
        return pa.large_list(value_type)
    elif isinstance(schema, Sequence):
        value_type = get_nested_type(schema.feature, level=level + 1)
        # We allow to reverse list of dict => dict of list for compatibility with tfds
        if isinstance(schema.feature, dict) and level == 1:
            data_type = pa.struct({f.name: pa.list_(f.type, schema.length) for f in value_type})
        else:
            data_type = pa.list_(value_type, schema.length)
        return data_type

    # Other objects are callable which returns their data type (ClassLabel, Array2D, Translation, Arrow datatype creation methods)
    return schema()

I honestly have no idea what I am doing here, so this might cause other issues for different inputs.

Environment info

  • datasets version: 3.6.0
  • Platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • huggingface_hub version: 0.30.2
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0

Also tested it with 3.5.0, same result.

TimSchneider42 avatar May 13 '25 21:05 TimSchneider42

Hi! That's because Sequence is a type that comes from tensorflow datasets and inverts lists and dicts when doing Sequence(dict).

Instead, you should use a list. In your case:

features = Features({
    "a": [{"b": {"c": Value("int32")}}]
})

lhoestq avatar May 14 '25 23:05 lhoestq

Hi,

Thanks for the swift reply! Could you quickly clarify a couple of points?

  1. Is there any benefit to using Sequence over plain lists, especially for longer lists (in my case, up to 256 entries)?
  2. When exactly can I use Sequence? Is it always fine as long as there is at most one level of dictionaries inside?
  3. When creating the data in the generator, do I need to swap lists and dicts manually, or does that happen automatically?
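
For context, the swap I mean in question 3 is this layout change, sketched in plain Python (the helper names are mine, not a datasets API):

```python
def rows_to_columns(rows):
    # list of dicts (what my generator yields) -> dict of lists (TFDS-style layout)
    return {key: [row[key] for row in rows] for key in rows[0]} if rows else {}

def columns_to_rows(columns):
    # dict of lists -> list of dicts (the inverse direction)
    keys = list(columns)
    n = len(columns[keys[0]]) if keys else 0
    return [{key: columns[key][i] for key in keys} for i in range(n)]

rows = [{"b": 0}, {"b": 1}]
assert rows_to_columns(rows) == {"b": [0, 1]}
assert columns_to_rows(rows_to_columns(rows)) == rows
```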

Also, the documentation does not seem to mention this limitation of the Sequence type anywhere and encourages users to use it here. In fact, I did not even know that just using a Python list was an option. Maybe the documentation can be improved to mention the limitations of Sequence and highlight that lists can be used instead.

Thanks a lot in advance!

Best, Tim

TimSchneider42 avatar May 20 '25 19:05 TimSchneider42