Dataset creation is broken if nesting a dict inside a dict inside a list
Describe the bug
Hey,
I noticed that the creation of datasets with Dataset.from_generator is broken if dicts and lists are nested in a certain way and a schema is being passed. See below for details.
Best, Tim
Steps to reproduce the bug
Running this code:
from datasets import Dataset, Features, Sequence, Value


def generator():
    yield {
        "a": [{"b": {"c": 0}}],
    }


features = Features(
    {
        "a": Sequence(
            feature={
                "b": {
                    "c": Value("int32"),
                },
            },
            length=1,
        )
    }
)

dataset = Dataset.from_generator(generator, features=features)
leads to
Generating train split: 1 examples [00:00, 540.85 examples/s]
Traceback (most recent call last):
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1635, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_writer.py", line 657, in finalize
self.write_examples_on_file()
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_writer.py", line 510, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_writer.py", line 629, in write_batch
pa_table = pa.Table.from_arrays(arrays, schema=schema)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 4851, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1608, in pyarrow.lib._sanitize_arrays
File "pyarrow/array.pxi", line 399, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 1004, in pyarrow.lib.Array.cast
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/pyarrow/compute.py", line 405, in cast
return call_function("cast", [arr], options, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_compute.pyx", line 598, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from fixed_size_list<item: struct<c: int32>>[1] to struct using function cast_struct
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/test/tools/hf_test2.py", line 23, in <module>
dataset = Dataset.from_generator(generator, features=features)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 1114, in from_generator
).read()
^^^^^^
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/io/generator.py", line 49, in read
self.builder.download_and_prepare(
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 925, in download_and_prepare
self._download_and_prepare(
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1001, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1487, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/user/miniconda3/envs/test/lib/python3.11/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Process finished with exit code 1
Expected behavior
I expected this code not to lead to an error.
I have done some digging and figured out that the problem seems to be the get_nested_type function in features.py, which, for whatever reason, flips Sequences and dicts whenever it encounters a dict inside of a sequence. This seems to be necessary, as disabling that flip leads to another error. However, by keeping that flip enabled for the highest level and disabling it for all subsequent levels, I was able to work around this problem. Specifically, by patching get_nested_type as follows, it works on the given example (emphasis on the level parameter I added):
def get_nested_type(schema: FeatureType, level=0) -> pa.DataType:
    """
    get_nested_type() converts a datasets.FeatureType into a pyarrow.DataType, and acts as the inverse of
        generate_from_arrow_type().

    It performs double-duty as the implementation of Features.type and handles the conversion of
        datasets.Feature->pa.struct
    """
    # Nested structures: we allow dict, list/tuples, sequences
    if isinstance(schema, Features):
        return pa.struct(
            {key: get_nested_type(schema[key], level=level + 1) for key in schema}
        )  # Features is subclass of dict, and dict order is deterministic since Python 3.6
    elif isinstance(schema, dict):
        return pa.struct(
            {key: get_nested_type(schema[key], level=level + 1) for key in schema}
        )  # however don't sort on struct types since the order matters
    elif isinstance(schema, (list, tuple)):
        if len(schema) != 1:
            raise ValueError("When defining list feature, you should just provide one example of the inner type")
        value_type = get_nested_type(schema[0], level=level + 1)
        return pa.list_(value_type)
    elif isinstance(schema, LargeList):
        value_type = get_nested_type(schema.feature, level=level + 1)
        return pa.large_list(value_type)
    elif isinstance(schema, Sequence):
        value_type = get_nested_type(schema.feature, level=level + 1)
        # We allow to reverse list of dict => dict of list for compatibility with tfds
        if isinstance(schema.feature, dict) and level == 1:
            data_type = pa.struct({f.name: pa.list_(f.type, schema.length) for f in value_type})
        else:
            data_type = pa.list_(value_type, schema.length)
        return data_type

    # Other objects are callable which returns their data type (ClassLabel, Array2D, Translation, Arrow datatype creation methods)
    return schema()
I have honestly no idea what I am doing here, so this might produce other issues for different inputs.
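To make the flip itself easier to see, here is a minimal pure-Python sketch of that one branch. It is a toy model under loud assumptions, not the library code: `ToySequence`, `_list_of` and `describe_type` are hypothetical names invented for illustration, and types are rendered as plain strings instead of real pyarrow types.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToySequence:
    """Hypothetical stand-in for datasets.Sequence (not the real class)."""

    feature: Any
    length: int = -1


def _list_of(inner: str, length: int) -> str:
    # Render a variable-length or fixed-size list type as a string.
    return f"list<{inner}>" if length == -1 else f"fixed_size_list<{inner}>[{length}]"


def describe_type(schema: Any, flip: bool = True) -> str:
    """Render a toy schema as a type string, mimicking the Sequence(dict) flip."""
    if isinstance(schema, dict):
        fields = ", ".join(f"{k}: {describe_type(v, flip)}" for k, v in schema.items())
        return f"struct<{fields}>"
    if isinstance(schema, list):
        return _list_of(describe_type(schema[0], flip), -1)
    if isinstance(schema, ToySequence):
        if flip and isinstance(schema.feature, dict):
            # list of dict => dict of list: every field is wrapped in its own list
            fields = ", ".join(
                f"{k}: {_list_of(describe_type(v, flip), schema.length)}"
                for k, v in schema.feature.items()
            )
            return f"struct<{fields}>"
        return _list_of(describe_type(schema.feature, flip), schema.length)
    return str(schema)  # leaf type name, e.g. "int32"


nested = {"a": ToySequence(feature={"b": {"c": "int32"}}, length=1)}
print(describe_type(nested, flip=True))
# struct<a: struct<b: fixed_size_list<struct<c: int32>>[1]>>
print(describe_type(nested, flip=False))
# struct<a: fixed_size_list<struct<b: struct<c: int32>>>[1]>
```

With the flip, the field "b" ends up as fixed_size_list<struct<c: int32>>[1], which looks like the type named in the ArrowNotImplementedError above. Without it, the list wraps the whole struct, matching the shape of the data the generator yields. This only models the schema side, so it may not capture everything the writer does.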
Environment info
- datasets version: 3.6.0
- Platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.35
- Python version: 3.11.11
- huggingface_hub version: 0.30.2
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- fsspec version: 2024.12.0
Also tested it with 3.5.0, same result.
Hi! That's because Sequence is a type that comes from TensorFlow Datasets and inverts lists and dicts when doing Sequence(dict).
Instead, you should use a list. In your case:
features = Features({
    "a": [{"b": {"c": Value("string")}}]
})
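For anyone unfamiliar with the inversion, the plain-Python sketch below shows the list-of-dicts to dict-of-lists idea on the example row. `rows_to_columns` is a hypothetical helper written for illustration, not a `datasets` function, and it assumes every dict in the list has the same keys. It says nothing about how the library performs the conversion internally.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn a list of dicts into a dict of lists (hypothetical helper)."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for key, value in row.items():
            # Collect the values of each key into one list per key.
            columns.setdefault(key, []).append(value)
    return columns


list_of_dicts = [{"b": {"c": 0}}]
dict_of_lists = rows_to_columns(list_of_dicts)
print(dict_of_lists)  # {'b': [{'c': 0}]}
```

Only the outer list and dict swap places here; the inner dict {"c": 0} is left untouched inside the list.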
Hi,
Thanks for the swift reply! Could you quickly clarify a couple of points?
- Is there any benefit in using Sequence over normal lists? Especially for longer lists (in my case, up to 256 entries)
- When exactly can I use Sequence? If there is a maximum of one level of dictionaries inside, then it's always fine?
- When creating the data in the generator, do I need to swap lists and dicts manually, or does that happen automatically?
Also, the documentation does not seem to mention this limitation of the Sequence type anywhere and encourages users to use it here. In fact, I did not even know that just using a Python list was an option. Maybe the documentation can be improved to mention the limitations of Sequence and highlight that lists can be used instead.
Thanks a lot in advance!
Best, Tim