Arrow map type in parquet files unsupported
Describe the bug
When I try to load parquet files that were processed with Spark, I get the following issue:
ValueError: Arrow type map<string, string ('warc_headers')> does not have a datasets dtype equivalent.
Strangely, loading the dataset with streaming=True solves the issue.
Steps to reproduce the bug
The dataset is private, but this can be reproduced with any dataset that has Arrow maps.
Expected behavior
Loading the dataset should work whether streaming is True or not.
Environment info
- datasets version: 2.10.1
- Platform: Linux-5.15.0-1029-gcp-x86_64-with-glibc2.31
- Python version: 3.10.7
- PyArrow version: 8.0.0
- Pandas version: 1.4.2
I'm attaching a minimal reproducible example:
from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq
table_with_map = pa.Table.from_pydict(
{"a": [1, 2], "b": [[("a", 2)], [("b", 4)]]},
schema=pa.schema({"a": pa.int32(), "b": pa.map_(pa.string(), pa.int32())})
)
pq.write_table(table_with_map, "parquet_with_map.parquet")
dset = load_dataset("parquet", data_files="parquet_with_map.parquet", split="train") # error unless streaming=True
For a dataset generated with the packaged loaders (CSV, JSON, Parquet), streaming=True sets the dataset's features to None (unless explicitly provided in load_dataset), hence no error will be thrown as long as the features stay "unresolved" (resolving the features with _resolve_features will lead to an error).
I've also been wondering about datasets support for Arrow Map datatypes. I had a situation where I had a pandas series of dict[str, float] with hundreds of different possible key values (i.e., not bounded), and this got converted to a sequence of structs where every single struct had the entire set of keys.
I worked around it by explicitly creating a sequence of [str, float], but given that pyarrow has an explicit Map datatype, it would be good to be able to explicitly cast/force this data type combination.
(feel free to ignore) polars will not support this type: https://github.com/pola-rs/polars/issues/3942#issuecomment-1202331210
Polars will not add the map dtype. It's benefit do not outweigh the extra complexity. Maybe we can investigate conversion of maps to struct. But I will have to explore that.
Looks like they chose to convert every instance with https://github.com/pola-rs/polars/pull/4226