
Arrow map type in parquet files unsupported

Open TevenLeScao opened this issue 1 year ago • 4 comments

Describe the bug

When I try to load parquet files that were processed with Spark, I get the following issue:

ValueError: Arrow type map<string, string ('warc_headers')> does not have a datasets dtype equivalent.

Strangely, loading the dataset with streaming=True solves the issue.

Steps to reproduce the bug

The dataset is private, but this can be reproduced with any dataset that has Arrow maps.

Expected behavior

The dataset should load whether or not streaming=True is set.

Environment info

  • datasets version: 2.10.1
  • Platform: Linux-5.15.0-1029-gcp-x86_64-with-glibc2.31
  • Python version: 3.10.7
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2

TevenLeScao • Mar 06 '23 12:03

I'm attaching a minimal reproducible example:

from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq

table_with_map = pa.Table.from_pydict(
    {"a": [1, 2], "b": [[("a", 2)], [("b", 4)]]},
    schema=pa.schema({"a": pa.int32(), "b": pa.map_(pa.string(), pa.int32())})
)
pq.write_table(table_with_map, "parquet_with_map.parquet")
dset = load_dataset("parquet", data_files="parquet_with_map.parquet", split="train") # error unless streaming=True

For a dataset generated with the packaged loaders (CSV, JSON, Parquet), streaming=True sets the dataset's features to None (unless they are explicitly provided in load_dataset). No error is thrown as long as the features stay "unresolved"; resolving them with _resolve_features will lead to an error.

mariosasko • Mar 14 '23 17:03

I've also been wondering about datasets support for Arrow Map datatypes. I had a situation where I had a pandas series of dict[str, float] with hundreds of different possible keys (i.e. not bounded), and this got converted to a sequence of structs where every single struct had the entire set of keys.

I worked around it by explicitly creating a sequence of [str, float] pairs, but given that pyarrow has an explicit Map datatype, it would be good to be able to explicitly cast to or force this data type.
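To make the difference concrete, here is a plain-Python sketch of the two shapes. The padded version only simulates the "every struct has every key" outcome described above (it is not the library's actual inference code), and dict_to_pairs is a hypothetical helper name.

```python
rows = [{"x": 1.0}, {"y": 2.5, "z": 0.5}]

# Simulated struct-style layout: each row carries the union of all keys,
# with None for the keys it does not actually have.
all_keys = sorted({key for row in rows for key in row})
padded = [{key: row.get(key) for key in all_keys} for row in rows]


def dict_to_pairs(mapping):
    """Hypothetical helper: turn {key: value} into a list of key/value records."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


# Pair-style layout: each row only stores the entries it has, and every
# record has the same two fields no matter how many distinct keys exist.
pairs = [dict_to_pairs(row) for row in rows]

print(padded)
print(pairs)
```

With hundreds of possible keys, the padded layout grows with the total number of distinct keys per row, while the pair layout grows only with the number of entries each row actually contains.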

eware-godaddy • Nov 26 '23 21:11

(feel free to ignore) polars will not support this type: https://github.com/pola-rs/polars/issues/3942#issuecomment-1202331210

Polars will not add the map dtype. Its benefits do not outweigh the extra complexity. Maybe we can investigate conversion of maps to struct. But I will have to explore that.

severo • Mar 15 '24 12:03

Looks like they chose to convert every instance with https://github.com/pola-rs/polars/pull/4226

metasj • Mar 15 '24 18:03