polars
Support for Map DataType
Problem description
Most data processing systems and data frame libraries have a non-strict MapType (dict/HashMap). Are there any plans to support this in Polars (Rust/Python) as well?
Ref arrow type: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Map https://arrow.apache.org/docs/python/generated/pyarrow.map_.html#pyarrow.map_
I don't think it is worth the extra code bloat and complexity. It is a List<struct<2>> physically, so that's how we read them in polars.
I think FixedSizeList and Decimal have much, much higher prio.
Sure. I just added a tracking ticket, for now or later.
With a List<struct<2>>, the user will lose random key lookup, right, since lists are sequential?
I was wondering if there were any plans to implement maps, now that FixedSizeList and Decimal have been added. I have some dicts with variable keys that I'd like to work with; is there a simple workaround with the List<struct<2>> type?
@ritchie46 Would the Map/List<struct<2>> type solve issues like #10234? I run into the empty struct issue quite regularly when dealing with API response data (which Polars has been a game changer for). However, if there are any empty dictionaries in the response payload, Polars outright fails to parse the data into a DataFrame. And when responses are deeply nested, it can be pretty untenable to remove all the empty dictionaries.
We have issues with DuckDB generating an Arrow MapType, which is not directly supported. It would be great for Polars to fully support the MapType for compatibility with Arrow.
Any news on this?
Can we have an update on this issue?
Specifically - why is this an issue with polars 1.21.0 but not with, say, 1.14.0? Same data loads perfectly fine with an older polars version.
@JankoJerinic You would need to provide more details.
Ideally by opening a new issue with a reproducible example.
Actually - this is an interesting one. The issue only surfaces when the input PyArrow column is empty, otherwise things work out just fine. Try running this on polars 1.21.0.
import pyarrow as pa
import polars as pl
map_data = []
map_array = pa.array(map_data, type=pa.map_(pa.string(), pa.string()))
pa_table = pa.table([map_array], names=['generic_map'])
df = pl.from_arrow(pa_table)
This will fail with a fairly misleading message:
thread '<unnamed>' panicked at crates/polars-core/src/datatypes/field.rs:234:19:
Arrow datatype Map(Field { name: "entries", dtype: Struct([Field { name: "key", dtype: Utf8, is_nullable: false, metadata: None }, Field { name: "value", dtype: Utf8, is_nullable: true, metadata: None }]), is_nullable: false, metadata: None }, false) not supported by Polars. You probably need to activate that data-type feature.
Now, put something in the map_data, say:
map_data = [
    {'foo': 'bar'},
    {'foo': 'bar', 'baz': 'qoo'},
    {'foo': 'bar', 'baz': 'qoo', 'qoo': 'qux'},
]
Polars will happily create the Series, with the correct type, as seen below. So, this is strictly an issue when the input table has no rows.
List(Struct({'key': String, 'value': String}))
shape: (3, 1)
┌─────────────────────────────────┐
│ generic_map                     │
│ ---                             │
│ list[struct[2]]                 │
╞═════════════════════════════════╡
│ [{"foo","bar"}]                 │
│ [{"foo","bar"}, {"baz","qoo"}]  │
│ [{"foo","bar"}, {"baz","qoo"},… │
└─────────────────────────────────┘
Also - thank you for the quick reply and apologies for my initial terseness. I realized in real time that this actually only impacts conversion of empty tables.
I just confirmed it: given a column_type which is a PyArrow MapType, you can "trick" Polars into converting an empty input table without error if you "recode" the input column as a list of structs.
if isinstance(column_type, pa.MapType) and len(table) == 0:
    recoded_column = pa.array(
        [],
        type=pa.list_(pa.struct([("key", column_type.key_type), ("value", column_type.item_type)])),
    )
Resulting Polars data frame:
List(Struct({'key': String, 'value': String}))
shape: (0, 1)
┌─────────────────┐
│ generic_map     │
│ ---             │
│ list[struct[2]] │
╞═════════════════╡
└─────────────────┘
Is this something where a PR would be welcomed? It seems like longer term it'd be significant work, but it could start off small. Something like:
- [ ] Add DataType (is this breaking?)
- [ ] Add array and mutable array variants
- [ ] Add serde for IO where it would otherwise fail (is this breaking?)
- [ ] Add conversion to/from struct
- [ ] Add serde for "native" map types that are currently converted to structs (behind a config)
- [ ] Add other necessary helpers
  - [ ] `concat_map`
  - [ ] `keys`, `values`
  - [ ] `eval_keys`, `eval_values`
  - [ ] `__get_item__`
  - [ ] `len`
  - [ ] others...

It'd provide some value as soon as it can serde files it couldn't otherwise.