polars
create struct columns from sequences of tuples
Problem description
A dict whose value is a nested list is inferred as type struct:
>>> pl.Series([{"a": 1, "b": [[1, 2], [3, 4]]}])
shape: (1,)
Series: '' [struct[2]]
[
{1,[[1, 2], [3, 4]]}
]
The same dict with tuples instead of lists remains as object:
>>> pl.Series([{"a": 1, "b": ([1, 2], [3, 4])}])
shape: (1,)
Series: '' [o][object]
[
{'a': 1, 'b': ([1, 2], [3, 4])}
]
>>> pl.Series([{"a": 1, "b": ((1, 2), (3, 4))}])
shape: (1,)
Series: '' [o][object]
[
{'a': 1, 'b': ((1, 2), (3, 4))}
]
We don't have a tuple type. A tuple typically holds elements of different types, i.e. a tuple is heterogeneous.
Does setting a schema work?
How would I set a schema in this case?
The data was coming from elsewhere in the form of tuples.
>>> def gen_data(): return ((1, 2), (3, 4), (5, 6))
>>> pl.Series([{"a": 1, "b": [gen_data()]}])
shape: (1,)
Series: '' [o][object]
[
{'a': 1, 'b': [((1, 2), (3, 4), (5, 6))]}
]
If they need to be converted manually beforehand that's fine - it seems numpy is the simplest way:
>>> pl.Series([{"a": 1, "b": np.array(gen_data())}])
shape: (1,)
Series: '' [o][object]
[
{'a': 1, 'b': array([[1, 2],
[3, 4],
[5, 6]])}
]
>>> pl.Series([{"a": 1, "b": [np.array(gen_data()).tolist()]}])
shape: (1,)
Series: '' [struct[2]]
[
{1,[[[1, 2], [3, 4], [5, 6]]]}
]
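The manual conversion above can also be done without NumPy. This is a minimal sketch: `tuples_to_lists` is a hypothetical helper, not part of polars, and it only walks tuples, lists, and dicts. The converted value is plain nested lists, the same shape as the `.tolist()` result above.

```python
def tuples_to_lists(value):
    """Recursively replace tuples with lists; other values pass through unchanged."""
    if isinstance(value, (tuple, list)):
        return [tuples_to_lists(item) for item in value]
    if isinstance(value, dict):
        return {key: tuples_to_lists(item) for key, item in value.items()}
    return value


def gen_data():
    return ((1, 2), (3, 4), (5, 6))


# Hypothetical usage: convert the row before handing it to pl.Series([...]).
row = tuples_to_lists({"a": 1, "b": [gen_data()]})
print(row)  # {'a': 1, 'b': [[[1, 2], [3, 4], [5, 6]]]}
```

Unlike `np.array(...).tolist()`, this does not require the inner tuples to have equal lengths.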
I think I may have messed up the title of my post - apologies for that - the struct part is not really relevant.
I guess it has to do with me not understanding the difference between:
>>> pl.DataFrame({"a": [(1, 2), (3, 4)]})
shape: (2, 1)
┌───────────┐
│ a │
│ --- │
│ list[i64] │
╞═══════════╡
│ [1, 2] │
│ [3, 4] │
└───────────┘
>>> pl.DataFrame({"a": [[1, 2], [3, 4]]})
shape: (2, 1)
┌───────────┐
│ a │
│ --- │
│ list[i64] │
╞═══════════╡
│ [1, 2] │
│ [3, 4] │
└───────────┘
and:
>>> pl.DataFrame({"a": [[(1, 2)], [(3, 4)]]})
shape: (2, 1)
┌──────────┐
│ a │
│ --- │
│ object │
╞══════════╡
│ [(1, 2)] │
│ [(3, 4)] │
└──────────┘
>>> pl.DataFrame({"a": [[[1, 2]], [[3, 4]]]})
shape: (2, 1)
┌─────────────────┐
│ a │
│ --- │
│ list[list[i64]] │
╞═════════════════╡
│ [[1, 2]] │
│ [[3, 4]] │
└─────────────────┘
Can a schema force the tuple version to result in list[list[i64]]?
>>> pl.DataFrame({"a": [[(1, 2)], [(3, 4)]]}, schema={"a": pl.List(pl.List(pl.Int64))})
shape: (2, 1)
┌──────────┐
│ a │
│ --- │
│ object │
╞══════════╡
│ [(1, 2)] │
├──────────┤
│ [(3, 4)] │
└──────────┘
Or does the data need to be converted first?
From Python's perspective a tuple is little more than an immutable list; we should probably handle them the same when we initialise data 🤔
@alexander-beedie I just noticed your comment in https://github.com/pola-rs/polars/issues/6891#issuecomment-1430998111 which means my previous attempt now returns a list[list[i64]] column instead of object:
>>> pl.DataFrame({"a": [[(1, 2)], [(3, 4)]]}, schema={"a": pl.List(pl.List(pl.Int64))})
shape: (2, 1)
┌─────────────────┐
│ a │
│ --- │
│ list[list[i64]] │
╞═════════════════╡
│ [[1, 2]] │
│ [[3, 4]] │
└─────────────────┘
Please feel free to close this, thank you.
(Closed by https://github.com/pola-rs/polars/pull/6795).