create struct columns from sequences of tuples

Open cmdlineluser opened this issue 2 years ago • 3 comments

Problem description

A dict containing nested lists is inferred as a struct:

>>> pl.Series([{"a": 1, "b": [[1, 2], [3, 4]]}])
shape: (1,)
Series: '' [struct[2]]
[
	{1,[[1, 2], [3, 4]]}
]

With tuples in place of the lists, the Series remains object:

>>> pl.Series([{"a": 1, "b": ([1, 2], [3, 4])}])
shape: (1,)
Series: '' [o][object]
[
	{'a': 1, 'b': ([1, 2], [3, 4])}
]
>>> pl.Series([{"a": 1, "b": ((1, 2), (3, 4))}])
shape: (1,)
Series: '' [o][object]
[
	{'a': 1, 'b': ((1, 2), (3, 4))}
]

cmdlineluser avatar Jan 26 '23 15:01 cmdlineluser

We don't have a tuple type. A tuple typically holds values of different types, i.e. a tuple is heterogeneous.

Does setting a schema work?

ritchie46 avatar Jan 26 '23 15:01 ritchie46
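
To illustrate the distinction drawn above: a list column needs a single element type, while a Python tuple may mix types. The following is a minimal sketch in plain Python, not Polars code; the helper name `is_homogeneous` is hypothetical and only checks the top-level element types.

```python
def is_homogeneous(seq):
    """Return True if every element of seq has the same Python type."""
    # An empty sequence or a single type both count as homogeneous.
    return len({type(x) for x in seq}) <= 1


print(is_homogeneous((1, 2)))          # True: could map to a list of integers
print(is_homogeneous((1, "a", 2.5)))   # False: mixed types, no single list dtype
```

The tuples in this issue are all of the first kind, which is why treating them like lists is a reasonable expectation.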

How would I set a schema in this case?

The data was coming from elsewhere in the form of tuples.

>>> def gen_data(): return ((1, 2), (3, 4), (5, 6))
>>> pl.Series([{"a": 1, "b": [gen_data()]}])
shape: (1,)
Series: '' [o][object]
[
	{'a': 1, 'b': [((1, 2), (3, 4), (5, 6))]}
]

If they need to be converted manually beforehand that's fine - it seems numpy is the simplest way:

>>> pl.Series([{"a": 1, "b": np.array(gen_data())}])
shape: (1,)
Series: '' [o][object]
[
	{'a': 1, 'b': array([[1, 2],
       [3, 4],
       [5, 6]])}
]
>>> pl.Series([{"a": 1, "b": [np.array(gen_data()).tolist()]}])
shape: (1,)
Series: '' [struct[2]]
[
	{1,[[[1, 2], [3, 4], [5, 6]]]}
]

cmdlineluser avatar Jan 26 '23 17:01 cmdlineluser
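
As an alternative to the numpy round trip above, the tuples can be converted with a small recursive function in plain Python. This is only a sketch: the name `tuples_to_lists` is hypothetical and not part of Polars, and it assumes the data consists of dicts, lists, tuples and scalar values.

```python
def tuples_to_lists(value):
    """Recursively replace tuples with lists, walking into lists and dicts."""
    if isinstance(value, (tuple, list)):
        return [tuples_to_lists(v) for v in value]
    if isinstance(value, dict):
        return {k: tuples_to_lists(v) for k, v in value.items()}
    # Scalars (ints, strings, None, ...) are returned unchanged.
    return value


def gen_data():
    return ((1, 2), (3, 4), (5, 6))


row = tuples_to_lists({"a": 1, "b": [gen_data()]})
print(row)  # {'a': 1, 'b': [[[1, 2], [3, 4], [5, 6]]]}
```

The converted `row` has the same nested-list shape as the `np.array(gen_data()).tolist()` version, so passing `[row]` to `pl.Series` should behave like that example.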

I think I may have messed up the title of my post - apologies for that - the struct part is not really relevant.

I guess it has to do with me not understanding the difference between:

>>> pl.DataFrame({"a": [(1, 2), (3, 4)]})
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3, 4]    │
└───────────┘
>>> pl.DataFrame({"a": [[1, 2], [3, 4]]})
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3, 4]    │
└───────────┘

and:

>>> pl.DataFrame({"a": [[(1, 2)], [(3, 4)]]})
shape: (2, 1)
┌──────────┐
│ a        │
│ ---      │
│ object   │
╞══════════╡
│ [(1, 2)] │
│ [(3, 4)] │
└──────────┘
>>> pl.DataFrame({"a": [[[1, 2]], [[3, 4]]]})
shape: (2, 1)
┌─────────────────┐
│ a               │
│ ---             │
│ list[list[i64]] │
╞═════════════════╡
│ [[1, 2]]        │
│ [[3, 4]]        │
└─────────────────┘

Can a schema force the tuple version to result in list[list[i64]]?

>>> pl.DataFrame({"a": [[(1, 2)], [(3, 4)]]}, schema={"a": pl.List(pl.List(pl.Int64))})
shape: (2, 1)
┌──────────┐
│ a        │
│ ---      │
│ object   │
╞══════════╡
│ [(1, 2)] │
├──────────┤
│ [(3, 4)] │
└──────────┘

Or does the data need to be converted first?

cmdlineluser avatar Jan 28 '23 17:01 cmdlineluser

From Python's perspective a tuple is little more than an immutable list; we should probably handle them the same when we initialise data 🤔

alexander-beedie avatar Feb 02 '23 10:02 alexander-beedie

@alexander-beedie I just noticed your comment in https://github.com/pola-rs/polars/issues/6891#issuecomment-1430998111, which means my previous attempt now returns a list[list[i64]] column instead of object:

>>> pl.DataFrame({"a": [[(1, 2)], [(3, 4)]]}, schema={"a": pl.List(pl.List(pl.Int64))})
shape: (2, 1)
┌─────────────────┐
│ a               │
│ ---             │
│ list[list[i64]] │
╞═════════════════╡
│ [[1, 2]]        │
│ [[3, 4]]        │
└─────────────────┘

Please feel free to close this, thank you.

cmdlineluser avatar Feb 15 '23 12:02 cmdlineluser

(Closed by https://github.com/pola-rs/polars/pull/6795).

alexander-beedie avatar Feb 15 '23 12:02 alexander-beedie