polars
Support adding prefixes to `DataFrame.unnest`
Problem description
`DataFrame.unnest` currently fails when nested field names are shared across multiple columns; for example, two columns can both have a nested field named "other".
It would be nice to assign a prefix to unnested fields so that you wouldn't have to manually unnest each column.
Perhaps unnest could have a kwarg "prefix" which would be a dictionary of column names and the desired prefix.
Yes, this would be a useful addition.
Just thought I'd add that perhaps it would be of more benefit to be able to do this outside of .unnest()?
e.g. something like what I asked for in #9613
That would also allow it to be used in other situations, and .unnest() could remain as it is.
Ah yeah, adding prefixes to all struct fields would also solve the DuplicateError.
But it seems like an expensive operation, since all nested structs have to be edited, compared to adding prefixes after the unnest operation (but before merging the unnested dataframes generated by each column).
@bingbong-sempai If you're referring to the unnest function I posted in the linked issue, the structs are only edited to change their field names. They aren't recreated, so I don't think it's actually very expensive.
Also, in that unnest function, if the extra parameters aren't invoked then it performs the exact same mechanics as exist now. I just copied the source of unnest and added the parameters with some if statements before the extra functionality.
I'll add my workaround for a similar problem: unnesting DataFrames with (deeply) nested structs.
Example data
import polars as pl

data = [
    {"a": {"b": 1, "c": {"d": 2}}, "b": {"b": 1}, "c": 2},
    {"a": {"b": 2, "c": {"d": 3}}, "b": {"b": 10}, "c": 12},
]
df = pl.DataFrame(data)
print(df)
shape: (2, 3)
┌───────────┬───────────┬─────┐
│ a         ┆ b         ┆ c   │
│ ---       ┆ ---       ┆ --- │
│ struct[2] ┆ struct[1] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,{2}}   ┆ {1}       ┆ 2   │
│ {2,{3}}   ┆ {10}      ┆ 12  │
└───────────┴───────────┴─────┘
unnest_all function
import polars as pl


def unnest_all(self: pl.DataFrame, separator="_"):
    def _unnest_all(struct_columns):
        # Prefix every field with its parent column name, then unnest.
        return self.with_columns(
            [
                pl.col(col).struct.rename_fields(
                    [
                        f"{col}{separator}{field_name}"
                        for field_name in self[col].struct.fields
                    ]
                )
                for col in struct_columns
            ]
        ).unnest(struct_columns)

    # Repeat until no struct columns remain, so nested structs are flattened too.
    struct_columns = [col for col in self.columns if self[col].dtype == pl.Struct]
    while len(struct_columns):
        self = _unnest_all(struct_columns=struct_columns)
        struct_columns = [col for col in self.columns if self[col].dtype == pl.Struct]
    return self


# Monkey-patch the helper onto pl.DataFrame.
pl.DataFrame.unnest_all = unnest_all
Result
df_unnested = df.unnest_all()
print(df_unnested)
shape: (2, 4)
┌─────┬───────┬─────┬─────┐
│ a_b ┆ a_c_d ┆ b_b ┆ c   │
│ --- ┆ ---   ┆ --- ┆ --- │
│ i64 ┆ i64   ┆ i64 ┆ i64 │
╞═════╪═══════╪═════╪═════╡
│ 1   ┆ 2     ┆ 1   ┆ 2   │
│ 2   ┆ 3     ┆ 10  ┆ 12  │
└─────┴───────┴─────┴─────┘
The R package tidyr also has the parameter names_sep, which I find very useful. In polars this could be something like:
df = pl.DataFrame(
    {"a": {
        "x": [1, 2, 3],
        "y": ["a", "b", "c"]
    }}
)
df.unnest("a", names_sep=None) # the current default, i.e. `df.unnest("a")`
# shape: (3, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘
df.unnest("a", names_sep="_")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a_x ┆ a_y │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘
There could also be an even more flexible option, names_pattern, that supports string interpolation with 2 special values: cols (the column names) and fields (the field names).
df.unnest("a", names_pattern="{cols}_{fields}")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a_x ┆ a_y │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘
df.unnest("a", names_pattern="{fields}_{cols}")
# shape: (3, 2)
# ┌─────┬─────┐
# │ x_a ┆ y_a │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘
Yes, would be a great feature
.name.map_fields() has since been added, which simplifies things somewhat (albeit as a separate step).