polars Support adding prefixes to `DataFrame.unnest`

Problem description

DataFrame.unnest currently fails when nested field names are shared across multiple columns, for example when two columns both have a nested field named "other".
It would be nice to be able to assign a prefix to the unnested fields, so that you don't have to unnest each column manually.
Perhaps unnest could take a kwarg "prefix": a dictionary mapping column names to the desired prefix.

Jul 09 '23 07:07 bingbong-sempai

Yes, this would be a useful addition.

I'd just add that it might be of more benefit to be able to do this outside of .unnest().

e.g. something like I asked in #9613

That would also allow it to be used in other situations and .unnest() could remain as it is.

Jul 09 '23 08:07 cmdlineluser

Ah yeah, adding prefixes to all struct fields would also solve the DuplicateError.
But it seems like an expensive operation, since all nested structs have to be edited, compared to adding prefixes after the unnest operation (but before merging the unnested dataframes generated by each column).

Jul 09 '23 14:07 bingbong-sempai

@bingbong-sempai If you're referring to the unnest function I posted in the linked issue, they're only edited to change their name. They aren't recreated so I don't think it's actually very expensive.

Also, in that unnest function, if the extra parameters aren't invoked then it performs the exact same mechanics as exist now. I just copied the source of unnest and added the parameters with some if statements before the extra functionality.

Jul 11 '23 15:07 deanm0000

I'll add my workaround for a similar problem: unnesting dataframes with (deeply) nested structs.

Example data


import polars as pl

data = [
    {"a": {"b": 1, "c": {"d":2}}, "b": {"b": 1}, "c": 2},
    {"a": {"b": 2, "c": {"d":3}}, "b": {"b": 10}, "c": 12},
]

df = pl.DataFrame(data)
print(df)

shape: (2, 3)
┌───────────┬───────────┬─────┐
│ a         ┆ b         ┆ c   │
│ ---       ┆ ---       ┆ --- │
│ struct[2] ┆ struct[1] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,{2}}   ┆ {1}       ┆ 2   │
│ {2,{3}}   ┆ {10}      ┆ 12  │
└───────────┴───────────┴─────┘

unnest_all function

import polars as pl

def unnest_all(self: pl.DataFrame, separator="_"):
    def _unnest_all(struct_columns):
        # Prefix every field with its parent column name, then unnest.
        return self.with_columns(
            [
                pl.col(col).struct.rename_fields(
                    [
                        f"{col}{separator}{field_name}"
                        for field_name in self[col].struct.fields
                    ]
                )
                for col in struct_columns
            ]
        ).unnest(struct_columns)

    # Comparing against the `pl.Struct` class matches a struct with any fields.
    struct_columns = [col for col in self.columns if self[col].dtype == pl.Struct]
    # Repeat until no struct columns remain, so nested structs are flattened too.
    while struct_columns:
        self = _unnest_all(struct_columns=struct_columns)
        struct_columns = [col for col in self.columns if self[col].dtype == pl.Struct]

    return self

pl.DataFrame.unnest_all = unnest_all

Result

df_unnested = df.unnest_all()
print(df_unnested)

shape: (2, 4)
┌─────┬───────┬─────┬─────┐
│ a_b ┆ a_c_d ┆ b_b ┆ c   │
│ --- ┆ ---   ┆ --- ┆ --- │
│ i64 ┆ i64   ┆ i64 ┆ i64 │
╞═════╪═══════╪═════╪═════╡
│ 1   ┆ 2     ┆ 1   ┆ 2   │
│ 2   ┆ 3     ┆ 10  ┆ 12  │
└─────┴───────┴─────┴─────┘

Jul 31 '23 14:07 legout

In the R package tidyr they also have the parameter names_sep which I find very useful. In polars this could be something like:

df = pl.DataFrame(
    {"a": [
        {"x": 1, "y": "a"},
        {"x": 2, "y": "b"},
        {"x": 3, "y": "c"},
    ]}
)

df.unnest("a", names_sep=None) # the current default, i.e. `df.unnest("a")`
# shape: (3, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

df.unnest("a", names_sep="_")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a_x ┆ a_y │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

There could also be an even more flexible option, names_pattern, that supports string interpolation with two special values: cols (the column names) and fields (the field names).

df.unnest("a", names_pattern="{cols}_{fields}")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a_x ┆ a_y │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

df.unnest("a", names_pattern="{fields}_{cols}")
# shape: (3, 2)
# ┌─────┬─────┐
# │ x_a ┆ y_a │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

Mar 22 '24 08:03 mgirlich

Yes, would be a great feature

Apr 06 '24 09:04 mkleinbort-ic

.name.map_fields() has since been added which simplifies things somewhat.

(albeit as a separate step)

Apr 06 '24 10:04 cmdlineluser
