Support adding prefixes to `DataFrame.unnest`

Open bingbong-sempai opened this issue 1 year ago • 5 comments

Problem description

DataFrame.unnest currently fails when nested field names are shared across multiple columns - for example two columns can have a nested field named "other".
It would be nice to assign a prefix to unnested fields so that you wouldn't have to manually unnest each column.
Perhaps unnest could have a kwarg "prefix" which would be a dictionary of column names and the desired prefix.

bingbong-sempai avatar Jul 09 '23 07:07 bingbong-sempai

Yes, this would be a useful addition.

Just thought I'd add that I think perhaps it would be of more benefit to be able to do this outside of .unnest()?

e.g. something like I asked in #9613

That would also allow it to be used in other situations and .unnest() could remain as it is.

cmdlineluser avatar Jul 09 '23 08:07 cmdlineluser

Ah yeah, adding prefixes to all struct fields would also solve the DuplicateError.
But it seems like an expensive operation, since all nested structs have to be edited, compared to adding prefixes after the unnest operation (but before merging the unnested dataframes generated by each column).

bingbong-sempai avatar Jul 09 '23 14:07 bingbong-sempai

@bingbong-sempai If you're referring to the unnest function I posted in the linked issue, they're only edited to change their name. They aren't recreated so I don't think it's actually very expensive.

Also, in that unnest function, if the extra parameters aren't invoked then it performs the exact same mechanics as exist now. I just copied the source of unnest and added the parameters with some if statements before the extra functionality.

deanm0000 avatar Jul 11 '23 15:07 deanm0000

I'll add my workaround for a similar problem: unnesting dataframes with (deeply) nested structs.

Example data


import polars as pl

data = [
    {"a": {"b": 1, "c": {"d": 2}}, "b": {"b": 1}, "c": 2},
    {"a": {"b": 2, "c": {"d": 3}}, "b": {"b": 10}, "c": 12},
]

df = pl.DataFrame(data)
print(df)

shape: (2, 3)
┌───────────┬───────────┬─────┐
│ a         ┆ b         ┆ c   │
│ ---       ┆ ---       ┆ --- │
│ struct[2] ┆ struct[1] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,{2}}   ┆ {1}       ┆ 2   │
│ {2,{3}}   ┆ {10}      ┆ 12  │
└───────────┴───────────┴─────┘

unnest_all function

import polars as pl

def unnest_all(self: pl.DataFrame, separator="_"):
    def _unnest_all(struct_columns):
        # Prefix every field with its parent column name, then unnest.
        return self.with_columns(
            [
                pl.col(col).struct.rename_fields(
                    [
                        f"{col}{separator}{field_name}"
                        for field_name in self[col].struct.fields
                    ]
                )
                for col in struct_columns
            ]
        ).unnest(struct_columns)

    # Compare against the dtype class so that structs with any fields match.
    struct_columns = [col for col in self.columns if self[col].dtype == pl.Struct]
    # Repeat until no struct columns remain; this flattens nested structs too.
    while struct_columns:
        self = _unnest_all(struct_columns=struct_columns)
        struct_columns = [col for col in self.columns if self[col].dtype == pl.Struct]

    return self

pl.DataFrame.unnest_all = unnest_all

Result

df_unnested = df.unnest_all()
print(df_unnested)

shape: (2, 4)
┌─────┬───────┬─────┬─────┐
│ a_b ┆ a_c_d ┆ b_b ┆ c   │
│ --- ┆ ---   ┆ --- ┆ --- │
│ i64 ┆ i64   ┆ i64 ┆ i64 │
╞═════╪═══════╪═════╪═════╡
│ 1   ┆ 2     ┆ 1   ┆ 2   │
│ 2   ┆ 3     ┆ 10  ┆ 12  │
└─────┴───────┴─────┴─────┘

legout avatar Jul 31 '23 14:07 legout

In the R package tidyr they also have the parameter names_sep which I find very useful. In polars this could be something like:

df = pl.DataFrame(
    {"a": [
        {"x": 1, "y": "a"},
        {"x": 2, "y": "b"},
        {"x": 3, "y": "c"},
    ]}
)

df.unnest("a", names_sep=None) # the current default, i.e. `df.unnest("a")`
# shape: (3, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

df.unnest("a", names_sep="_")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a_x ┆ a_y │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

There could also be an even more flexible option, names_pattern, that supports string interpolation with two special values: cols (the column names) and fields (the field names).

df.unnest("a", names_pattern="{cols}_{fields}")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a_x ┆ a_y │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

df.unnest("a", names_pattern="{fields}_{cols}")
# shape: (3, 2)
# ┌─────┬─────┐
# │ x_a ┆ y_a │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# └─────┴─────┘

mgirlich avatar Mar 22 '24 08:03 mgirlich

Yes, would be a great feature

mkleinbort-ic avatar Apr 06 '24 09:04 mkleinbort-ic

.name.map_fields() has since been added which simplifies things somewhat.

(albeit as a separate step)

cmdlineluser avatar Apr 06 '24 10:04 cmdlineluser