polars pivot may still introduce duplicate column names in some cases

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame({'a': [1,2], 'b': [1,2], 'c': ['x','x'], 'd': ['x', 'y']})
df.pivot(values='a', index='b', columns=['c', 'd'])

# shape: (2, 4)
# ┌─────┬─────┬──────┬──────┐
# │ b   ┆ x   ┆ x    ┆ y    │
# │ --- ┆ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64 ┆ i64  ┆ i64  │
# ╞═════╪═════╪══════╪══════╡
# │ 1   ┆ 1   ┆ 1    ┆ null │
# │ 2   ┆ 2   ┆ null ┆ 2    │
# └─────┴─────┴──────┴──────┘

Log output

No response

Issue description

This issue was created to fully resolve https://github.com/pola-rs/polars/issues/7731. After checking that issue out, I noticed that there were still examples provided that would result in a duplicate column name with the current version.

Expected behavior

The column names that pivot creates should be unique and follow the structure discussed in https://github.com/pola-rs/polars/issues/7731

In the case of the example, I believe the output should be

shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ b   ┆ c_x ┆ d_x  ┆ d_y  │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ 1   ┆ 1    ┆ null │
│ 2   ┆ 2   ┆ null ┆ 2    │
└─────┴─────┴──────┴──────┘

however please correct me if I am wrong, or let me know if you disagree.

Installed versions

--------Version info---------
Polars:              0.19.8
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.11.4 (main, Jun  8 2023, 17:02:11) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               1.26.0
openpyxl:            <not installed>
pandas:              2.1.1
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

Oct 11 '23 12:10 henryharbeck

Hitting this issue as well.

Nov 02 '23 15:11 dan714

Hitting this issue as well.

Sorry - my issue was actually the resulting columns were a mix of floats and integers so 0.0 and 0 were separated by the pivot. Just needed to make the types consistent for my issue.

Nov 02 '23 15:11 dan714

I can reproduce this issue.

I am also, getting an slightly different issue like this:

df = pl.DataFrame({'a':['beep', 'bop'],'b':['a','b'],'c':['s','f'],'d':[7,8], 'e':['x', 'y']}) df.pivot(values=['a'], index='b', columns=['c', 'e'], aggregate_function='count')

shape: (2, 5)

b	s	f	x	y
str	u32	u32	u32	u32
"a"	1	null	1	null
"b"	null	1	null	1

As above, expected format would be:

b	s_x	f _y
str	u32	u32
"a"	1	null
"b"	null	1

it seems like it could have a similar route cause

Nov 06 '23 01:11 AaronCreighton

(I could be mistaken @AaronCreighton - but from a quick look, it does seem your example is caused by a different issue.)

With regards to the initial example: Is the problem here that c and d both contain the value x?

Meaning there needs to be a check for intersecting values in any of the columns= and if so, modify the headers?

[/Users/user/git/polars/crates/polars-ops/src/frame/pivot/mod.rs:249] headers.apply_values(|v| Cow::from(format!("{column_column_name}{sep}{v}"))) = shape: (1,)
ChunkedArray: 'c' [str]
[
	"c_x"
]
[/Users/user/git/polars/crates/polars-ops/src/frame/pivot/mod.rs:249] headers.apply_values(|v| Cow::from(format!("{column_column_name}{sep}{v}"))) = shape: (2,)
ChunkedArray: 'd' [str]
[
	"d_x"
	"d_y"
]

Nov 06 '23 03:11 cmdlineluser

@cmdlineluser good to know, i can create a new issue for this if needed.

Nov 06 '23 03:11 AaronCreighton

This happens when there are multiple columns, but works fine when there are multiple values (see https://github.com/pola-rs/polars/issues/13470).

I think the fix is as simple as changing: https://github.com/pola-rs/polars/blob/7214de240f26307d3b40645b407bb52dae6507c3/crates/polars-ops/src/frame/pivot/mod.rs#L247

to:

if values.len() > 1 || columns.len() > 1 {

Jan 06 '24 06:01 Wainberg

From discussion: the output should group by all columns, rather than treating them separately. So the output should be same as can be achieved currently with:

In [2]: df = pl.DataFrame({'a': [1,2], 'b': [1,2], 'c': ['x','x'], 'd': ['x', 'y']})
   ...: df.with_columns(c_d=pl.concat_str('c', 'd', separator='_')).pivot(values='a', index='b', columns='c_d')
Out[2]:
shape: (2, 3)
┌─────┬──────┬──────┐
│ b   ┆ x_x  ┆ x_y  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ 1   ┆ 1    ┆ null │
│ 2   ┆ null ┆ 2    │
└─────┴──────┴──────┘

This solves the duplicate columns issue, and aligns with tidyverse and pandas

Instead of

def my_pivot(df: pl.DataFrame, columns: list[str], index: str, values: str) -> pl.DataFrame:
    dfs: list[pl.DataFrame] = []
    for column in columns:
        dfs.append(df.group_by(index).agg(
            *[
                pl.col(values).filter(pl.col(column) == value).alias(f'{column}_{value}')
                for value in df[column].unique()
            ]
        ))
    result = reduce(lambda x, y: x.join(y, on=index), dfs)
    return result

more like

def my_pivot_new(df: pl.DataFrame, columns: list[str], index: str, values: str) -> pl.DataFrame:
    unique_values = df.select(struct=pl.struct(columns).unique())['struct']
    return df.group_by(index).agg(
        *[
            pl.col(values).filter(
                [pl.col(column) == value[column] for column in columns]
            ).alias('_'.join(value.values()))
            for value in unique_values
        ]
    )

Jan 26 '24 19:01 MarcoGorelli

polars polars copied to clipboard

pivot may still introduce duplicate column names in some cases

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

polars
polars copied to clipboard