polars icon indicating copy to clipboard operation
polars copied to clipboard

pivot may still introduce duplicate column names in some cases

Open henryharbeck opened this issue 1 year ago • 7 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame({'a': [1,2], 'b': [1,2], 'c': ['x','x'], 'd': ['x', 'y']})
df.pivot(values='a', index='b', columns=['c', 'd'])

# shape: (2, 4)
# ┌─────┬─────┬──────┬──────┐
# │ b   ┆ x   ┆ x    ┆ y    │
# │ --- ┆ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64 ┆ i64  ┆ i64  │
# ╞═════╪═════╪══════╪══════╡
# │ 1   ┆ 1   ┆ 1    ┆ null │
# │ 2   ┆ 2   ┆ null ┆ 2    │
# └─────┴─────┴──────┴──────┘

Log output

No response

Issue description

This issue was created to fully resolve https://github.com/pola-rs/polars/issues/7731. After checking that issue out, I noticed that there were still examples provided that would result in a duplicate column name with the current version.

Expected behavior

The column names that pivot creates should be unique and follow the structure discussed in https://github.com/pola-rs/polars/issues/7731

In the case of the example, I believe the output should be

shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ b   ┆ c_x ┆ d_x  ┆ d_y  │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ 1   ┆ 1    ┆ null │
│ 2   ┆ 2   ┆ null ┆ 2    │
└─────┴─────┴──────┴──────┘

however please correct me if I am wrong, or let me know if you disagree.

Installed versions

--------Version info---------
Polars:              0.19.8
Index type:          UInt32
Platform:            Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.11.4 (main, Jun  8 2023, 17:02:11) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               1.26.0
openpyxl:            <not installed>
pandas:              2.1.1
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

henryharbeck avatar Oct 11 '23 12:10 henryharbeck

Hitting this issue as well.

dan714 avatar Nov 02 '23 15:11 dan714

Hitting this issue as well.

Sorry - my issue was actually the resulting columns were a mix of floats and integers so 0.0 and 0 were separated by the pivot. Just needed to make the types consistent for my issue.

dan714 avatar Nov 02 '23 15:11 dan714

I can reproduce this issue.

I am also, getting an slightly different issue like this:

df = pl.DataFrame({'a':['beep', 'bop'],'b':['a','b'],'c':['s','f'],'d':[7,8], 'e':['x', 'y']}) df.pivot(values=['a'], index='b', columns=['c', 'e'], aggregate_function='count')

shape: (2, 5)

b s f x y
str u32 u32 u32 u32
"a" 1 null 1 null
"b" null 1 null 1

As above, expected format would be:

b s_x f _y
str u32 u32
"a" 1 null
"b" null 1

it seems like it could have a similar route cause

AaronCreighton avatar Nov 06 '23 01:11 AaronCreighton

(I could be mistaken @AaronCreighton - but from a quick look, it does seem your example is caused by a different issue.)

With regards to the initial example: Is the problem here that c and d both contain the value x?

Meaning there needs to be a check for intersecting values in any of the columns= and if so, modify the headers?

[/Users/user/git/polars/crates/polars-ops/src/frame/pivot/mod.rs:249] headers.apply_values(|v| Cow::from(format!("{column_column_name}{sep}{v}"))) = shape: (1,)
ChunkedArray: 'c' [str]
[
	"c_x"
]
[/Users/user/git/polars/crates/polars-ops/src/frame/pivot/mod.rs:249] headers.apply_values(|v| Cow::from(format!("{column_column_name}{sep}{v}"))) = shape: (2,)
ChunkedArray: 'd' [str]
[
	"d_x"
	"d_y"
]

cmdlineluser avatar Nov 06 '23 03:11 cmdlineluser

@cmdlineluser good to know, i can create a new issue for this if needed.

AaronCreighton avatar Nov 06 '23 03:11 AaronCreighton

This happens when there are multiple columns, but works fine when there are multiple values (see https://github.com/pola-rs/polars/issues/13470).

I think the fix is as simple as changing: https://github.com/pola-rs/polars/blob/7214de240f26307d3b40645b407bb52dae6507c3/crates/polars-ops/src/frame/pivot/mod.rs#L247

to:

if values.len() > 1 || columns.len() > 1 { 

Wainberg avatar Jan 06 '24 06:01 Wainberg

From discussion: the output should group by all columns, rather than treating them separately. So the output should be same as can be achieved currently with:

In [2]: df = pl.DataFrame({'a': [1,2], 'b': [1,2], 'c': ['x','x'], 'd': ['x', 'y']})
   ...: df.with_columns(c_d=pl.concat_str('c', 'd', separator='_')).pivot(values='a', index='b', columns='c_d')
Out[2]:
shape: (2, 3)
┌─────┬──────┬──────┐
│ b   ┆ x_x  ┆ x_y  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ 1   ┆ 1    ┆ null │
│ 2   ┆ null ┆ 2    │
└─────┴──────┴──────┘

This solves the duplicate columns issue, and aligns with tidyverse and pandas


Instead of

def my_pivot(df: pl.DataFrame, columns: list[str], index: str, values: str) -> pl.DataFrame:
    dfs: list[pl.DataFrame] = []
    for column in columns:
        dfs.append(df.group_by(index).agg(
            *[
                pl.col(values).filter(pl.col(column) == value).alias(f'{column}_{value}')
                for value in df[column].unique()
            ]
        ))
    result = reduce(lambda x, y: x.join(y, on=index), dfs)
    return result

more like

def my_pivot_new(df: pl.DataFrame, columns: list[str], index: str, values: str) -> pl.DataFrame:
    unique_values = df.select(struct=pl.struct(columns).unique())['struct']
    return df.group_by(index).agg(
        *[
            pl.col(values).filter(
                [pl.col(column) == value[column] for column in columns]
            ).alias('_'.join(value.values()))
            for value in unique_values
        ]
    )

MarcoGorelli avatar Jan 26 '24 19:01 MarcoGorelli