polars
polars copied to clipboard
pivot may still introduce duplicate column names in some cases
Checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
df = pl.DataFrame({'a': [1,2], 'b': [1,2], 'c': ['x','x'], 'd': ['x', 'y']})
df.pivot(values='a', index='b', columns=['c', 'd'])
# shape: (2, 4)
# ┌─────┬─────┬──────┬──────┐
# │ b ┆ x ┆ x ┆ y │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪══════╪══════╡
# │ 1 ┆ 1 ┆ 1 ┆ null │
# │ 2 ┆ 2 ┆ null ┆ 2 │
# └─────┴─────┴──────┴──────┘
Log output
No response
Issue description
This issue was created to fully resolve https://github.com/pola-rs/polars/issues/7731. After checking that issue out, I noticed that there were still examples provided that would result in a duplicate column name with the current version.
Expected behavior
The column names that pivot
creates should be unique and follow the structure discussed in https://github.com/pola-rs/polars/issues/7731
In the case of the example, I believe the output should be
shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ b ┆ c_x ┆ d_x ┆ d_y │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪══════╪══════╡
│ 1 ┆ 1 ┆ 1 ┆ null │
│ 2 ┆ 2 ┆ null ┆ 2 │
└─────┴─────┴──────┴──────┘
however please correct me if I am wrong, or let me know if you disagree.
Installed versions
--------Version info---------
Polars: 0.19.8
Index type: UInt32
Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.11.4 (main, Jun 8 2023, 17:02:11) [GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
matplotlib: <not installed>
numpy: 1.26.0
openpyxl: <not installed>
pandas: 2.1.1
pyarrow: <not installed>
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Hitting this issue as well.
Hitting this issue as well.
Sorry - my issue was actually the resulting columns were a mix of floats and integers so 0.0 and 0 were separated by the pivot. Just needed to make the types consistent for my issue.
I can reproduce this issue.
I am also, getting an slightly different issue like this:
df = pl.DataFrame({'a':['beep', 'bop'],'b':['a','b'],'c':['s','f'],'d':[7,8], 'e':['x', 'y']}) df.pivot(values=['a'], index='b', columns=['c', 'e'], aggregate_function='count')
shape: (2, 5)
b | s | f | x | y |
---|---|---|---|---|
str | u32 | u32 | u32 | u32 |
"a" | 1 | null | 1 | null |
"b" | null | 1 | null | 1 |
As above, expected format would be:
b | s_x | f _y |
---|---|---|
str | u32 | u32 |
"a" | 1 | null |
"b" | null | 1 |
it seems like it could have a similar route cause
(I could be mistaken @AaronCreighton - but from a quick look, it does seem your example is caused by a different issue.)
With regards to the initial example: Is the problem here that c
and d
both contain the value x
?
Meaning there needs to be a check for intersecting values in any of the columns=
and if so, modify the headers?
[/Users/user/git/polars/crates/polars-ops/src/frame/pivot/mod.rs:249] headers.apply_values(|v| Cow::from(format!("{column_column_name}{sep}{v}"))) = shape: (1,)
ChunkedArray: 'c' [str]
[
"c_x"
]
[/Users/user/git/polars/crates/polars-ops/src/frame/pivot/mod.rs:249] headers.apply_values(|v| Cow::from(format!("{column_column_name}{sep}{v}"))) = shape: (2,)
ChunkedArray: 'd' [str]
[
"d_x"
"d_y"
]
@cmdlineluser good to know, i can create a new issue for this if needed.
This happens when there are multiple columns
, but works fine when there are multiple values
(see https://github.com/pola-rs/polars/issues/13470).
I think the fix is as simple as changing: https://github.com/pola-rs/polars/blob/7214de240f26307d3b40645b407bb52dae6507c3/crates/polars-ops/src/frame/pivot/mod.rs#L247
to:
if values.len() > 1 || columns.len() > 1 {
From discussion: the output should group by all columns
, rather than treating them separately. So the output should be same as can be achieved currently with:
In [2]: df = pl.DataFrame({'a': [1,2], 'b': [1,2], 'c': ['x','x'], 'd': ['x', 'y']})
...: df.with_columns(c_d=pl.concat_str('c', 'd', separator='_')).pivot(values='a', index='b', columns='c_d')
Out[2]:
shape: (2, 3)
┌─────┬──────┬──────┐
│ b ┆ x_x ┆ x_y │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ 1 ┆ 1 ┆ null │
│ 2 ┆ null ┆ 2 │
└─────┴──────┴──────┘
This solves the duplicate columns issue, and aligns with tidyverse and pandas
Instead of
def my_pivot(df: pl.DataFrame, columns: list[str], index: str, values: str) -> pl.DataFrame:
dfs: list[pl.DataFrame] = []
for column in columns:
dfs.append(df.group_by(index).agg(
*[
pl.col(values).filter(pl.col(column) == value).alias(f'{column}_{value}')
for value in df[column].unique()
]
))
result = reduce(lambda x, y: x.join(y, on=index), dfs)
return result
more like
def my_pivot_new(df: pl.DataFrame, columns: list[str], index: str, values: str) -> pl.DataFrame:
unique_values = df.select(struct=pl.struct(columns).unique())['struct']
return df.group_by(index).agg(
*[
pl.col(values).filter(
[pl.col(column) == value[column] for column in columns]
).alias('_'.join(value.values()))
for value in unique_values
]
)