Confusing (& wrong) behavior when using `with_columns` incorrectly

Open mkleinbort-ic opened this issue 2 years ago • 3 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

I accidentally wrote this code:

import polars as pl

df = pl.DataFrame({
    'x1': [1,2,4,8,16,32],
    'x2': [1,2,3,4,5,6]
})

df.with_columns(pctChange = pl.col(['x1', 'x2']).pct_change())

>>>
shape: (6, 3)
┌─────┬─────┬───────────┐
│ x1  ┆ x2  ┆ pctChange │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ f64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ 1   ┆ null      │
│ 2   ┆ 2   ┆ 1.0       │
│ 4   ┆ 3   ┆ 0.5       │
│ 8   ┆ 4   ┆ 0.333333  │
│ 16  ┆ 5   ┆ 0.25      │
│ 32  ┆ 6   ┆ 0.2       │
└─────┴─────┴───────────┘

This is the result I'd expect if I were taking the pct_change of the x2 column, but it quietly ignores x1.
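
To check which input the pctChange column actually reflects, the changes can be recomputed by hand. This is a minimal pure-Python sketch, assuming pct_change means (current - previous) / previous with a null first row. The helper name `pct_change` here is hypothetical and is not the Polars implementation.

```python
def pct_change(values):
    """Fractional change from the previous element.

    The first element has no predecessor, so its change is None.
    """
    out = [None]
    for prev, cur in zip(values, values[1:]):
        out.append((cur - prev) / prev)
    return out


x1 = [1, 2, 4, 8, 16, 32]
x2 = [1, 2, 3, 4, 5, 6]

x1_change = pct_change(x1)  # x1 doubles at every step
x2_change = pct_change(x2)

print("x1:", x1_change)
print("x2:", x2_change)
```

Since x1 doubles at every step, its change is 1.0 on every row after the first, so the values shown in the pctChange column can only have come from x2.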

Two behaviours seem appropriate to me:

  1. Raise an error when a multi-column expression is assigned to a single column name
  2. Create a struct type column.
# Should behave like df.with_columns(pctChange = pl.struct(pl.col(['x1', 'x2']).pct_change()))
df.with_columns(pctChange = pl.col(['x1', 'x2']).pct_change())
>>>
shape: (6, 3)
┌─────┬─────┬────────────────┐
│ x1  ┆ x2  ┆ pctChange      │
│ --- ┆ --- ┆ ---            │
│ i64 ┆ i64 ┆ struct[2]      │
╞═════╪═════╪════════════════╡
│ 1   ┆ 1   ┆ {null,null}    │
│ 2   ┆ 2   ┆ {1.0,1.0}      │
│ 4   ┆ 3   ┆ {1.0,0.5}      │
│ 8   ┆ 4   ┆ {1.0,0.333333} │
│ 16  ┆ 5   ┆ {1.0,0.25}     │
│ 32  ┆ 6   ┆ {1.0,0.2}      │
└─────┴─────┴────────────────┘

In either case, the current behavior definitely violates the "don't surprise programmers" mantra.

Reproducible example

import polars as pl

df = pl.DataFrame({
    'x1': [1,2,4,8,16,32],
    'x2': [1,2,3,4,5,6]
})

df.with_columns(pctChange = pl.col(['x1', 'x2']).pct_change())

Expected behavior

It should return the same result as:

import polars as pl

df = pl.DataFrame({
    'x1': [1,2,4,8,16,32],
    'x2': [1,2,3,4,5,6]
})

df.with_columns(pctChange = pl.struct(pl.col(['x1', 'x2']).pct_change()))

Or raise an error

Installed versions

---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 8.0.0
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.8.2
connectorx: 0.3.1
xlsx2csv: <not installed>
matplotlib: 3.6.2

mkleinbort-ic, Jan 27 '23 10:01

@alexander-beedie could you take this one? This is related to the keyword argument assignment.

@mkleinbort-ic You can use the explicit alias() until this is fixed.

ritchie46, Jan 27 '23 10:01

I'm happy on my end, it's just a sharp corner I thought I'd raise

mkleinbort-ic, Jan 27 '23 13:01

> I'm happy on my end, it's just a sharp corner I thought I'd raise

@mkleinbort-ic: and many thanks for that - I've found a way to automatically structify this type of call (which does look like the right way to handle things), so the hoped-for behaviour should work by default in an upcoming release.

Update:

  • Note that the auto-structify behaviour is considered experimental, and requires opt-in via...

    pl.Config.set_auto_structify(True)
    

alexander-beedie, Jan 27 '23 15:01