Panic trying to `shift(n).over(cols)` if `cols` contains a `pl.Decimal` type
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import os
os.environ['POLARS_VERBOSE'] = "1"
import polars as pl
pl.Config.activate_decimals(True)
df = (
    pl.DataFrame({
        "k": [1, 1],
        "a": [1.1, 2.2],
        "b": [5, 6],
    })
    .with_columns(pl.col("a").cast(pl.Decimal(scale=1, precision=10)))
)
df.drop(columns=['b']).with_columns(t2=pl.col('k').shift(1).over('a'))
Log output
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
/Users/primary/code/scratch/polars_melt_bug.ipynb Cell 2 line 1
6 pl.Config.activate_decimals(True)
8 df = (pl
9 .DataFrame({
10 "k":[1,1],
(...)
13 .with_columns(pl.col("a").cast(pl.Decimal(scale=1,precision=10)))
14 )
---> 16 df.with_columns(t2=pl.col('k').shift(1).over('a'))
File ~/code/scratch/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py:8027, in DataFrame.with_columns(self, *exprs, **named_exprs)
7879 def with_columns(
7880 self,
7881 *exprs: IntoExpr | Iterable[IntoExpr],
7882 **named_exprs: IntoExpr,
7883 ) -> DataFrame:
7884 """
7885 Add columns to this DataFrame.
7886
(...)
8025
8026 """
-> 8027 return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
File ~/code/scratch/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py:100, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
95 @wraps(function)
96 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
97 _rename_keyword_argument(
98 old_name, new_name, kwargs, function.__name__, version
99 )
--> 100 return function(*args, **kwargs)
File ~/code/scratch/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1788, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
1775 comm_subplan_elim = False
1777 ldf = self._ldf.optimization_toggle(
1778 type_coercion,
1779 predicate_pushdown,
(...)
1786 _eager,
1787 )
-> 1788 return wrap_df(ldf.collect())
PanicException: group_tuples operation not supported for dtype decimal[10,1]
Issue description
If you instead run:

df.with_columns(t2=pl.col('k').shift(1).over('a', 'b'))

you get this error instead:

InvalidOperationError: vec_hash operation not supported for dtype decimal[10,1]
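For contrast, the same window expression keyed on a non-Decimal column appears to run fine, which suggests the Decimal dtype, rather than shift or over itself, is the trigger (a quick sanity check, not part of the original report):

# Sanity check: grouping over the integer column works, so the panic is
# specific to using a Decimal column as the group key.
df.with_columns(t2=pl.col('k').shift(1).over('k'))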
Expected behavior
shape: (2, 3)
┌─────┬───────────────┬──────┐
│ k ┆ a ┆ t2 │
│ --- ┆ ------------- ┆ --- │
│ i64 ┆ decimal[10,1] ┆ i64 │
╞═════╪═══════════════╪══════╡
│ 1 ┆ 1.1 ┆ null │
│ 1 ┆ 2.2 ┆ null │
└─────┴───────────────┴──────┘
Installed versions
--------Version info---------
Polars: 0.19.19
Index type: UInt32
Platform: macOS-14.1.1-x86_64-i386-64bit
Python: 3.11.2 (main, Mar 18 2023, 23:16:11) [Clang 14.0.0 (clang-1400.0.29.202)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
matplotlib: <not installed>
numpy: <not installed>
openpyxl: <not installed>
pandas: <not installed>
pyarrow: <not installed>
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
As a workaround, I thought I could cast the pl.Decimal columns to pl.Categorical (or even pl.Utf8):

df.with_columns(pl.col('a').cast(pl.Categorical))

In both cases, the operation fails with the error:

InvalidOperationError: casting from Decimal(10, 1) to LargeUtf8 not supported

I can, however, do df.with_columns(pl.col('a').cast(pl.Float64).cast(pl.Utf8)), but I've lost the precision at that point.
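One sketch of a lossless alternative: stringify each value in Python via map_elements, which avoids the Float64 round-trip at the cost of an element-wise Python call. This assumes Decimal values surface as Python decimal.Decimal objects (whose str() preserves the exact value); I haven't verified this against Decimal columns, so treat it as an assumption:

# Sketch: stringify element-wise (assumes values come back as
# decimal.Decimal), then use the string column as the window key.
(
    df
    .with_columns(a_string=pl.col('a').map_elements(str))
    .with_columns(t2=pl.col('k').shift(1).over('a_string'))
)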
You can use pyarrow as a workaround:

import pyarrow as pa
import pyarrow.compute as pc

(
    df.with_columns(a_string=pl.from_arrow(pc.cast(df['a'].to_arrow(), pa.large_string())))
    .with_columns(t2=pl.col('k').shift(1).over('a_string'))
)
Edit: here's another way where all the work happens inside the over, using map_batches instead of referencing df inside its own expression. It also dictionary-encodes the column and extracts the indices, rather than converting to strings, which should be more efficient.
(
    df
    .drop(columns=['b'])
    .with_columns(
        t2=pl.col('k').shift(1).over(
            pl.col('a').map_batches(
                lambda x: pl.from_arrow(pc.dictionary_encode(x.to_arrow()).indices)
            )
        )
    )
)
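The same trick generalizes to any Decimal key column. A minimal helper sketch (decimal_key is my own name, not a Polars API) that wraps the dictionary-encoding expression:

import polars as pl
import pyarrow.compute as pc

def decimal_key(name: str) -> pl.Expr:
    # Hypothetical helper (not part of Polars): turn a Decimal column into
    # integer dictionary indices that over() can hash and group on.
    return pl.col(name).map_batches(
        lambda s: pl.from_arrow(pc.dictionary_encode(s.to_arrow()).indices)
    )

df.drop(columns=['b']).with_columns(t2=pl.col('k').shift(1).over(decimal_key('a')))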