Panic trying to `shift(n).over(cols)` if `cols` contains a `pl.Decimal` type

Open · jr200 opened this issue on Dec 08 '23 · 2 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
os.environ['POLARS_VERBOSE'] = "1"

import polars as pl

pl.Config.activate_decimals(True)

df = (pl
 .DataFrame({
    "k":[1,1],
    "a":[1.1,2.2],
    "b":[5,6]
    })
  .with_columns(pl.col("a").cast(pl.Decimal(scale=1,precision=10)))
)

df.drop(columns=['b']).with_columns(t2=pl.col('k').shift(1).over('a'))

Log output

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/Users/primary/code/scratch/polars_melt_bug.ipynb Cell 2 line 1
      6 pl.Config.activate_decimals(True)
      8 df = (pl
      9  .DataFrame({
     10     "k":[1,1],
   (...)
     13   .with_columns(pl.col("a").cast(pl.Decimal(scale=1,precision=10)))
     14 )
---> 16 df.with_columns(t2=pl.col('k').shift(1).over('a'))

File ~/code/scratch/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py:8027, in DataFrame.with_columns(self, *exprs, **named_exprs)
   7879 def with_columns(
   7880     self,
   7881     *exprs: IntoExpr | Iterable[IntoExpr],
   7882     **named_exprs: IntoExpr,
   7883 ) -> DataFrame:
   7884     """
   7885     Add columns to this DataFrame.
   7886 
   (...)
   8025 
   8026     """
-> 8027     return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)

File ~/code/scratch/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py:100, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     95 @wraps(function)
     96 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     97     _rename_keyword_argument(
     98         old_name, new_name, kwargs, function.__name__, version
     99     )
--> 100     return function(*args, **kwargs)

File ~/code/scratch/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1788, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
   1775     comm_subplan_elim = False
   1777 ldf = self._ldf.optimization_toggle(
   1778     type_coercion,
   1779     predicate_pushdown,
   (...)
   1786     _eager,
   1787 )
-> 1788 return wrap_df(ldf.collect())

PanicException: group_tuples operation not supported for dtype decimal[10,1]

Issue description

If you instead run:

df.with_columns(t2=pl.col('k').shift(1).over('a', 'b'))

you get this error:

InvalidOperationError: vec_hash operation not supported for dtype decimal[10,1]
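
Both errors point at the same gap: the grouping/hashing kernels (group_tuples, vec_hash) are not implemented for the Decimal dtype, so anything that has to build groups over a Decimal key fails. A plain group-by over the Decimal column presumably hits the same code path; a minimal sketch (df_dec is just an illustrative name, and the failure is an assumption, not verified on 0.19.19):

import polars as pl

pl.Config.activate_decimals(True)

df_dec = pl.DataFrame({"k": [1, 1], "a": [1.1, 2.2]}).with_columns(
    pl.col("a").cast(pl.Decimal(scale=1, precision=10))
)

# Presumably fails like .over('a') does, since grouping by a Decimal key
# relies on the same unsupported group_tuples machinery.
df_dec.group_by("a").agg(pl.col("k").count())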

Expected behavior

shape: (2, 3)
┌─────┬───────────────┬──────┐
│ k   ┆ a             ┆ t2   │
│ --- ┆ ------------- ┆ ---  │
│ i64 ┆ decimal[10,1] ┆ i64  │
╞═════╪═══════════════╪══════╡
│ 1   ┆           1.1 ┆ null │
│ 1   ┆           2.2 ┆ null │
└─────┴───────────────┴──────┘
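
For comparison, the same window expression works when the key is a non-Decimal dtype; a minimal sketch (not part of the original report) with a Float64 key that produces the same k/t2 values as above, just with "a" as f64 instead of decimal[10,1]:

import polars as pl

df_f64 = pl.DataFrame({"k": [1, 1], "a": [1.1, 2.2]})  # "a" stays Float64

# Every value of "a" is unique, so each window group holds a single row and
# shift(1) yields null in both rows.
df_f64.with_columns(t2=pl.col("k").shift(1).over("a"))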

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             macOS-14.1.1-x86_64-i386-64bit
Python:               3.11.2 (main, Mar 18 2023, 23:16:11) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

jr200 · Dec 08 '23 11:12

As a workaround, I thought I could cast the pl.Decimal columns to pl.Categorical (or even pl.Utf8):

df.with_columns(pl.col('a').cast(pl.Categorical))

In both cases, the operation fails with the error:

InvalidOperationError: casting from Decimal(10, 1) to LargeUtf8 not supported

I can, however, do df.with_columns(pl.col('a').cast(pl.Float64).cast(pl.Utf8)), but I've lost precision at that point.
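
A possible way to build a string key without the lossy Float64 round-trip is to format the values in Python (a sketch, assuming the Decimal column iterates as Python decimal.Decimal values, which format exactly; not verified on 0.19.19, and "a_str" is just an illustrative name):

# df as defined in the reproducible example above.
a_str = pl.Series("a_str", [str(v) for v in df["a"]])

(
    df.drop("b")
    .with_columns(a_str)
    .with_columns(t2=pl.col("k").shift(1).over("a_str"))
)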

jr200 · Dec 08 '23 13:12

You can use pyarrow as a workaround:

import pyarrow.compute as pc
import pyarrow as pa
(
    df.with_columns(a_string = pl.from_arrow(pc.cast(df['a'].to_arrow(), pa.large_string())))
    .with_columns(t2=pl.col('k').shift(1).over('a_string'))
)
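
Note that this leaves the helper a_string column in the result; if that is unwanted, you can chain .drop('a_string') at the end.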

Edit: here is another way where all of the work happens inside the over, using map_batches instead of referencing df inside itself. It also converts the column to a pa.dictionary and extracts the indices, rather than converting to strings, which should be more efficient.

(
    df
    .drop(columns=['b'])
    .with_columns(
        t2=pl.col('k').shift(1).over(
            pl.col('a').map_batches(lambda x: (
                pl.from_arrow(pc.dictionary_encode(x.to_arrow()).indices)
            ))
        )
    )
)
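
As a quick sanity check (a sketch, assuming the snippet above is bound to a hypothetical variable out): each decimal value in the example is unique, so every window group holds a single row and shift(1) returns null for both rows, matching the expected output in the issue.

# "out" is a hypothetical binding of the expression above.
assert out["t2"].to_list() == [None, None]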

deanm0000 · Dec 12 '23 16:12