Decimal validation not fully supported

Open benlee1284 opened this issue 1 year ago • 5 comments

Describe the bug

Validation of Decimal type not fully supported.

I have run into a number of issues when trying to use Decimals in pandera (e.g. failing to encode JSON when rendering SchemaErrors) but this one I couldn't avoid.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

from decimal import Decimal

import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    columns={'a': pa.Column(pl.Decimal()),},
    checks=[pa.Check(lambda x: False, element_wise=True)]
)

df = pl.DataFrame(data=[{'a': Decimal(1)}])

schema.validate(df)

>> PanicException: dtype Decimal(None, Some(0)) not supported

Expected behavior

A SchemaError / SchemaErrors

Desktop (please complete the following information):

  • OS: Windows 11 Pro (10.0.22631 Build 22631)
  • Browser: Chrome
  • Version: 0.19.3
  • Python Version: 3.11

Additional context

Full traceback:

thread '<unnamed>' panicked at py-polars\src\series\mod.rs:546:46:
dtype Decimal(None, Some(0)) not supported
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\expr\expr.py:4516, in Expr._map_batches_wrapper.__call__(self, *args, **kwargs)
   4515 def __call__(self, *args: Any, **kwargs: Any) -> Any:
-> 4516     result = self.function(*args, **kwargs)
   4517     if _check_for_numpy(result) and isinstance(result, np.ndarray):
   4518         result = pl.Series(result, dtype=self.return_dtype)

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\expr\expr.py:4862, in Expr.map_elements.<locals>.wrap_f(x)
   4860 with warnings.catch_warnings():
   4861     warnings.simplefilter("ignore", PolarsInefficientMapWarning)
-> 4862     return x.map_elements(
   4863         function, return_dtype=return_dtype, skip_nulls=skip_nulls
   4864     )

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\series\series.py:5504, in Series.map_elements(self, function, return_dtype, skip_nulls)
   5500     pl_return_dtype = py_type_to_dtype(return_dtype)
   5502 warn_on_inefficient_map(function, columns=[self.name], map_target="series")
   5503 return self._from_pyseries(
-> 5504     self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
   5505 )

PanicException: dtype Decimal(None, Some(0)) not supported
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[18], line 1
----> 1 schema.validate(df)

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\api\polars\container.py:58, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
     54     if is_dataframe:
     55         # if validating a polars DataFrame, use the global config setting
     56         check_obj = check_obj.lazy()
---> 58     output = self.get_backend(check_obj).validate(
     59         check_obj=check_obj,
     60         schema=self,
     61         head=head,
     62         tail=tail,
     63         sample=sample,
     64         random_state=random_state,
     65         lazy=lazy,
     66         inplace=inplace,
     67     )
     69 if is_dataframe:
     70     output = output.collect()

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\container.py:89, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
     81 core_checks = [
     82     (self.check_column_presence, (check_obj, schema, column_info)),
     83     (self.check_column_values_are_unique, (sample, schema)),
     84     (self.run_schema_component_checks, (sample, components, lazy)),
     85     (self.run_checks, (sample, schema)),
     86 ]
     88 for check, args in core_checks:
---> 89     results = check(*args)  # type: ignore[operator]
     90     if isinstance(results, CoreCheckResult):
     91         results = [results]

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\validation_depth.py:79, in validate_scope.<locals>._wrapper.<locals>.wrapper(self, check_obj, *args, **kwargs)
     73     logger.debug(
     74         f"Skipping execution of check {func.__name__} since "
     75         "validation depth is set to SCHEMA_ONLY",
     76         stacklevel=2,
     77     )
     78     return CoreCheckResult(passed=True)
---> 79 return func(self, check_obj, *args, **kwargs)

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\container.py:146, in DataFrameSchemaBackend.run_checks(self, check_obj, schema)
    143 for check_index, check in enumerate(schema.checks):
    144     try:
    145         check_results.append(
--> 146             self.run_check(check_obj, schema, check, check_index)
    147         )
    148     except SchemaDefinitionError:
    149         raise

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\base.py:75, in PolarsSchemaBackend.run_check(self, check_obj, schema, check, check_index, *args)
     63 """Handle check results, raising SchemaError on check failure.
     64
     65 :param check_obj: data object to be validated.
   (...)
     71     False.
     72 """
     73 check_result: CheckResult = check(check_obj, *args)
---> 75 passed = check_result.check_passed.collect().item()
     76 failure_cases = None
     77 message = None

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\lazyframe\frame.py:1855, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1852 # Only for testing purposes atm.
   1853 callback = _kwargs.get("post_opt_callback")
-> 1855 return wrap_df(ldf.collect(callback))

PanicException: dtype Decimal(None, Some(0)) not supported

benlee1284, Jun 17 '24 15:06

Can you try using the polars native Decimal dtype? https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Decimal.html#polars.datatypes.Decimal

cosmicBboy, Jun 17 '24 22:06

@cosmicBboy I used that for the schema (see original snippet), but I can't see a way to use the polars Decimal type when instantiating a DataFrame. My understanding was that you're meant to put Python decimal.Decimal values into polars Decimal columns.

In fact, if I cast a column to polars Decimal, reading a value back still gives a Python decimal.Decimal:

from decimal import Decimal
import polars as pl

df = pl.DataFrame({'a': [Decimal('1')]})
cast = df.select(pl.col('a').cast(pl.Decimal()))
cast['a'][0]
>> Decimal('1')
type(cast['a'][0])
>> decimal.Decimal

Maybe I've missed a step here though!

benlee1284, Jun 18 '24 10:06

So in pandera, element-wise checks use polars' map_elements under the hood: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html#polars-expr-map-elements

And it looks like map_elements currently does not support mapping a function over a column with the Decimal dtype:

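from decimal import Decimal

import polars as pl

# Identity function mapped over a Decimal column; panics on this polars version
(
    pl.LazyFrame({"a": [Decimal(1)]})
    .with_columns(
        pl.col("a").map_elements(lambda x: x)
    ).collect()
)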

error:

  File "/Users/nielsbantilan/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1817, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: dtype Decimal(None, Some(0)) not supported

You can raise this issue in the polars repo.

For now, I'd recommend using the vectorized checks that operate on the lazyframe itself: https://pandera.readthedocs.io/en/latest/polars.html#column-level-checks

cosmicBboy, Jun 18 '24 13:06

Ok cool thank you

So basically you're saying it'll fail for any DataFrame-level checks?

benlee1284, Jun 18 '24 13:06

yeah, it'll fail for any element-wise check that operates on decimal types.

cosmicBboy, Jun 22 '24 14:06