pandera
Decimal validation not fully supported
Describe the bug
Validation of Decimal type not fully supported.
I have run into a number of issues when trying to use Decimals in pandera (e.g. failing to encode JSON when rendering SchemaErrors) but this one I couldn't avoid.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
from decimal import Decimal
import pandera.polars as pa
import polars as pl
schema = pa.DataFrameSchema(
    columns={'a': pa.Column(pl.Decimal())},
    checks=[pa.Check(lambda x: False, element_wise=True)],
)
df = pl.DataFrame(data=[{'a': Decimal(1)}])
schema.validate(df)
>> PanicException: dtype Decimal(None, Some(0)) not supported
Expected behavior
A SchemaError / SchemaErrors
Desktop (please complete the following information):
- OS: Windows 11 Pro (10.0.22631 Build 22631)
- Browser: Chrome
- Version: 0.19.3
- Python Version: 3.11
Additional context
Full traceback:
thread '<unnamed>' panicked at py-polars\src\series\mod.rs:546:46:
dtype Decimal(None, Some(0)) not supported
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\expr\expr.py:4516, in Expr._map_batches_wrapper.__call__(self, *args, **kwargs)
4515 def __call__(self, *args: Any, **kwargs: Any) -> Any:
-> 4516 result = self.function(*args, **kwargs)
4517 if _check_for_numpy(result) and isinstance(result, np.ndarray):
4518 result = pl.Series(result, dtype=self.return_dtype)
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\expr\expr.py:4862, in Expr.map_elements.<locals>.wrap_f(x)
4860 with warnings.catch_warnings():
4861 warnings.simplefilter("ignore", PolarsInefficientMapWarning)
-> 4862 return x.map_elements(
4863 function, return_dtype=return_dtype, skip_nulls=skip_nulls
4864 )
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\series\series.py:5504, in Series.map_elements(self, function, return_dtype, skip_nulls)
5500 pl_return_dtype = py_type_to_dtype(return_dtype)
5502 warn_on_inefficient_map(function, columns=[self.name], map_target="series")
5503 return self._from_pyseries(
-> 5504 self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
5505 )
PanicException: dtype Decimal(None, Some(0)) not supported
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[18], line 1
----> 1 schema.validate(df)
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\api\polars\container.py:58, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
54 if is_dataframe:
55 # if validating a polars DataFrame, use the global config setting
56 check_obj = check_obj.lazy()
---> 58 output = self.get_backend(check_obj).validate(
59 check_obj=check_obj,
60 schema=self,
61 head=head,
62 tail=tail,
63 sample=sample,
64 random_state=random_state,
65 lazy=lazy,
66 inplace=inplace,
67 )
69 if is_dataframe:
70 output = output.collect()
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\container.py:89, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
81 core_checks = [
82 (self.check_column_presence, (check_obj, schema, column_info)),
83 (self.check_column_values_are_unique, (sample, schema)),
84 (self.run_schema_component_checks, (sample, components, lazy)),
85 (self.run_checks, (sample, schema)),
86 ]
88 for check, args in core_checks:
---> 89 results = check(*args) # type: ignore[operator]
90 if isinstance(results, CoreCheckResult):
91 results = [results]
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\validation_depth.py:79, in validate_scope.<locals>._wrapper.<locals>.wrapper(self, check_obj, *args, **kwargs)
73 logger.debug(
74 f"Skipping execution of check {func.__name__} since "
75 "validation depth is set to SCHEMA_ONLY",
76 stacklevel=2,
77 )
78 return CoreCheckResult(passed=True)
---> 79 return func(self, check_obj, *args, **kwargs)
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\container.py:146, in DataFrameSchemaBackend.run_checks(self, check_obj, schema)
143 for check_index, check in enumerate(schema.checks):
144 try:
145 check_results.append(
--> 146 self.run_check(check_obj, schema, check, check_index)
147 )
148 except SchemaDefinitionError:
149 raise
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\base.py:75, in PolarsSchemaBackend.run_check(self, check_obj, schema, check, check_index, *args)
63 """Handle check results, raising SchemaError on check failure.
64
65 :param check_obj: data object to be validated.
(...)
71 False.
72 """
73 check_result: CheckResult = check(check_obj, *args)
---> 75 passed = check_result.check_passed.collect().item()
76 failure_cases = None
77 message = None
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\lazyframe\frame.py:1855, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
1852 # Only for testing purposes atm.
1853 callback = _kwargs.get("post_opt_callback")
-> 1855 return wrap_df(ldf.collect(callback))
PanicException: dtype Decimal(None, Some(0)) not supported
Can you try using the polars native Decimal dtype? https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Decimal.html#polars.datatypes.Decimal
@cosmicBboy I used that for the schema (see original snippet), but I can't see a way to use the polars Decimal type when instantiating a DataFrame. My understanding was that you're meant to use the Python decimal.Decimal type in polars Decimal columns.
In fact, if I cast a column to polars Decimal, reading a value back gives a Python decimal.Decimal:
from decimal import Decimal
import polars as pl
df = pl.DataFrame({'a': [Decimal('1')]})
cast = df.select(pl.col('a').cast(pl.Decimal()))
cast['a'][0]
>> Decimal('1')
type(cast['a'][0])
>> decimal.Decimal
Maybe I've missed a step here though!
So in pandera, element-wise checks use map_elements under the hood: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html#polars-expr-map-elements
And it looks like polars currently does not support mapping a function over the Decimal dtype:
from decimal import Decimal
import polars as pl

(
    pl.LazyFrame({"a": [Decimal(1)]})
    .with_columns(
        pl.col("a").map_elements(lambda x: x)
    ).collect()
)
error:
File "/Users/nielsbantilan/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1817, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: dtype Decimal(None, Some(0)) not supported
You can raise this issue in the polars repo.
For now, I'd recommend using the vectorized checks that operate on the lazyframe itself: https://pandera.readthedocs.io/en/latest/polars.html#column-level-checks
Ok cool thank you
So basically you're saying it'll fail for any DataFrame-level checks?
Yeah, it'll fail for any element-wise check that operates on Decimal types.