pandera
Implement basic validation backend for Ibis tables
Taking a stab at #1105.
~No need to review, still very much a WIP. Working on implementing the backend classes now.~
Initial goal is to get this example running:
>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
>>>
>>> df = pd.DataFrame({
... "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
... "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>>
>>> schema_withchecks = pa.DataFrameSchema({
... "probability": pa.Column(
... float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
... # check that the "category" column contains a few discrete
... # values, and the majority of the entries are dogs.
... "category": pa.Column(
... str, [
... pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
... pa.Check(lambda s: (s == "dog").mean() > 0.5),
... ]),
... })
>>>
>>> schema_withchecks.validate(t)[["probability", "category"]]
probability category
0 0.10 dog
1 0.40 dog
2 0.52 cat
3 0.23 duck
4 0.80 dog
5 0.76 dog
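For intuition, here is what the three checks in the schema above assert, evaluated eagerly with pandas on the same data (the Ibis backend would build the equivalent lazy expressions instead of computing eagerly):

```python
import pandas as pd

probability = pd.Series([0.1, 0.4, 0.52, 0.23, 0.8, 0.76])
category = pd.Series(["dog", "dog", "cat", "duck", "dog", "dog"])

# Check 1: every probability is in [0, 1].
assert ((probability >= 0) & (probability <= 1)).all()

# Check 2: only a few discrete category values appear.
assert category.isin(["dog", "cat", "duck"]).all()

# Check 3: the majority of entries are dogs (5 of 6 here).
assert (category == "dog").mean() > 0.5
```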
See https://github.com/unionai-oss/pandera/pull/1451#issuecomment-1877934136 for the current state.
Codecov Report
Attention: Patch coverage is 78.35052% with 63 lines in your changes missing coverage. Please review.
Project coverage is 92.80%. Comparing base (4df61da) to head (612d65a). Report is 131 commits behind head on ibis-dev.
Additional details and impacted files
@@ Coverage Diff @@
## ibis-dev #1451 +/- ##
============================================
- Coverage 94.29% 92.80% -1.50%
============================================
Files 91 130 +39
Lines 7024 9403 +2379
============================================
+ Hits 6623 8726 +2103
- Misses 401 677 +276
This is a great start @deepyaman ! I just created a ibis-dev branch that we can merge all of our work into: https://github.com/unionai-oss/pandera/tree/ibis-dev
@cosmicBboy Happy belated New Year! Hope you enjoyed the holidays.
I made some (admittedly-slow) progress on the Ibis backend, and I'd be happy to get a review to make sure things are on the right track. Been learning a lot about how Pandera works.
This is very incomplete, but I think I've implemented a happy path and an unhappy path that both work:
Happy path
>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
warnings.warn(
>>>
>>> df = pd.DataFrame({
... "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
... "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
r0 := InMemoryTable
data:
PandasDataFrameProxy:
probability category
0 0.10 dog
1 0.40 dog
2 0.52 cat
3 0.23 duck
4 0.80 dog
5 0.76 dog
Selection[r0]
selections:
probability: r0.probability
category: r0.category
>>>
Unhappy path
>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
warnings.warn(
>>>
>>> df = pd.DataFrame({
... "probability": [1, 4, 52, 23, 8, 76],
... "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/ibis/container.py", line 80, in validate
return self.get_backend(check_obj).validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 72, in validate
error_handler.collect_error(
File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
raise schema_error from original_exc
File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 103, in run_schema_component_checks
result = schema_component.validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/pandas/components.py", line 169, in validate
return self.get_backend(check_obj).validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/components.py", line 69, in validate
error_handler.collect_error( # Why indent (unlike in container.py)?
File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
raise schema_error from original_exc
pandera.errors.SchemaError: expected column 'probability' to have type float64, got int64
>>>
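The traceback shows pandera's error-handler pattern: `collect_error` re-raises the `SchemaError` chained to the original exception via `raise ... from ...`. A simplified stdlib sketch of that mechanism (the real handler can also collect errors lazily instead of raising immediately):

```python
class SchemaError(Exception):
    """Stand-in for pandera.errors.SchemaError."""

def collect_error(schema_error: Exception, original_exc: Exception) -> None:
    # Chaining preserves the triggering exception as __cause__,
    # which is why the traceback above shows both call sites.
    raise schema_error from original_exc

try:
    collect_error(
        SchemaError("expected column 'probability' to have type float64, got int64"),
        TypeError("int64 is not float64"),
    )
except SchemaError as exc:
    assert isinstance(exc.__cause__, TypeError)
```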
As a next step (if this looks good), I can probably work on writing tests for this functionality, and then implement additional checks (with the goal of getting the example in the PR description working first). I'll probably start with getting the value check on the probability column (pa.Check(lambda s: (s >= 0) & (s <= 1))) working.
I was searching to see if this exists and what a delight to see your name here @deepyaman thanks for kicking this off!
Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!
I'm still in the process of looking through all the changes in this PR, but the examples you provide here are in the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.
Super excited to get this shipped! 🚀
@cosmicBboy Sounds great! I'm on holiday/unavailable until the end of the month, but I will prioritize adding the tests after that (and then continue making incremental progress)!
Cool, enjoy your time off! Btw the beta release for polars support is out https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b0
Part of it comes with a pandera.api.dataframe module that generalizes some of the common functionality between pandas and polars. It currently only does this for DataFrameModel, but I'm planning on doing the same for DataFrameSchema.
You may want to leverage these classes in your implementation to reduce repetitive logic, depending on how similar/different it is from polars/pandas.
Hey @cosmicBboy @deepyaman I'd like to jump in and get this over the line. It's a little hard to follow where's best to get stuck in, do either of you have any recommendations?
Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:
- The api: these are the user-facing schema definitions: https://github.com/unionai-oss/pandera/tree/main/pandera/api/polars
  - container.py: DataFrame/LazyFrame
  - components.py: Column definition
  - model.py: the DataFrameModel class-based syntax
  - model_config.py: configurations for DataFrameModel
- The backend: this is the implementation of how to actually validate data: https://github.com/unionai-oss/pandera/tree/main/pandera/backends/polars
  - These generally have modules associated with the api.
  - Additionally it has the checks.py backend for how to run the built-in or user-defined checks, and builtin_checks.py that provides the implementations for the built-in pandera checks.
  - Backend registration function: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/polars/register.py
- The type engine: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/polars_engine.py
- This contains the pandera dtype translation layer from the dataframe library to dtypes that pandera can understand.
- Generic classes for python type annotations: https://github.com/unionai-oss/pandera/blob/main/pandera/typing/polars.py
- Finally, the library integration entrypoint: https://github.com/unionai-oss/pandera/blob/main/pandera/polars.py
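Backend registration in the modules above boils down to dispatch-by-type: map the dataframe type being validated to the backend class that knows how to validate it. A hypothetical stdlib sketch of the pattern (pandera's actual mechanism differs in detail):

```python
from typing import Dict

BACKEND_REGISTRY: Dict[type, type] = {}

class BaseSchemaBackend:
    def validate(self, check_obj):
        raise NotImplementedError

def register_backend(check_obj_type: type, backend_cls: type) -> None:
    """Associate a dataframe type with the backend class that validates it."""
    BACKEND_REGISTRY[check_obj_type] = backend_cls

def get_backend(check_obj) -> BaseSchemaBackend:
    """Dispatch on the runtime type of the object being validated."""
    for typ, backend_cls in BACKEND_REGISTRY.items():
        if isinstance(check_obj, typ):
            return backend_cls()
    raise TypeError(f"no backend registered for {type(check_obj).__name__}")

# An Ibis integration would register a backend for ibis.Table, e.g.:
class FakeIbisTable:  # stand-in for ibis.expr.types.Table
    pass

class FakeIbisBackend(BaseSchemaBackend):
    def validate(self, check_obj):
        return check_obj  # pass-through for this sketch

register_backend(FakeIbisTable, FakeIbisBackend)
t = FakeIbisTable()
assert get_backend(t).validate(t) is t
```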
Work in each of these sections is fairly parallelizable. To help with implementation, pandera also provides library-agnostic base classes for some of the common class definitions:
- API spec for generic dataframes: https://github.com/unionai-oss/pandera/tree/main/pandera/api/dataframe
- the user-facing class for Checks: https://github.com/unionai-oss/pandera/blob/main/pandera/api/checks.py
This PR already implements a bunch of the pieces above. Probably the best way to start is to run the happy path code path described here, and start poking around.
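The type-engine layer mentioned above is essentially a translation table from user-facing dtype specs (python builtins, strings, library dtypes) to dtypes the engine understands. A hypothetical sketch; the mappings here are illustrative only, not pandera's actual engine:

```python
# Hypothetical dtype-translation sketch; mappings are illustrative only.
EQUIVALENTS = {
    int: "int64",
    float: "float64",
    str: "string",
    bool: "boolean",
}

def engine_dtype(dtype) -> str:
    """Translate a user-supplied dtype spec into a canonical engine dtype."""
    if isinstance(dtype, str):  # already an engine-native name
        return dtype
    try:
        return EQUIVALENTS[dtype]
    except KeyError:
        raise TypeError(f"dtype {dtype!r} not understood") from None

assert engine_dtype(float) == "float64"
assert engine_dtype("int64") == "int64"
```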
@cosmicBboy Thanks! This is actually a very helpful breakdown. I also briefly chatted with @datajoely earlier today about this (should've updated the issue), but it seems he's taken a look at the Polars backend more recently, and will push up a branch with some things he'd been trying. As a first step, it may make sense to update this PR to leverage some of the generalized functionality more; since I wrote this last December, I referenced the polars-dev branch a lot, but it was a WIP and things must have changed. I just haven't gotten a chance to look again yet. :)
I will have some time to dedicate to this once I get the stuff I'm currently working on out the door, hopefully by later this month. 🤞 But also happy to try to unblock @datajoely if anything is there I can do sooner.
Excellent - super helpful :)
So I've raised #1651 in draft - it would be great to get some thoughts on whether it's smarter to mirror the quite complicated Polars approach or to continue with this basic approach.
FYI I am working on rebasing this to main today and leveraging the new constructs mentioned! Hope to push something up soon.
Amazing @deepyaman! let me know if you need any help
@cosmicBboy Thanks! Can you update unionai-oss:ibis-dev to main?
Also, would it be possible to set up ibis-dev to run the checks, at least for Ibis, if we're using that as the "main" while building out Ibis integration?
Amazing! I just updated the ibis-dev branch to catch up with main
Seems I hadn't fully rebased onto latest main before? In any case, done now, and the happy/unhappy paths from https://github.com/unionai-oss/pandera/pull/1451#issuecomment-1877934136 are still passing.
Let me add some more tests; other than that, I think I need to make sure Ibis is included in the requirements. I was also running into a few issues running mypy locally (getting a bunch of errors on other parts of the codebase), but this should be almost ready to merge. 🤞
I've also talked briefly to @datajoely. My suggestion was:
Maybe you can look at implementing the functionality corresponding to
tests/polars/test_polars_dtypes.py, around coercion? I haven't really looked into it. Can start with the numeric types (I've only added a couple of numeric types so far). I was planning to try adding a non-dtype check next myself. Once that machinery is there, I think we can also parallelize implementation of checks.
@cosmicBboy if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome.
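For reference, the coercion being discussed is the dtype-cast step: instead of raising a SchemaError when a column has the wrong dtype, the schema casts it to the expected one. With pandas this is an eager astype; the Ibis analogue would go through Column.cast. A minimal pandas illustration:

```python
import pandas as pd

# The unhappy-path data from earlier: integers where floats are expected.
s = pd.Series([1, 4, 52, 23, 8, 76])
assert str(s.dtype) == "int64"

# Coercion casts to the expected dtype instead of failing validation.
coerced = s.astype("float64")
assert str(coerced.dtype) == "float64"
```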
This is 🔥 @deepyaman, super excited to get this first pass merged into the dev branch.
if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome
Yes, I think the dtypes tests are the best place to start, both for strict type checking (no coercion) and also coercion.
I haven't tested the changes in https://github.com/unionai-oss/pandera/pull/1451/commits/5bf4279c6ffe99268ed0466bd1a82a555faffafa that well; I do think I broke using native types like int and str for now (have to specify ibis.expr.datatypes.Int64 or ibis.expr.datatypes.String, respectively). Will try to resolve this.
Temporarily changed the base, because I wanted to see CI run. (Update: I see; the annotation-unchecked messages are just notes that can be ignored. Then what I'm seeing locally is good, and I'll just fix the actual type errors!)
FYI @cosmicBboy I rebased this onto latest main (to get CI to run). The main commit I rebased on top of is a041e060c9e8ddaed5e28e27fac86560148d8935.
Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine). However, Ibis dropped 3.8 support a while back, so ibis-framework resolves to 5.1.0 on 3.8—a very old version I don't plan to build for. Shall we skip Ibis tests on 3.8?
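One common way to implement that skip is a pytest-style module-level marker (a sketch; how the suite actually wires this up may differ):

```python
import sys

import pytest

# Skip every test in this module on Python 3.8, where ibis-framework
# resolves to the very old 5.1.0 release.
pytestmark = pytest.mark.skipif(
    sys.version_info < (3, 9),
    reason="ibis-framework no longer supports Python 3.8",
)

def test_placeholder():
    assert True
```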
I think this is in a reviewable (potentially mergeable?) state. I will take a look at whether I can add some more tests. Would it be possible to rebase-merge this to ibis-dev instead of squashing? I have tried to maintain a clean history, and I think it may help in debugging some decisions as we continue development on the ibis-dev branch.
Thanks @deepyaman, I'll take a look at this PR next week.
Shall we skip Ibis tests on 3.8?
Yes!
Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine)
Yes, I've been meaning to drop 3.8 support, maybe this happens when ibis-dev is merged onto main when it's ready for a beta release?
Would it be possible to rebase-merge this to ibis-dev instead of squashing?
Yep, we can do this
Sorry for the delay on reviewing this @deepyaman, will do so this coming week
@deepyaman I just updated the ibis-dev branch to main (@c895dc4), would you mind regenerating the requirements files with make nox-requirements?
@cosmicBboy Done, although CI seems to be choking on some installation issue.
thanks! yeah haven't diagnosed the CI issue yet, for some reason this happens fairly often... restarting CI eventually resolves it
@deepyaman this initial PR looks good to me! There may be more things we'll refactor as we go along, but I want to get this into the ibis-dev branch.
Just to confirm, we want to rebase all the changes instead of squashing right?
That would be great! Up to you what you want to do when you get it into main eventually, but for now I think it will be helpful for me to be able to track some of the changes/decisions, especially for this PR that changed a lot of things. :)
@deepyaman do you mind if we collapse some of the commits into one?
For example, the following can probably be squashed into one:
- Implement happy-path validation for floating types 5ed2d6c
- Implement an unhappy path (got int64, not float64) c1c5c87
then:
- Add DataFrameModel, DataFrameSchema for ibis 3936bf5
- Implement a basic DataFrameSchema class for Ibis 25438e7
There's probably more that make sense to squash, but I just wanted to make sure the commits are as concise as possible.
just rebased, squashing commits that can go together
Sorry, I meant to get back to you over the weekend, but it slipped my mind. Looks good to me!