pandera icon indicating copy to clipboard operation
pandera copied to clipboard

Implement basic validation backend for Ibis tables

Open deepyaman opened this issue 1 year ago • 21 comments
trafficstars

Taking a stab at #1105.

~No need to review, still very much a WIP. Working on implementing the backend classes now.~

Initial goal is to get this example running:

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
>>>
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>>
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
>>>
>>> schema_withchecks.validate(t)[["probability", "category"]]
   probability category
0         0.10      dog
1         0.40      dog
2         0.52      cat
3         0.23     duck
4         0.80      dog
5         0.76      dog

See https://github.com/unionai-oss/pandera/pull/1451#issuecomment-1877934136 for the current state.

deepyaman avatar Dec 20 '23 17:12 deepyaman

Codecov Report

Attention: Patch coverage is 78.35052% with 63 lines in your changes missing coverage. Please review.

Project coverage is 92.80%. Comparing base (4df61da) to head (612d65a). Report is 131 commits behind head on ibis-dev.

Files Patch % Lines
pandera/backends/ibis/base.py 29.26% 29 Missing :warning:
pandera/api/ibis/model.py 73.33% 12 Missing :warning:
pandera/backends/ibis/container.py 78.26% 10 Missing :warning:
pandera/engines/ibis_engine.py 80.00% 9 Missing :warning:
pandera/backends/ibis/components.py 93.33% 3 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##           ibis-dev    #1451      +/-   ##
============================================
- Coverage     94.29%   92.80%   -1.50%     
============================================
  Files            91      130      +39     
  Lines          7024     9403    +2379     
============================================
+ Hits           6623     8726    +2103     
- Misses          401      677     +276     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Dec 20 '23 17:12 codecov[bot]

This is a great start @deepyaman ! I just created a ibis-dev branch that we can merge all of our work into: https://github.com/unionai-oss/pandera/tree/ibis-dev

cosmicBboy avatar Dec 22 '23 15:12 cosmicBboy

@cosmicBboy Happy belated New Year! Hope you enjoyed the holidays.

I made some (admittedly-slow) progress on the Ibis backend, and I'd be happy to get a review to make sure things are on the right track. Been learning a lot about how Pandera works.

This is very incomplete, but I think I'm implemented a happy and unhappy path that work:

Happy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
         probability category
      0         0.10      dog
      1         0.40      dog
      2         0.52      cat
      3         0.23     duck
      4         0.80      dog
      5         0.76      dog

Selection[r0]
  selections:
    probability: r0.probability
    category:    r0.category
>>> 

Unhappy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [1, 4, 52, 23, 8, 76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/ibis/container.py", line 80, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 72, in validate
    error_handler.collect_error(
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 103, in run_schema_component_checks
    result = schema_component.validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/pandas/components.py", line 169, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/components.py", line 69, in validate
    error_handler.collect_error(  # Why indent (unlike in container.py)?
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected column 'probability' to have type float64, got int64
>>> 

As a next step (if this looks good), I can probably work on writing tests for this functionality, and then implement additional checks (with the goal of getting the example in the PR description working first). Probably start with getting the value check on the same column (pa.Check(lambda s: (s >= 0) & (s <= 1))) working.

deepyaman avatar Jan 05 '24 00:01 deepyaman

I was searching to see if this exists and what a delight to see your name here @deepyaman thanks for kicking this off!

datajoely avatar Feb 05 '24 10:02 datajoely

Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here is the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

cosmicBboy avatar Feb 19 '24 04:02 cosmicBboy

Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here is the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

@cosmicBboy Sounds great! I'm on holiday/unavailable until the end of the month, but I will prioritize adding the tests after that (and then continue making incremental process)!

deepyaman avatar Feb 23 '24 15:02 deepyaman

Cool, enjoy your time off! Btw the beta release for polars support is out https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b0

Part of it comes with a pandera.api.dataframe module that generalizes some of the common functionality between pandas and polars. It currently only does this for DataFrameModel, but I'm planning on doing the same for DataFrameSchema.

You may want to leverage these classes in your implementation to reduce repetitive logic, depending on how similar/different it is from polars/pandas.

cosmicBboy avatar Mar 15 '24 18:03 cosmicBboy

Hey @cosmicBboy @deepyaman I'd like to jump in and get this over the line. It's a little hard to follow where's best to get stuck in, do either of you have any recommendations?

datajoely avatar May 15 '24 12:05 datajoely

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

  • The api: these are the user-facing schema definitions: https://github.com/unionai-oss/pandera/tree/main/pandera/api/polars
    • container.py: DataFrame/LazyFrame
    • components.py: Column definition
    • model.py: the DataFrameModel class-based syntax
    • model_config.py: configurations for DataFrameModel
  • The backend: this is the implementation of how to actually validate data: https://github.com/unionai-oss/pandera/tree/main/pandera/backends/polars
    • These generally have modules associated with the api
    • Additionally it has the checks.py backend for how to run the built-in or user-defined checks, and builtin_checks.py that provide the implementations for the built-in pandera checks.
    • Backend registration function: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/polars/register.py
  • The type engine: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/polars_engine.py
    • This contains the pandera dtype translation layer from the dataframe library to dtypes that pandera can understand.
  • Generic classes for python type annotations: https://github.com/unionai-oss/pandera/blob/main/pandera/typing/polars.py
  • Finally, the library integration entrypoint: https://github.com/unionai-oss/pandera/blob/main/pandera/polars.py

Work in each of these sections is fairly parallelizable. To help with implementation, pandera also provides library-agnostic base classes for some of the common class definitions:

  • API spec for generic dataframes: https://github.com/unionai-oss/pandera/tree/main/pandera/api/dataframe
  • the user-facing class for Checks: https://github.com/unionai-oss/pandera/blob/main/pandera/api/checks.py

This PR already implements a bunch of the pieces above. Probably the best way to start is to run the happy path code path described here, and start poking around.

cosmicBboy avatar May 15 '24 19:05 cosmicBboy

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

[...]

@cosmicBboy Thanks! This is actually a very helpful breakdown. I also briefly chatted with @datajoely earlier today about this (should've updated the issue), but it seems he's taken a look at the Polars backend more recently, and will push up a branch with some things he'd been trying. As a first step, it may make sense to update this PR to leverage some of the generalized functionality more; since I wrote this last December, I referenced the polars-dev branch a lot, but it was a WIP and things must have changed. I just haven't gotten a chance to look again yet. :)

I will have some time to dedicate to this once I get the stuff I'm currently working on out the door, hopefully by later this month. 🤞 But also happy to try to unblock @datajoely if anything is there I can do sooner.

deepyaman avatar May 16 '24 03:05 deepyaman

Excellent - super helpful :)

datajoely avatar May 16 '24 07:05 datajoely

So I've raised #1651 in draft - it would be great to get some thoughts whether it's smarter to mirror the quite complicated Polars approach or to continue with this basic approach.

datajoely avatar May 21 '24 10:05 datajoely

FYI I am working on rebasing this to main today and leveraging the new constructs mentioned! Hope to push something up soon.

deepyaman avatar Jun 19 '24 19:06 deepyaman

Amazing @deepyaman! let me know if you need any help

cosmicBboy avatar Jun 19 '24 20:06 cosmicBboy

Amazing @deepyaman! let me know if you need any help

@cosmicBboy Thanks! Can you update unionai-oss:ibis-dev to main?

Also, would it be possible to set up ibis-dev to run the checks, at least for Ibis, if we're using that as the "main" while building out Ibis integration?

deepyaman avatar Jun 20 '24 14:06 deepyaman

Amazing! I just updated the ibis-dev branch to catch up with main

cosmicBboy avatar Jun 20 '24 14:06 cosmicBboy

Seems I hadn't fully rebased onto latest main before? In any case, done now, and the happy/unhappy paths from https://github.com/unionai-oss/pandera/pull/1451#issuecomment-1877934136 are still passing.

Let me add some more tests; other than that, I think I need to make sure Ibis is included in the requirements, and I was running into a few issues running mypy locally (getting a bunch of errors on other parts of the codebase), but should be almost ready to be merged. 🤞

I've also talked briefly to @datajoely. My suggestion was:

Maybe you can look at implementing the functionality corresponding to tests/polars/test_polars_dtypes.py, around coercion? I haven't really looked into it. Can start with the numeric types (I've only added a couple of numeric types so far).

I was planning to try adding a non-dtype check next myself. Once that machinery is there, I think can also parallelize implementation of checks.

@cosmicBboy if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome.

deepyaman avatar Jun 20 '24 14:06 deepyaman

This is 🔥 @deepyaman, super excited to get this first pass merged into the dev branch.

if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome

Yes, I think the dtypes tests are the best place to start, both for strict type checking (no coercion) and also coercion.

cosmicBboy avatar Jun 20 '24 20:06 cosmicBboy

I haven't tested the changes in https://github.com/unionai-oss/pandera/pull/1451/commits/5bf4279c6ffe99268ed0466bd1a82a555faffafa that well; I do think I broke using native types like int and str for now (have to specify ibis.expr.datatypes.Int64 or ibis.expr.datatypes.String, respectively). Will try to resolve this.

Temporarily changed the base, because I wanted to see CI run. (Update: I see; the annotation-unchecked messages are just notes that can be ignored. Then what I'm seeing locally is good, and I'll just fix the actual type errors!)

deepyaman avatar Jun 24 '24 07:06 deepyaman

FYI @cosmicBboy I rebased this onto latest main (to get CI to run). The main commit I rebased on top of is a041e060c9e8ddaed5e28e27fac86560148d8935.

Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine). However, Ibis dropped 3.8 support a while back, so ibis-framework resolves to 5.1.0 on 3.8—a very old version I don't plan to build for. Shall we skip Ibis tests on 3.8?

I think this is in a reviewable (potentially mergeable?) state. I will take a look if I can add some more tests. Would it be possible to rebase-merge this to ibis-dev instead of squashing? Have tried to maintain a clean history, and I think it may help in trying to debug some decisions as continue development on the ibis-dev branch.

deepyaman avatar Jun 26 '24 21:06 deepyaman

Thanks @deepyaman, I'll take a look at this PR next week.

Shall we skip Ibis tests on 3.8?

Yes!

Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine)

Yes, I've been meaning to drop 3.8 support, maybe this happens when ibis-dev is merged onto main when it's ready for a beta release?

Would it be possible to rebase-merge this to ibis-dev instead of squashing?

Yep, we can do this

cosmicBboy avatar Jun 27 '24 14:06 cosmicBboy

Sorry for the delay on reviewing this @deepyaman, will do so this coming week

cosmicBboy avatar Jul 13 '24 20:07 cosmicBboy

@deepyaman I just updated the ibis-dev branch to main (@c895dc4), would you mind regenerating the requirements files with make nox-requirements?

cosmicBboy avatar Jul 13 '24 21:07 cosmicBboy

@deepyaman I just updated the ibis-dev branch to main (@c895dc4), would you mind regenerating the requirements files with make nox-requirements?

@cosmicBboy Done, although CI seems to be choking on some installation issue.

deepyaman avatar Jul 16 '24 16:07 deepyaman

thanks! yeah haven't diagnosed the CI issue yet, for some reason this happens fairly often... restarting CI eventually resolves it

cosmicBboy avatar Jul 16 '24 21:07 cosmicBboy

@deepyaman this initial PR looks good to me! There may be more things we'll refactor as we go along, but I want to get this into the ibis-dev branch.

Just to confirm, we want to rebase all the changes instead of squashing right?

cosmicBboy avatar Jul 18 '24 12:07 cosmicBboy

@deepyaman this initial PR looks good to me! There may be more things we'll refactor as we go along, but I want to get this into the ibis-dev branch.

Just to confirm, we want to rebase all the changes instead of squashing right?

That would be great! Up to you what you want to do when get it into main eventually, but for now I think it will be helpful for me to be able to track some of the changes/decisions, especially for this PR that changed a lot of things. :)

deepyaman avatar Jul 18 '24 17:07 deepyaman

@deepyaman do you mind if we collapse some of the commits into one?

For example, the following can probably be squashed into one:

then:

There's probably more that make sense to squash, but I just wanted to make sure the commits are as concise as possible.

cosmicBboy avatar Jul 19 '24 13:07 cosmicBboy

just rebased, squashing commits that can go together

cosmicBboy avatar Jul 20 '24 01:07 cosmicBboy

@deepyaman do you mind if we collapse some of the commits into one?

For example, the following can probably be squashed into one:

then:

There's probably more that make sense to squash, but I just wanted to make sure the commits are as concise as possible.

Sorry, I meant to get back to you over the weekend, but it slipped my mind. Looks good to me!

deepyaman avatar Jul 22 '24 16:07 deepyaman