pandera icon indicating copy to clipboard operation
pandera copied to clipboard

Support Ibis Backend

Open cosmicBboy opened this issue 2 years ago • 15 comments

Is your feature request related to a problem? Please describe.

Pandera currently doesn't support validating data in a persistent datastore (e.g. MySQL, Postgres, etc). It would benefit users to be able to write pandera schemas that can then be compiled to a query language (like SQL), executed on a remote DB, that either:

  1. validates the data in-place, returning an error report if the data is invalid
  2. validates the data and load it into memory using some framework (e.g. pandas) for further processing

A high-leverage integration to enable this behavior would be with ibis, a data analytics framework that hooks into various backends (duckdb, mysql, postgres, etc).

Describe the solution you'd like

For the MVP integration with ibis:

  • Implement a schema specification for ibis
  • Implement a backend validator for ibis
  • Support writing custom checks

Describe alternatives you've considered NA

cosmicBboy avatar Mar 09 '23 16:03 cosmicBboy

@cosmicBboy Hey 👋🏻!

This looks pretty interesting!

Is there anything we can do over in ibis to help enable this? Happy to help!

cpcloud avatar Mar 13 '23 16:03 cpcloud

Thanks @cpcloud ! The pandera internals re-write is still happening (the last PR should be merged soon https://github.com/unionai-oss/pandera/pull/1109), after which I'm gonna start chipping away at a pandera-ibis package to see how well the rewritten internals fit the ibis programming model.

Is there anything we can do over in ibis to help enable this? Happy to help!

At this stage some conceptual help would be much appreciated! The main uncertainty in my mind is how well the current pandera abstractoins fit into ibis.

It would be awesome if the ibis team can take a look at the Schema and Schema Components classes described here and answer this high-level question:

Roughly speaking, how do pandas abstractions map into ibis?

For example:

pandas -> ibis

  • DataFrame -> Table
  • Series -> ?
  • Column -> ?
  • Index -> ?
  • MultiIndex -> ?

And a follow-up to this would be:

Do the pandera schema and schema components specification cover most of the properties that users would like to validate in an ibis table?

And finally, because pandera relies a lot on user-defined validation Checks:

How do ibis users specify custom operations that they want to do on tables/columns?

cosmicBboy avatar Mar 13 '23 22:03 cosmicBboy

@gforsyth this tracks the ibis integration! I'll circle back when I have capacity to get started on an integration in earnest

cosmicBboy avatar Jul 15 '23 19:07 cosmicBboy

Hello all, I and my team are working on a couple of data projects and for quite a few months we have been using pandera heavily for most of our row-wise data validation tasks.

Lately, we have moved to sourcing all our tables from a data warehouse and also our data size has grown (some tables have >50 columns ranging from 100-250 GB). And since we did not want to refactor most of our transformation steps, earlier written in pandas, we migrated to using ibis. While most of the migration was smooth, we are now facing difficulties in pulling the entire table into memory as a pandas.DataFrame and then validating it using pandera (which seems quite natural).

We did try and explore some hand-rolled alternatives by implementing a thin wrapper around validate method of pandera that pulls the table in chunks and then validates it. But it was not as efficient as we would have liked it to be.

Then, to my surprise, I found this thread !

I know that a few core developers from the ibis project have started putting some efforts (#1451 ) to support for pandera-ibis integration. While I am sure this will be a great addition to the pandera ecosystem, I had few questions:

  1. When we would use pandera with ibis, would we be able to validate the data by using the compute power of the query engine that the ibis backend is connected to (here I assume we are only connected to DB engine-like backends such as DuckDB, BigQuery etc.)?
  2. Would it be possible to do the validation without pulling the entire table into memory? (i.e. not materializing it as a memtable or pandas.DataFrame under the hood)
  3. If yes, what would be the type and schema of the failure_cases?
    • Type as in an in-memory data structure or a table in the backend? - Like a temporary/persistent table/view on the DB.
    • Schema as in currently by default it provides an "index" column to help the developers find which row has failed for which check. But usually a table in the backend may or may not have a primary key that can be used to identify the row uniquely. How would the failure_cases table present this information?

Nonetheless, I am eagerly waiting for this feature to be rolled out so that my team can get their hands on it.

P.S. I am not an expert in pandera or ibis internals, just a happy & enthusiastic user. 😄

Thanks !

csubhodeep avatar Jun 10 '24 15:06 csubhodeep

  1. When we would use pandera with ibis, would we be able to validate the data by using the compute power of the query engine that the ibis backend is connected to (here I assume we are only connected to DB engine-like backends such as DuckDB, BigQuery etc.)?

Yes!

  1. Would it be possible to do the validation without pulling the entire table into memory? (i.e. not materializing it as a memtable or pandas.DataFrame under the hood)

Also yes! At least, the goal is to offload as much computation as possible to the Ibis backend.

  1. If yes, what would be the type and schema of the failure_cases?

    • Type as in an in-memory data structure or a table in the backend? - Like a temporary/persistent table/view on the DB.
    • Schema as in currently by default it provides an "index" column to help the developers find which row has failed for which check. But usually a table in the backend may or may not have a primary key that can be used to identify the row uniquely. How would the failure_cases table present this information?

I will need to look into this; I just tested using the DuckDB backend for the example in #1451, and failure_cases is a pandas dataframe. However, I don't recall trying to actually implement anything for failure_cases, so I will need to do some digging and figure out (1) what type is failure_cases for other pandera backends and (2) is there any reason it can't be an Ibis table for the Ibis backend. I imagine it could be an Ibis table, but I'm not knowledgeable enough on pandera right now to say so confidently.

Hope that helps a bit! And sorry I have left the Ibis backend work in a dangling state; number of priorities keep coming up, but I do hope to resume progress on it soon!

deepyaman avatar Jun 13 '24 16:06 deepyaman

Thanks a lot @deepyaman for your insights ! Looking forward 🤞🏽 😃

csubhodeep avatar Jun 13 '24 22:06 csubhodeep

Hey @deepyaman now that #1451 is merged, would you mind outlining (just at a high level) what is left to implement? I guess we can go off of this: https://github.com/unionai-oss/pandera/pull/1451#issuecomment-2113325336

cosmicBboy avatar Jul 22 '24 16:07 cosmicBboy

Hey @deepyaman now that #1451 is merged, would you mind outlining (just at a high level) what is left to implement? I guess we can go off of this: #1451 (comment)

I think that makes sense! In all fairness, I think a lot is left to implement, because #1451 is the most minimal implementation. :)

I'm going to start with optional types (which I've promised for a bit, and haven't gotten to), and then take the low-hanging fruit of adding more types. That way, can demo something a bit more meaningful in the meantime, even if it's just data type checks. Want to see if can plug it into some existing integrations (like Kedro-Pandera) as-is.

Beyond that, I know @datajoely had mentioned looking at the data generation; will try and take a look at that after.

deepyaman avatar Jul 24 '24 14:07 deepyaman

Hello ibis and pandera team,

I wanted to try out the example from this comment but unfortunately get the below error

pandera.errors.BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pandas.container.DataFrameSchema'>, <class 'ibis.expr.types.relations.Table'>). Looked up the following base classes: (<class 'ibis.expr.types.relations.Table'>, <class 'ibis.expr.types.core.Expr'>, <class 'ibis.common.bases.Immutable'>, <class 'ibis.common.typing.Coercible'>, <class 'ibis.common.bases.Abstract'>, <class 'ibis.expr.types.core._FixedTextJupyterMixin'>, <class 'rich.jupyter.JupyterMixin'>, <class 'object'>)

P.S. I am installing pandera from the ibis-dev branch using plain:

pip install git+https://github.com/unionai-oss/pandera.git@ibis-dev

I am using Python 3.10.12 and ibis==9.2.0

Am I doing something wrong or am I bit too early?

csubhodeep avatar Jul 26 '24 22:07 csubhodeep

Hello ibis and pandera team,

I wanted to try out the example from this comment but unfortunately get the below error

pandera.errors.BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pandas.container.DataFrameSchema'>, <class 'ibis.expr.types.relations.Table'>). Looked up the following base classes: (<class 'ibis.expr.types.relations.Table'>, <class 'ibis.expr.types.core.Expr'>, <class 'ibis.common.bases.Immutable'>, <class 'ibis.common.typing.Coercible'>, <class 'ibis.common.bases.Abstract'>, <class 'ibis.expr.types.core._FixedTextJupyterMixin'>, <class 'rich.jupyter.JupyterMixin'>, <class 'object'>)

P.S. I am installing pandera from the ibis-dev branch using plain:

pip install git+https://github.com/unionai-oss/pandera.git@ibis-dev

I am using Python 3.10.12 and ibis==9.2.0

Am I doing something wrong or am I bit too early?

Hi @csubhodeep,

Support is very minimal right now, but that exact example should work. I did pip install 'git+https://github.com/unionai-oss/pandera.git@ibis-dev' ibis-framework pyarrow pyarrow-hotfix in a new 3.10.14 environment:

(pandera-test) deepyaman@deepyaman-mac deepyaman % conda list
# packages in environment at /opt/miniconda3/envs/pandera-test:
#
# Name                    Version                   Build  Channel
annotated-types           0.7.0                    pypi_0    pypi
atpublic                  4.1.0                    pypi_0    pypi
bzip2                     1.0.8                h80987f9_6  
ca-certificates           2024.7.2             hca03da5_0  
ibis-framework            9.2.0                    pypi_0    pypi
libffi                    3.4.4                hca03da5_1  
markdown-it-py            3.0.0                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
multimethod               1.10                     pypi_0    pypi
mypy-extensions           1.0.0                    pypi_0    pypi
ncurses                   6.4                  h313beb8_0  
numpy                     2.0.1                    pypi_0    pypi
openssl                   3.0.14               h80987f9_0  
packaging                 24.1                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
pandera                   0.0.0+dev0               pypi_0    pypi
parsy                     2.1                      pypi_0    pypi
pip                       24.0            py310hca03da5_0  
pyarrow                   17.0.0                   pypi_0    pypi
pyarrow-hotfix            0.6                      pypi_0    pypi
pydantic                  2.8.2                    pypi_0    pypi
pydantic-core             2.20.1                   pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
python                    3.10.14              hb885b13_1  
python-dateutil           2.9.0.post0              pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
readline                  8.2                  h1a28f6b_0  
rich                      13.7.1                   pypi_0    pypi
setuptools                69.5.1          py310hca03da5_0  
six                       1.16.0                   pypi_0    pypi
sqlglot                   25.6.1                   pypi_0    pypi
sqlite                    3.45.3               h80987f9_0  
tk                        8.6.14               h6ba3021_0  
toolz                     0.12.1                   pypi_0    pypi
typeguard                 4.3.0                    pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
typing-inspect            0.9.0                    pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
wheel                     0.43.0          py310hca03da5_0  
wrapt                     1.16.0                   pypi_0    pypi
xz                        5.4.6                h80987f9_1  
zlib                      1.2.13               h18a0788_1  

Then, I had no issues running the happy- and unhappy-path examples.

deepyaman avatar Jul 26 '24 23:07 deepyaman

hey @deepyaman just wanted to check in: are you implementing the IbisCheckBackend class?

cosmicBboy avatar Sep 30 '24 23:09 cosmicBboy

hey @deepyaman just wanted to check in: are you implementing the IbisCheckBackend class?

Everything I had in progress is pushed up (or lost anyway, if I missed anything)!

At this point, my plan is to go through the set of what's implemented on the Polars backend and make a list of what still needs to be done--I assume IbisCheckBackend is one of those--but I probably won't have time go do this for at least a week or two.

deepyaman avatar Oct 01 '24 13:10 deepyaman

Cool, I'll take a look at stubbing out the IbisCheckBackend and spec out the rest of the core checks that need to be implemented, but so far it's looking great!

cosmicBboy avatar Oct 03 '24 00:10 cosmicBboy

Is there any timeline (even very rough timeline) for implementation of this? My enterprise is currently evaluating options, and we are heavily considering Ibis + Pandera.

RH-TLagrone avatar Jun 09 '25 21:06 RH-TLagrone

Is there any timeline (even very rough timeline) for implementation of this? My enterprise is currently evaluating options, and we are heavily considering Ibis + Pandera.

@RH-TLagrone Yes! The implementation is pretty much feature-complete for an initial release, and we expect it to be out later this month. I've recently finished putting together the user guide, and we're just working through some final fixes.

deepyaman avatar Jun 09 '25 22:06 deepyaman