pandera
pandera copied to clipboard
Support Ibis Backend
Is your feature request related to a problem? Please describe.
Pandera currently doesn't support validating data in a persistent datastore (e.g. MySQL, Postgres, etc). It would benefit users to be able to write pandera schemas that can then be compiled to a query language (like SQL), executed on a remote DB, that either:
- validates the data in-place, returning an error report if the data is invalid
- validates the data and load it into memory using some framework (e.g. pandas) for further processing
A high-leverage integration to enable this behavior would be with ibis, a data analytics framework that hooks into various backends (duckdb, mysql, postgres, etc).
Describe the solution you'd like
For the MVP integration with ibis:
- Implement a schema specification for ibis
- Implement a backend validator for ibis
- Support writing custom checks
Describe alternatives you've considered NA
@cosmicBboy Hey 👋🏻!
This looks pretty interesting!
Is there anything we can do over in ibis to help enable this? Happy to help!
Thanks @cpcloud ! The pandera internals re-write is still happening (the last PR should be merged soon https://github.com/unionai-oss/pandera/pull/1109), after which I'm gonna start chipping away at a pandera-ibis package to see how well the rewritten internals fit the ibis programming model.
Is there anything we can do over in ibis to help enable this? Happy to help!
At this stage some conceptual help would be much appreciated! The main uncertainty in my mind is how well the current pandera abstractoins fit into ibis.
It would be awesome if the ibis team can take a look at the Schema and Schema Components classes described here and answer this high-level question:
Roughly speaking, how do pandas abstractions map into ibis?
For example:
pandas -> ibis
- DataFrame -> Table
- Series -> ?
- Column -> ?
- Index -> ?
- MultiIndex -> ?
And a follow-up to this would be:
Do the pandera schema and schema components specification cover most of the properties that users would like to validate in an ibis table?
And finally, because pandera relies a lot on user-defined validation Checks:
How do ibis users specify custom operations that they want to do on tables/columns?
@gforsyth this tracks the ibis integration! I'll circle back when I have capacity to get started on an integration in earnest
Hello all, I and my team are working on a couple of data projects and for quite a few months we have been using pandera heavily for most of our row-wise data validation tasks.
Lately, we have moved to sourcing all our tables from a data warehouse and also our data size has grown (some tables have >50 columns ranging from 100-250 GB). And since we did not want to refactor most of our transformation steps, earlier written in pandas, we migrated to using ibis.
While most of the migration was smooth, we are now facing difficulties in pulling the entire table into memory as a pandas.DataFrame and then validating it using pandera (which seems quite natural).
We did try and explore some hand-rolled alternatives by implementing a thin wrapper around validate method of pandera that pulls the table in chunks and then validates it. But it was not as efficient as we would have liked it to be.
Then, to my surprise, I found this thread !
I know that a few core developers from the ibis project have started putting some efforts (#1451 ) to support for pandera-ibis integration. While I am sure this will be a great addition to the pandera ecosystem, I had few questions:
- When we would use
panderawithibis, would we be able to validate the data by using the compute power of the query engine that the ibis backend is connected to (here I assume we are only connected to DB engine-like backends such as DuckDB, BigQuery etc.)? - Would it be possible to do the validation without pulling the entire table into memory? (i.e. not materializing it as a
memtableorpandas.DataFrameunder the hood) - If yes, what would be the type and schema of the
failure_cases?- Type as in an in-memory data structure or a table in the backend? - Like a temporary/persistent table/view on the DB.
- Schema as in currently by default it provides an "index" column to help the developers find which row has failed for which check. But usually a table in the backend may or may not have a primary key that can be used to identify the row uniquely. How would the
failure_casestable present this information?
Nonetheless, I am eagerly waiting for this feature to be rolled out so that my team can get their hands on it.
P.S. I am not an expert in pandera or ibis internals, just a happy & enthusiastic user. 😄
Thanks !
- When we would use
panderawithibis, would we be able to validate the data by using the compute power of the query engine that the ibis backend is connected to (here I assume we are only connected to DB engine-like backends such as DuckDB, BigQuery etc.)?
Yes!
- Would it be possible to do the validation without pulling the entire table into memory? (i.e. not materializing it as a
memtableorpandas.DataFrameunder the hood)
Also yes! At least, the goal is to offload as much computation as possible to the Ibis backend.
If yes, what would be the type and schema of the
failure_cases?
- Type as in an in-memory data structure or a table in the backend? - Like a temporary/persistent table/view on the DB.
- Schema as in currently by default it provides an "index" column to help the developers find which row has failed for which check. But usually a table in the backend may or may not have a primary key that can be used to identify the row uniquely. How would the
failure_casestable present this information?
I will need to look into this; I just tested using the DuckDB backend for the example in #1451, and failure_cases is a pandas dataframe. However, I don't recall trying to actually implement anything for failure_cases, so I will need to do some digging and figure out (1) what type is failure_cases for other pandera backends and (2) is there any reason it can't be an Ibis table for the Ibis backend. I imagine it could be an Ibis table, but I'm not knowledgeable enough on pandera right now to say so confidently.
Hope that helps a bit! And sorry I have left the Ibis backend work in a dangling state; number of priorities keep coming up, but I do hope to resume progress on it soon!
Thanks a lot @deepyaman for your insights ! Looking forward 🤞🏽 😃
Hey @deepyaman now that #1451 is merged, would you mind outlining (just at a high level) what is left to implement? I guess we can go off of this: https://github.com/unionai-oss/pandera/pull/1451#issuecomment-2113325336
Hey @deepyaman now that #1451 is merged, would you mind outlining (just at a high level) what is left to implement? I guess we can go off of this: #1451 (comment)
I think that makes sense! In all fairness, I think a lot is left to implement, because #1451 is the most minimal implementation. :)
I'm going to start with optional types (which I've promised for a bit, and haven't gotten to), and then take the low-hanging fruit of adding more types. That way, can demo something a bit more meaningful in the meantime, even if it's just data type checks. Want to see if can plug it into some existing integrations (like Kedro-Pandera) as-is.
Beyond that, I know @datajoely had mentioned looking at the data generation; will try and take a look at that after.
Hello ibis and pandera team,
I wanted to try out the example from this comment but unfortunately get the below error
pandera.errors.BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pandas.container.DataFrameSchema'>, <class 'ibis.expr.types.relations.Table'>). Looked up the following base classes: (<class 'ibis.expr.types.relations.Table'>, <class 'ibis.expr.types.core.Expr'>, <class 'ibis.common.bases.Immutable'>, <class 'ibis.common.typing.Coercible'>, <class 'ibis.common.bases.Abstract'>, <class 'ibis.expr.types.core._FixedTextJupyterMixin'>, <class 'rich.jupyter.JupyterMixin'>, <class 'object'>)
P.S. I am installing pandera from the ibis-dev branch using plain:
pip install git+https://github.com/unionai-oss/pandera.git@ibis-dev
I am using Python 3.10.12 and ibis==9.2.0
Am I doing something wrong or am I bit too early?
Hello ibis and pandera team,
I wanted to try out the example from this comment but unfortunately get the below error
pandera.errors.BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pandas.container.DataFrameSchema'>, <class 'ibis.expr.types.relations.Table'>). Looked up the following base classes: (<class 'ibis.expr.types.relations.Table'>, <class 'ibis.expr.types.core.Expr'>, <class 'ibis.common.bases.Immutable'>, <class 'ibis.common.typing.Coercible'>, <class 'ibis.common.bases.Abstract'>, <class 'ibis.expr.types.core._FixedTextJupyterMixin'>, <class 'rich.jupyter.JupyterMixin'>, <class 'object'>)P.S. I am installing
panderafrom theibis-devbranch using plain:pip install git+https://github.com/unionai-oss/pandera.git@ibis-devI am using Python 3.10.12 and
ibis==9.2.0Am I doing something wrong or am I bit too early?
Hi @csubhodeep,
Support is very minimal right now, but that exact example should work. I did pip install 'git+https://github.com/unionai-oss/pandera.git@ibis-dev' ibis-framework pyarrow pyarrow-hotfix in a new 3.10.14 environment:
(pandera-test) deepyaman@deepyaman-mac deepyaman % conda list
# packages in environment at /opt/miniconda3/envs/pandera-test:
#
# Name Version Build Channel
annotated-types 0.7.0 pypi_0 pypi
atpublic 4.1.0 pypi_0 pypi
bzip2 1.0.8 h80987f9_6
ca-certificates 2024.7.2 hca03da5_0
ibis-framework 9.2.0 pypi_0 pypi
libffi 3.4.4 hca03da5_1
markdown-it-py 3.0.0 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
multimethod 1.10 pypi_0 pypi
mypy-extensions 1.0.0 pypi_0 pypi
ncurses 6.4 h313beb8_0
numpy 2.0.1 pypi_0 pypi
openssl 3.0.14 h80987f9_0
packaging 24.1 pypi_0 pypi
pandas 2.2.2 pypi_0 pypi
pandera 0.0.0+dev0 pypi_0 pypi
parsy 2.1 pypi_0 pypi
pip 24.0 py310hca03da5_0
pyarrow 17.0.0 pypi_0 pypi
pyarrow-hotfix 0.6 pypi_0 pypi
pydantic 2.8.2 pypi_0 pypi
pydantic-core 2.20.1 pypi_0 pypi
pygments 2.18.0 pypi_0 pypi
python 3.10.14 hb885b13_1
python-dateutil 2.9.0.post0 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
readline 8.2 h1a28f6b_0
rich 13.7.1 pypi_0 pypi
setuptools 69.5.1 py310hca03da5_0
six 1.16.0 pypi_0 pypi
sqlglot 25.6.1 pypi_0 pypi
sqlite 3.45.3 h80987f9_0
tk 8.6.14 h6ba3021_0
toolz 0.12.1 pypi_0 pypi
typeguard 4.3.0 pypi_0 pypi
typing-extensions 4.12.2 pypi_0 pypi
typing-inspect 0.9.0 pypi_0 pypi
tzdata 2024.1 pypi_0 pypi
wheel 0.43.0 py310hca03da5_0
wrapt 1.16.0 pypi_0 pypi
xz 5.4.6 h80987f9_1
zlib 1.2.13 h18a0788_1
Then, I had no issues running the happy- and unhappy-path examples.
hey @deepyaman just wanted to check in: are you implementing the IbisCheckBackend class?
hey @deepyaman just wanted to check in: are you implementing the
IbisCheckBackendclass?
Everything I had in progress is pushed up (or lost anyway, if I missed anything)!
At this point, my plan is to go through the set of what's implemented on the Polars backend and make a list of what still needs to be done--I assume IbisCheckBackend is one of those--but I probably won't have time go do this for at least a week or two.
Cool, I'll take a look at stubbing out the IbisCheckBackend and spec out the rest of the core checks that need to be implemented, but so far it's looking great!
Is there any timeline (even very rough timeline) for implementation of this? My enterprise is currently evaluating options, and we are heavily considering Ibis + Pandera.
Is there any timeline (even very rough timeline) for implementation of this? My enterprise is currently evaluating options, and we are heavily considering Ibis + Pandera.
@RH-TLagrone Yes! The implementation is pretty much feature-complete for an initial release, and we expect it to be out later this month. I've recently finished putting together the user guide, and we're just working through some final fixes.