# [wip] core and backend pandera API
fixes #381
Fundamentally, pandera is about defining types for statistical data containers (e.g. pandas DataFrames, xarray Datasets, SQL tables) that serve to:
- self-document the properties of data in code
- validate those properties at run-time
- provide some basic type-linting capabilities (currently still somewhat limited)
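For context, here's a minimal example of the existing user-facing API, which this PR leaves unchanged:

```python
import pandas as pd
import pandera as pa

# the schema self-documents the expected properties of the data
# and validates them at runtime
schema = pa.DataFrameSchema({
    "price": pa.Column(float, pa.Check.ge(0)),
    "category": pa.Column(str, pa.Check.isin(["A", "B", "C"])),
})

df = pd.DataFrame({"price": [5.0, 12.5], "category": ["A", "B"]})
validated = schema.validate(df)  # raises SchemaError if a check fails
```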
## What
This PR introduces two new subpackages to pandera:
- `core`: this defines schema specifications for particular families of data containers, e.g. "pandas-like dataframes". This module is responsible for defining the properties held by these data containers.
- `backends`: this defines the underlying implementation of the validation logic for a given schema specification. This module is responsible for actually verifying those properties for a specific type of data container (e.g. pandas, modin, dask, or pyspark.pandas DataFrames).
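To make the split concrete, here's a hypothetical sketch of how a core schema might delegate to a registered backend. The names here (`BaseSchema`, `BaseSchemaBackend`, `register_backend`, `get_backend`) are illustrative only, not the final API:

```python
from typing import Dict, Type

# hypothetical registry mapping container types to backend instances
BACKEND_REGISTRY: Dict[Type, "BaseSchemaBackend"] = {}

class BaseSchema:
    """Core: declares the properties a data container should hold."""
    def validate(self, check_obj):
        # delegate to whichever backend is registered for this container type
        return get_backend(type(check_obj)).validate(self, check_obj)

class BaseSchemaBackend:
    """Backend: implements validation for a specific container type."""
    def validate(self, schema: BaseSchema, check_obj):
        raise NotImplementedError

def register_backend(container_type: Type, backend: BaseSchemaBackend) -> None:
    BACKEND_REGISTRY[container_type] = backend

def get_backend(container_type: Type) -> BaseSchemaBackend:
    # walk the MRO so that subclasses of registered containers also resolve
    for cls in container_type.__mro__:
        if cls in BACKEND_REGISTRY:
            return BACKEND_REGISTRY[cls]
    raise TypeError(f"no backend registered for {container_type}")
```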
## Why?
The purpose of this PR is to:
- decouple the schema specification from the machinery that actually runs the validation rules.
- provide base classes upon which the community can build schema specifications and backends for potentially any data structure, including `xarray.Dataset`s, numpy arrays, tensor objects, etc., all with a focus on:
  - coercion/validation of data types per field
  - validation of arbitrary properties, in particular statistical properties across records (or the container's equivalent of records)
This change will not affect the user-facing API of pandera and will not introduce any breaking changes.
## Design Implications
- For each `core` schema specification, there may be multiple `backends` that apply to it. For example, I can define a `DataFrameSchema` and, depending on the type of dataframe that I supply to `schema.validate`, pandera will delegate to a particular backend.
- Instead of trying to design "one schema specification to rule them all", pandera will try to strike a balance between keeping the API surface as small as possible and embracing the richness and diversity of dataframe-like objects that now exist.
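Building on the hypothetical registry sketch above, this delegation might look like the following (again, illustrative names only):

```python
import pandas as pd

class PandasBackend(BaseSchemaBackend):
    """Hypothetical backend that validates pandas DataFrames."""
    def validate(self, schema: BaseSchema, check_obj: pd.DataFrame) -> pd.DataFrame:
        # ... run dtype coercion and checks using pandas operations ...
        return check_obj

register_backend(pd.DataFrame, PandasBackend())

schema = BaseSchema()
schema.validate(pd.DataFrame({"x": [1, 2]}))  # resolves to PandasBackend
# a modin or dask backend could later be registered against the same
# schema specification without touching the core module
```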
## Phases
This PR will be the first phase in a multi-phase approach to improving extensibility:
- [this PR] introduce the decoupled architecture, with no fundamental changes to pandera's functionality or implementation
- introduce multiple `backends` for the dataframe object: clean up the dataframe validation code by having separate backends for `modin`, `dask`, `pyspark.pandas`, and `geopandas` (the motivation here is to ensure the backend abstraction makes sense).
- introduce a schema specification and backend for SQL tables: borrow the specification from dataframe schemas to introduce SQL-native validation using SQLAlchemy (the motivation here is to ensure the core + backend abstractions make sense from an extensibility standpoint; a hypothetical sketch follows this list).
- introduce a `pandera-contrib` ecosystem: this exists to host other pandera-compliant projects (e.g. https://github.com/carbonplan/xarray-schema) so that the broader community can use pandera's core and backend abstractions to build their own schema types.
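As a taste of the SQL phase, here's a purely illustrative sketch of a SQLAlchemy-based backend, reusing the hypothetical classes from the earlier sketches; `schema.columns` and the `nullable` attribute are assumptions about the eventual schema interface, not a committed design:

```python
import sqlalchemy as sa

class SQLAlchemyBackend(BaseSchemaBackend):
    """Hypothetical backend: verify properties by pushing checks into SQL."""
    def __init__(self, engine: sa.engine.Engine):
        self.engine = engine

    def validate(self, schema, table: sa.Table) -> sa.Table:
        with self.engine.connect() as conn:
            for name, column in schema.columns.items():  # assumed interface
                if column.nullable:
                    continue
                # compile a non-null property to SQL: count violating rows
                n_nulls = conn.execute(
                    sa.select(sa.func.count()).where(table.c[name].is_(None))
                ).scalar()
                if n_nulls:
                    raise ValueError(f"column {name!r} has {n_nulls} null values")
        return table
```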