
[wip] core and backend pandera API

Open cosmicBboy opened this issue 1 year ago • 0 comments

fixes #381

Fundamentally, pandera is about defining types for statistical data containers (e.g. pandas DataFrames, xarray Datasets, SQL tables) that serve to:

  1. self-document the properties of data in code
  2. validate those properties at run-time
  3. provide some basic type-linting capabilities (currently still somewhat limited)

What

This PR introduces two new subpackages to pandera:

  • core: this defines schema specifications for particular families of data containers, e.g. "pandas-like dataframes". This module is responsible for defining the properties held by these data containers.
  • backends: this defines the underlying implementation of the validation logic for a given schema specification. This module is responsible for actually verifying those properties against a specific type of data container (e.g. pandas, modin, dask, or pyspark.pandas DataFrames).
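The core/backends split can be sketched roughly as follows. All names here are illustrative, not pandera's actual internal API: a core class declares *what* must hold, and a backend implements *how* it is checked for one container type (here, a toy backend over plain dicts of lists):

```python
# Hypothetical sketch of the core/backends split. Class names are
# illustrative only and do not reflect pandera's internal API.
from abc import ABC, abstractmethod

class BaseSchema:
    """core: declares *what* properties the data must have."""
    def __init__(self, columns):
        self.columns = columns  # e.g. {"price": float}

class BaseSchemaBackend(ABC):
    """backends: implements *how* those properties are verified."""
    @abstractmethod
    def validate(self, schema, data):
        ...

class DictBackend(BaseSchemaBackend):
    """Toy backend that validates plain dicts of lists."""
    def validate(self, schema, data):
        for name, dtype in schema.columns.items():
            if name not in data:
                raise ValueError(f"missing column: {name}")
            if not all(isinstance(v, dtype) for v in data[name]):
                raise TypeError(f"column {name} is not {dtype.__name__}")
        return data

schema = BaseSchema({"price": float})
DictBackend().validate(schema, {"price": [1.5, 2.0]})
```

The point of the split is that the same `BaseSchema` instance could be handed to a different backend (pandas, dask, SQL, ...) without changing the specification itself.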

Why?

The purpose of this PR is to:

  • decouple and abstract the specification from the thing that actually runs the validation rules.
  • provide base classes upon which the community can build schema specifications and backends for potentially any data structure, including xarray.Datasets, numpy arrays, tensor objects, etc., all with a focus on:
    • coercion/validation of data types per field
    • validation of arbitrary properties, in particular statistical properties across records or the container's equivalent of records.

This change will not affect the user-facing API of pandera and will not introduce any breaking changes.

Design Implications

  1. For each core schema specification, there may be multiple backends that can apply to it. For example, I can define a DataFrameSchema and, depending on the type of dataframe that I supply to schema.validate, pandera will delegate to a particular backend.
  2. Instead of trying to design "one schema specification to rule them all", pandera will try to strike a balance between keeping the API surface as small as possible while embracing the richness and diversity of dataframe-like objects that now exist.
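The delegation described in point 1 might be implemented as a simple type-based registry; the sketch below is illustrative only and does not reflect pandera's actual implementation:

```python
# Hypothetical sketch of backend dispatch by container type.
# One schema specification; the backend is chosen from the type of
# the container passed to validate().
BACKENDS: dict = {}

def register_backend(container_type, backend):
    """Associate a backend with a container type."""
    BACKENDS[container_type] = backend

def get_backend(container):
    """Look up the backend matching the container's type."""
    for typ, backend in BACKENDS.items():
        if isinstance(container, typ):
            return backend
    raise TypeError(f"no backend registered for {type(container).__name__}")

register_backend(dict, "dict-backend")
register_backend(list, "list-backend")
```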

Phases

This PR will be the first phase in a multi-phase approach to improving extensibility:

  1. [this PR] introduce decoupled architecture, with no fundamental changes to pandera's functionality and implementation
  2. introduce multiple backends for the dataframe object: clean up the dataframe validation code by having separate backends for modin, dask, pyspark.pandas, geopandas (the motivation here is to ensure the backend abstraction makes sense).
  3. introduce a schema specification and backend for SQL tables: borrow the specification from dataframe schemas to introduce SQL-native validation using SQLAlchemy (the motivation here is to ensure the core + backend abstractions make sense from an extensibility standpoint)
  4. introduce a pandera-contrib ecosystem: this exists to host other pandera-compliant projects (e.g. https://github.com/carbonplan/xarray-schema) so that the broader community can use pandera's core and backend abstractions to build their own schema types.
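To make phase 3 concrete: SQL-native validation means expressing checks as queries so the data never has to leave the database. The sketch below uses the stdlib sqlite3 module purely for illustration; the actual phase-3 work would build on SQLAlchemy:

```python
# Hedged sketch of SQL-native validation: a "price > 0" check is
# expressed as a query that counts violating rows. Illustrated with
# stdlib sqlite3 rather than SQLAlchemy.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (price REAL)")
conn.executemany("INSERT INTO prices VALUES (?)", [(1.5,), (2.0,)])

violations = conn.execute(
    "SELECT COUNT(*) FROM prices WHERE price <= 0 OR price IS NULL"
).fetchone()[0]
```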

cosmicBboy avatar Aug 12 '22 19:08 cosmicBboy