
Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc

Open cosmicBboy opened this issue 3 years ago • 16 comments

Is your feature request related to a problem? Please describe.

Extending pandera to non-pandas dataframe-like structures is a challenge today because the schema and schema component class definitions are tightly coupled to the pandas API. For example, the DataFrameSchema.validate method assumes that validated objects follow the pandas API.

Potential Solutions

  1. Abstract out the core pandera interface into Schema, SchemaComponent, and Check abstract base classes so that core and third-party pandera schemas can be easily developed on top of it. Subclasses of these base classes would implement the validation logic for a specific library, e.g. SparkSchema, PandasSchema, etc.
  2. Provide a validation engine interface where core and third-party developers can register and use different validation backends depending on the type of dataframe implementation (e.g. pandas, spark, dask, etc) being used, similar to the proposal in #369. The public-facing API won't change: different dataframe types would be validated via different (non-mutually exclusive) approaches:
    • at validation time, pandera delegates to the appropriate engine based on the type of obj when schema.validate(obj) is called.
    • add an engine: str option to explicitly specify which engine to use (question: should this go in __init__, validate, or both?). A rough sketch of this registry/dispatch idea is shown below.
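
For illustration only, here's a rough sketch of what the dispatch could look like. ValidationEngine, register_engine, and get_engine are hypothetical names for this sketch, not an existing pandera API:

from typing import Dict, Optional, Type

# hypothetical registry mapping engine names to engine implementations
_ENGINES: Dict[str, "ValidationEngine"] = {}


class ValidationEngine:
    """Base class for engines; each subclass targets one dataframe library."""

    #: dataframe type this engine knows how to validate (e.g. pd.DataFrame)
    dataframe_type: Type

    def validate(self, schema, obj):
        raise NotImplementedError


def register_engine(name: str, engine: ValidationEngine) -> None:
    """Register a core or third-party validation engine."""
    _ENGINES[name] = engine


def get_engine(obj, engine: Optional[str] = None) -> ValidationEngine:
    """Explicitly select an engine by name, or dispatch on type(obj)."""
    if engine is not None:
        return _ENGINES[engine]
    for candidate in _ENGINES.values():
        if isinstance(obj, candidate.dataframe_type):
            return candidate
    raise TypeError(f"no validation engine registered for {type(obj)}")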

Describe the solution you'd like

Because this is quite a momentous change in pandera's scope (supporting more than just pandas dataframes), I'll first reiterate pandera's design philosophy:

  1. minimize the proliferation of classes in the public-facing API
  2. the schema-definition interface should be isomorphic to the data structure being validated, i.e. defining a dataframe schema should feel like defining a dataframe
  3. prioritize flexibility/expressiveness of validation functions, and add built-ins for common checks (based on feature parity with other similar schema libraries, or by popular request)

In keeping with these principles, I propose going with solution (2), in order to prevent an increase in the complexity and surface area of the user-facing API (DaskSchema, PandasSchema, SparkSchema, VaexSchema, etc).

edit: Actually with solution (1), one approach that would keep the API surface area small is to use a subpackage pattern that replicates the pandera interface but with the alternative backend:

import pandera.spark as pa

spark_schema = pa.DataFrameSchema({...})

class SparkSchema(pa.SchemaModel):
    ...

Etc...

import pandera.dask
import pandera.modin

Will need to think through the pros and cons of 1 vs 2 some more...

Re: data synthesis strategies, which are used purely for testing and not meant to generate massive amounts of data, we could just fall back to pandas and convert the synthesized data to the corresponding dataframe type, assuming the df library supports this, e.g. spark.createDataFrame.
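
Something like this (a minimal sketch, assuming the existing pandas-based schema.example() strategy and an active SparkSession):

# minimal sketch: synthesize data with the existing pandas-based strategies,
# then hand it off to the target dataframe library (spark in this example).
import pandera as pa
from pyspark.sql import SparkSession

schema = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.ge(0))})

# pandas dataframe synthesized from the schema's hypothesis strategy
pandas_df = schema.example(size=10)

# convert to the target dataframe type, assuming the library supports it
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)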

cosmicBboy avatar Jan 12 '21 14:01 cosmicBboy

Initial Thoughts

Currently, the schema and check classes conflate the specification of schema properties with the validation of those properties on some data. We may want to separate these two concerns.

  • DataFrameSchema collects all column types and checks and does some basic schema validations to make sure the specification is valid (raises SchemaInitError if invalid).
  • DataFrameSchema.validate should delegate the validation of some input data to a ValidationEngine. The validation engine performs the following operations:
    • checks strictness criteria, i.e. only columns specified in schema are in the dataframe (optional)
    • checks dataframe column order against schema column order (optional)
    • coerces columns to types specified (optional)
    • expands schema regex columns based on dataframe columns
    • run schema component (column/index) checks
      • check for nulls (optional)
      • check for duplicates (optional)
      • check datatype
      • run Check validations
    • run dataframe-level checks
  • _CheckBase needs to delegate the implementation of groupby, element_wise, agg, and potentially other modifiers (see here) to the underlying dataframe library via ValidationEngine.
  • the ValidationEngine would also have to supply implementations for built-in Checks. This can happen incrementally, such that an error is raised if the implementation isn't available for a particular dataframe library (see the sketch after this list).
  • the strategies module needs to be extended to support other dataframe types. Since hypothesis supports numpy and pandas, it makes sense to use the existing strategies logic to generate a pandas dataframe, convert it to some other desired format (e.g. koalas, modin, dask, etc.), and see how far that gets us.
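
As a purely hypothetical sketch of the division of labor above, a ValidationEngine interface might look something like the following; the method names are placeholders, not a final design:

from abc import ABC, abstractmethod


class ValidationEngine(ABC):
    """Hypothetical base class implemented once per dataframe library."""

    @abstractmethod
    def coerce_dtype(self, obj, column_name, dtype):
        """Coerce a column of ``obj`` to the schema-specified dtype."""

    @abstractmethod
    def check_dtype(self, obj, column_name, dtype):
        """Check that the column's dtype matches the schema."""

    @abstractmethod
    def check_nulls(self, obj, column_name):
        """Check for null values in a column (optional per-column setting)."""

    @abstractmethod
    def check_duplicates(self, obj, column_name):
        """Check for duplicate values in a column (optional per-column setting)."""

    @abstractmethod
    def run_check(self, obj, check):
        """Run a Check (element-wise, groupby, agg, ...) using the library's API."""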

Here's a high-level sketch of the API:

# pandera contributor to codebase or custom third-party engine
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # implement a bunch of stuff
    ...

register_validation_engine(MySpecialDataFrameValidationEngine)

# end-user interaction, with hypothetical special_dataframe package.
import pandera as pa
from special_dataframe import MySpecialDataFrame

special_df = MySpecialDataFrame(...)

schema = pa.DataFrameSchema({...})
schema.validate(special_df)

cosmicBboy avatar Apr 10 '21 20:04 cosmicBboy

  • checks strictness criteria, i.e. only columns specified in schema are in the dataframe (optional)
  • checks dataframe column order against schema column order (optional)
  • expands schema regex columns based on dataframe columns

I think those operations can be handled by DataFrameSchema, provided that the engine exposes get_columns(df)/set_columns(df). "columns" here refers to a list of pandera.Column. edit: It occurred to me that this idea may be too restrictive for multi-dimensional dataframes (like xarray), unless DataFrameSchema knows about multi-dimensions.
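
Purely illustrative sketch of the accessor idea, simplified to use column labels rather than pandera.Column objects, with pandas as the example backend:

from typing import List

# illustrative only: an engine exposing column accessors so that
# DataFrameSchema itself can handle strictness, ordering, and regex expansion.
class PandasEngine:
    def get_columns(self, df) -> List[str]:
        """Return the dataframe's column labels."""
        return list(df.columns)

    def set_columns(self, df, columns: List[str]):
        """Return the dataframe restricted/reordered to ``columns``."""
        return df[columns]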

We could merge the idea of Backend outlined in #369 with ValidationEngine. That would add the responsibility of registering dtypes.

Question: What to do with pandera.Index? Most DataFrame libraries don't have this concept. If we want to minimize the proliferation of classes in the public-facing API, which I totally agree with, we need to keep set_index()/reset_index() on DataFrameSchema but raise an error if the engine does not support it.

jeffzi avatar Apr 11 '21 19:04 jeffzi

Any ETA on Modin support?

crypdick avatar Jun 15 '21 21:06 crypdick

hey @crypdick, once #504 is merged (should be in the next few days) I'm going to tackle this issue.

The plan right now is to make a ValidationEngine base class and PandasValidationEngine with native support for pandas, modin, and koalas.

I've done a little bit of prototyping of the new validation engine but it still needs a bunch of work... I'm going to push for a finished solution before scipy conf this year, so ETA mid-July?

cosmicBboy avatar Jun 16 '21 13:06 cosmicBboy

Went through the discussion and we'd certainly be interested in contributing a Fugue ValidationEngine. We'll keep an eye out for the PandasValidationEngine and the koalas/modin support and see if Fugue has direct mappings to the implementation you arrive at!

kvnkho avatar Sep 18 '21 19:09 kvnkho

Hi, I was just wondering if it's possible to use pandera to define schemas for n-dimensional numpy arrays, and hence to use pandera with xarray.DataArray objects, just as pandera is currently used with pandas.DataFrames?

JackKelly avatar Oct 08 '21 08:10 JackKelly

@JackKelly I'd love to add support for numpy+xarray, but unfortunately it's currently not possible.

After this PR is merged (still a WIP) we'll have a much better interface for extending pandera to other non-pandas data structures; numpy and xarray would be natural ones to support in pandera.

Out of curiosity (looking at https://github.com/openclimatefix/nowcasting_dataset/issues/211) is your primary use-case to check data types and dimensions of xarray objects?

cosmicBboy avatar Oct 11 '21 15:10 cosmicBboy

Thanks loads for the reply! No worries at all!

Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.

JackKelly avatar Oct 11 '21 16:10 JackKelly

Thanks loads for the reply! No worries at all!

Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.

Great! will keep this in mind for when we get there.

Also, once pandera schemas can be used as valid pydantic types (i.e. once https://github.com/pandera-dev/pandera/issues/453 is supported), the solution you outline here would be pretty straightforward to port over to pandera, making for a pretty concise schema definition... I'm imagining a user API like:

from typing import Optional

import pandera as pa
import pydantic

class ImageDataset(pa.SchemaModel):
    # DataArray and NDField are hypothetical types for future xarray support
    data: DataArray[int] = NDField(dims=("time", "x", "y"))
    x_coords: Optional[DataArray[int]] = NDField(dims=("index",))
    y_coords: Optional[DataArray[int]] = NDField(dims=("index",))


class Example(pydantic.BaseModel):
    """A single machine learning training example."""
    satellite: Optional[ImageDataset]
    nwp: Optional[ImageDataset]

cosmicBboy avatar Oct 11 '21 16:10 cosmicBboy

That looks absolutely perfect, thank you!

JackKelly avatar Oct 12 '21 06:10 JackKelly

Hi all. I wanted to share a little experiment we've been playing with, xarray-schema, which provides schema validation logic for Xarray objects. We've been following this thread closely and we're looking at ways to integrate what we've done with pandera/pydantic.

jhamman avatar Dec 06 '21 04:12 jhamman

wow @jhamman this looks amazing! I'd love to integrate, do you want to find a time to chat? https://calendly.com/niels-bantilan/30min

Also feel free to join the discord community if you want to discuss further there: https://discord.gg/vyanhWuaKB

cosmicBboy avatar Dec 06 '21 14:12 cosmicBboy

@jhamman made this issue https://github.com/pandera-dev/pandera/issues/705 to track the xarray integration.

I'm planning on making a PR for this issue (#381) by end of year to make the xarray-schema integration as smooth as possible.

cosmicBboy avatar Dec 11 '21 15:12 cosmicBboy

Thanks for your email, Niels. PETL allows one to process tables of data. It has several differences from, and some advantages over, Pandas:

  • The data storage is a lot more straightforward - no indices, regular Python objects (no Pandas-specific dtypes)
  • As a result, the code is much more predictable. Pandas is often quirky and leads to silent failures that are hard to predict. In comparison, I get things right the first time with PETL nearly 100% of the time (and I have solid experience with Pandas).
  • PETL allows you to keep only a portion of the dataframe in memory.
  • PETL is row-based, not column-based, so depending on the operation, some processing available in Pandas is not available. In-row and near-row operations are still possible, though.
  • PETL is lazily evaluated by default: it's only at the point of producing the output that the data is pulled through its processing pipeline. This has advantages - a small memory footprint - but also disadvantages - e.g. using a closure may sometimes have difficult-to-predict behavior because it actually gets executed well after its point of definition.

Overall, I think for 90% of the processing I've seen done in Pandas, PETL is a better choice. For the remaining 10%, Pandas is needed, used more like NumPy.

Having schemas for PETL would be awesome. Supporting it should be much easier than Pandas - as I mentioned, it doesn't define custom data types, and the data representation model is really straightforward: lists of (lists or tuples) of arbitrary Python objects.
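
In the meantime, as a stopgap (not the engine-based support discussed in this issue), a petl table can already be validated by round-tripping through pandas:

# stopgap sketch: validate a petl table by converting it to pandas first.
import pandera as pa
import petl as etl

schema = pa.DataFrameSchema({
    "name": pa.Column(str),
    "age": pa.Column(int, pa.Check.ge(0)),
})

# a small petl table built from columns plus a header
table = etl.fromcolumns([["alice", "bob"], [30, 25]], header=["name", "age"])

# petl.todataframe converts the table to a pandas DataFrame for validation
validated = schema.validate(etl.todataframe(table))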

blais avatar Feb 01 '22 01:02 blais

What would be required to ensure we can add a GeoDataFrame type from GeoPandas with a Pydantic BaseModel? I am thinking it may not be as complex as support for spark/dask and new interfaces. If someone could point me in the right direction I could work on a PR.

I would like to do:

import pandera as pa
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
from pandera.typing import Series
from typing import Optional
import pydantic
from shapely.geometry import Polygon

class BaseGeoDataFrameSchema(pa.SchemaModel):
    geometry: GeoSeries
    properties: Optional[Series[str]]

class Inputs(pydantic.BaseModel):
    gdf: GeoDataFrame[BaseGeoDataFrameSchema]
   # TypeError: Fields of type "<class 'pandera.typing.geopandas.GeoDataFrame'>" are not supported.

gdf = GeoDataFrame[BaseGeoDataFrameSchema]({"geometry": [Polygon(((0, 0), (0, 1), (1, 1), (1, 0)))], "extra": [1]}, crs=4326)
validated_inputs = Inputs(gdf=gdf)

andretheronsa avatar Apr 12 '22 07:04 andretheronsa

hi all, pinging this issue to point everyone to this PR: https://github.com/unionai-oss/pandera/pull/913

It's a WIP PR for laying the groundwork for improving the extensibility of pandera's abstractions. I'd very much appreciate people's feedback on this, nothing is set in stone yet!

I'll be adding additional details to the PR description in the next few days, but for now it outlines the main changes at a high level. Please chime in with your thoughts/comments!

cosmicBboy avatar Aug 12 '22 19:08 cosmicBboy