Abstract out validation logic to support non-pandas dataframes, e.g. spark, dask, etc
Is your feature request related to a problem? Please describe.
Extending pandera to non-pandas dataframe-like structures is a challenge today because the schema and schema component class definitions are strongly coupled to the pandas API. For example, the `DataFrameSchema.validate` method assumes that validated objects follow the pandas API.
Potential Solutions
1. Abstract out the core pandera interface into `Schema`, `SchemaComponent`, and `Check` abstract base classes so that core and third-party pandera schemas can be easily developed on top of it. Subclasses of these base classes would implement the validation logic for a specific library, e.g. `SparkSchema`, `PandasSchema`, etc.
2. Provide a validation engine interface where core and third-party developers can register and use different validation backends depending on the type of dataframe implementation (e.g. pandas, spark, dask, etc.) being used, similar to the proposal in #369. The public-facing API won't change: different dataframe types would be validated via different (non-mutually exclusive) approaches:
   - at runtime, pandera delegates to the appropriate engine based on the type of `obj` when `schema.validate(obj)` is called (a sketch of this dispatch idea follows this list).
   - add an `engine: str` option to explicitly specify which engine to use (q: should this be in `__init__` or `validate`, or both?)
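To make the runtime-dispatch option concrete, here's a minimal, hypothetical sketch (the registry and function names below are illustrative, not existing pandera API):

```python
# illustrative sketch of dispatching to a validation engine based on the
# type of the object passed to schema.validate(obj)
_ENGINE_REGISTRY = {}


def register_engine(df_type, engine_cls):
    """Associate a dataframe type with the engine that validates it."""
    _ENGINE_REGISTRY[df_type] = engine_cls


def get_engine(obj):
    """Look up the registered engine for the type of the validated object."""
    for df_type, engine_cls in _ENGINE_REGISTRY.items():
        if isinstance(obj, df_type):
            return engine_cls()
    raise TypeError(f"no validation engine registered for {type(obj)}")


# inside a hypothetical DataFrameSchema.validate:
#     engine = get_engine(obj)
#     return engine.validate(self, obj)
```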
Describe the solution you'd like
Because this is quite a momentous change in pandera's scope (supporting not just pandas dataframes), I'll first reiterate the design philosophy of pandera:
- minimize the proliferation of classes in the public-facing API
- the schema-definition interface should be isomorphic to the data structure being validated, i.e. defining a dataframe schema should feel like defining a dataframe
- prioritize flexibility/expressiveness of validation functions, and add built-ins for common checks (based on feature parity with other similar schema libraries, or by popular request)
In keeping with these principles, I propose going with solution (2), in order to prevent an increase in the complexity and surface area of the user-facing API (`DaskSchema`, `PandasSchema`, `SparkSchema`, `VaexSchema`, etc.).
edit: Actually, with solution (1), one approach that would keep the API surface area small is to use a subpackage pattern that replicates the pandera interface but with the alternative backend:

```python
import pandera.spark as pa

spark_schema = pa.DataFrameSchema({...})

class SparkSchema(pa.SchemaModel):
    ...
```

Etc...

```python
import pandera.dask
import pandera.modin
```
Will need to think through the pros and cons of 1 vs 2 some more...
Re: data synthesis strategies, which are used purely for testing and not meant to generate massive amounts of data, we could just fall back on pandas and convert the synthesized data to the corresponding dataframe type, assuming the df library supports this, e.g. `spark.createDataFrame`.
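As a rough illustration of the pandas-fallback idea (assuming pyspark is installed and the schema's types have hypothesis strategies), synthesis could stay in pandas and only the final conversion would be backend-specific:

```python
# minimal sketch: synthesize with the existing pandas-based strategies,
# then convert to the target dataframe type (spark, in this example)
import pandera as pa
from pyspark.sql import SparkSession

schema = pa.DataFrameSchema({"x": pa.Column(int, pa.Check.ge(0))})

pandas_df = schema.example(size=10)  # uses the existing strategies module

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pandas_df)  # backend-specific conversion
```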
Initial Thoughts
Currently, the schema and check classes conflate the specification of schema properties with the validation of those properties on some data. We may want to separate these two concerns.
- `DataFrameSchema` collects all column types and checks and does some basic schema validations to make sure the specification is valid (raises `SchemaInitError` if invalid).
- `DataFrameSchema.validate` should delegate the validation of some input data to a `ValidationEngine`. The validation engine performs the following operations:
  - checks strictness criteria, i.e. only columns specified in the schema are in the dataframe (optional)
  - checks dataframe column order against schema column order (optional)
  - coerces columns to the types specified (optional)
  - expands schema regex columns based on dataframe columns
  - runs schema component (column/index) checks:
    - checks for nulls (optional)
    - checks for duplicates (optional)
    - checks the datatype
    - runs `Check` validations
  - runs dataframe-level checks
- `_CheckBase` needs to delegate the implementation of `groupby`, `element_wise`, `agg`, and potentially other modifiers (see here) to the underlying dataframe library via `ValidationEngine`.
- the `ValidationEngine` would also have to supply implementations for the built-in `Check`s. This can happen incrementally such that an error is raised if the implementation isn't done for a particular dataframe library.
- the `strategies` module needs to be extended to support other dataframe types. Since `hypothesis` supports numpy and pandas, it makes sense to use the existing strategies logic to generate a pandas dataframe and convert it to some other desired format (e.g. koalas, modin, dask, etc.) and see how far that gets us.
Here's a high-level sketch of the API:
```python
# pandera contributor to codebase or custom third-party engine
class MySpecialDataFrameValidationEngine(ValidationEngine):
    # implement a bunch of stuff
    ...

register_validation_engine(MySpecialDataFrameValidationEngine)

# end-user interaction, with hypothetical special_dataframe package.
from special_dataframe import MySpecialDataFrame

special_df = MySpecialDataFrame(...)
schema = pa.DataFrameSchema({...})
schema.validate(special_df)
```
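To connect this sketch with the list of operations above, a hypothetical `ValidationEngine` base class might expose one method per delegated operation (the method names below are illustrative, not an agreed-upon pandera interface):

```python
# illustrative sketch only: method names mirror the operations listed above
from abc import ABC, abstractmethod


class ValidationEngine(ABC):
    @abstractmethod
    def coerce_dtype(self, schema_component, data):
        """Coerce a column/index to the dtype declared in the schema."""

    @abstractmethod
    def check_nulls(self, schema_component, data):
        """Verify nullability constraints for a column/index."""

    @abstractmethod
    def check_duplicates(self, schema_component, data):
        """Verify uniqueness constraints for a column/index."""

    @abstractmethod
    def check_dtype(self, schema_component, data):
        """Verify that the data matches the declared datatype."""

    @abstractmethod
    def run_check(self, check, data):
        """Run a Check (including groupby/element_wise/agg modifiers)."""
```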
> - checks strictness criteria, i.e. only columns specified in schema are in the dataframe (optional)
> - checks dataframe column order against schema columns order (optional)
> - expands schema regex columns based on dataframe columns

I think those operations can be handled by `DataFrameSchema`, provided that the engine exposes `get_columns(df)`/`set_columns(df)`. "columns" here refers to a list of `pandera.Column`.
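For example (purely hypothetical helper names), strictness checking could then live in `DataFrameSchema` itself and stay backend-agnostic:

```python
# hypothetical sketch: DataFrameSchema only needs the engine to report the
# dataframe's column names to enforce strictness in a backend-agnostic way
def check_strict(schema_columns, engine, df):
    df_columns = engine.get_columns(df)  # assumed engine hook
    extra = set(df_columns) - set(schema_columns)
    if extra:
        raise ValueError(f"columns not in schema: {sorted(extra)}")
```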
edit: It occurred to me that this idea may be too restrictive for multi-dimensional dataframes (like xarray), unless `DataFrameSchema` knows about multiple dimensions.
We could merge the idea of `Backend` outlined in #369 with `ValidationEngine`. That would add the responsibility of registering dtypes.
Question: What to do with `pandera.Index`? Most DataFrame libraries don't have this concept. If we want to minimize the proliferation of classes in the public-facing API, which I totally agree with, we need to keep `set_index()`/`reset_index()` on `DataFrameSchema` but raise an error if the engine does not support it.
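One way to express that is to keep the method but fail loudly when the engine can't support it (a simplified, hypothetical sketch; `supports_index` is an assumed engine attribute, not existing pandera API):

```python
# hypothetical sketch: DataFrameSchema keeps set_index()/reset_index(), but the
# engine advertises whether it supports index semantics at all
class DataFrameSchema:
    def __init__(self, columns, engine):
        self.columns = columns
        self.engine = engine

    def set_index(self, keys):
        if not getattr(self.engine, "supports_index", False):  # assumed flag
            raise NotImplementedError(
                f"{type(self.engine).__name__} does not support indexes"
            )
        # otherwise, record the index specification on the schema
        ...
```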
Any ETA on Modin support?
hey @crypdick once #504 is merged (should be in the next few days) I'm going to tackle this issue.
The plan right now is to make a `ValidationEngine` base class and a `PandasValidationEngine` with native support for pandas, modin, and koalas.
I've done a little bit of prototyping of the new validation engine, but it still needs a bunch of work... I'm going to push for a finished solution before scipy conf this year, so ETA mid-July?
Went through the discussion and we'd certainly be interested in contributing a Fugue `ValidationEngine`. We'll keep an eye out for the `PandasValidationEngine` and the koalas/modin support and see if Fugue has direct mappings to the implementation you arrive at!
Hi, I was just wondering if it's possible to use pandera to define schemas for n-dimensional numpy arrays, and hence to use pandera with `xarray.DataArray` objects, just as pandera is currently used for `pandas.DataFrame` objects?
@JackKelly I'd love to add support for numpy+xarray, but unfortunately it's currently not possible.
After this PR is merged (still WIP) we'll have a much better interface for extending pandera to other non-pandas data structures; numpy and xarray would be natural to support in pandera.
Out of curiosity (looking at https://github.com/openclimatefix/nowcasting_dataset/issues/211) is your primary use-case to check data types and dimensions of xarray objects?
Thanks loads for the reply! No worries at all!
Yes, our primary use-case is to check the data type, dimensions, and values of xarray Datasets and DataArrays.
Great! will keep this in mind for when we get there.
Also, once pandera schemas can be used as valid pydantic types, i.e. once https://github.com/pandera-dev/pandera/issues/453 is supported, the solution you outline here would be pretty straightforward to port over to pandera, making for a pretty concise schema definition... I'm imagining a user API like:
```python
import pandera as pa
import pydantic


# DataArray, NDField, and Optional below are part of the imagined API sketch
class ImageDataset(pa.SchemaModel):
    data: DataArray[int] = NDField(dims=("time", "x", "y"))
    x_coords: Optional[DataArray[int]] = NDField(dims=("index",))
    y_coords: Optional[DataArray[int]] = NDField(dims=("index",))


class Example(pydantic.BaseModel):
    """A single machine learning training example."""

    satellite: Optional[ImageDataset]
    nwp: Optional[ImageDataset]
```
That looks absolutely perfect, thank you!
Hi all. I wanted to share a little experiment we've been playing with, xarray-schema, which provides schema validation logic for Xarray objects. We've been following this thread closely and we're looking at ways to integrate what we've done with pandera/pydantic.
wow @jhamman this looks amazing! I'd love to integrate, do you want to find a time to chat? https://calendly.com/niels-bantilan/30min
Also feel free to join the discord community if you want to discuss further there: https://discord.gg/vyanhWuaKB
@jhamman made this issue https://github.com/pandera-dev/pandera/issues/705 to track the xarray integration.
I'm planning on making a PR for this issue (#381) by end of year to make the xarray-schema integration as smooth as possible.
Thanks for your email Niels. PETL allows one to process tables of data. It involves several differences and some advantages over Pandas:
- The data storage is a lot more straightforward - no indices, regular Python objects (no Pandas-specific dtypes)
- As a result, the code is much more predictable. Pandas is often quirky and leads to silent failures that are hard to predict. In comparison, I get things right the first time with PETL nearly 100% of the time (and I have solid experience with Pandas).
- PETL allows you to keep only a portion of the dataframe in memory.
- PETL is row-based, not column-based, so depending on the operation, some of the processing available in Pandas is not available in PETL. In-row and near-row operations are still possible though.
- PETL is lazily evaluated by default; it's only at the point of producing the output that the data is pulled through its processing pipeline. This has advantages (small memory footprint) but also disadvantages, e.g. using a closure may sometimes have difficult-to-predict behavior because it actually gets executed well after its point of definition.
Overall, I think for 90% of the processing I've seen done in Pandas, PETL is a better choice; for the remaining 10%, Pandas is needed, much as NumPy is.
Having schemas for PETL would be awesome. Supporting it should be much easier than Pandas: as I mentioned, it doesn't define custom data types, and the data representation model is really straightforward: lists of (lists or tuples) of arbitrary Python objects.
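For reference, a small sketch of that data model (assuming the `petl` package is installed; `etl.header` and `etl.values` are standard petl helpers):

```python
# a petl "table" is simply an iterable of rows; the first row is the header
import petl as etl

table = [
    ["name", "value"],
    ["a", 1],
    ["b", 2],
]

print(etl.header(table))                 # ('name', 'value')
print(list(etl.values(table, "value")))  # [1, 2]
```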
What would be required to ensure we can add a GeoDataFrame type from GeoPandas with a Pydantic BaseModel? I am thinking it may not be as complex as support for spark/dask and new interfaces. If someone could point me in the right direction I could work on a PR.
I would like to do:
```python
from typing import Optional

import pandera as pa
from pandera.typing.geopandas import GeoDataFrame, GeoSeries
from pandera.typing import Series
import pydantic
from shapely.geometry import Polygon


class BaseGeoDataFrameSchema(pa.SchemaModel):
    geometry: GeoSeries
    properties: Optional[Series[str]]


class Inputs(pydantic.BaseModel):
    gdf: GeoDataFrame[BaseGeoDataFrameSchema]
    # TypeError: Fields of type "<class 'pandera.typing.geopandas.GeoDataFrame'>" are not supported.


gdf = GeoDataFrame[BaseGeoDataFrameSchema](
    {"geometry": [Polygon(((0, 0), (0, 1), (1, 1), (1, 0)))], "extra": [1]}, crs=4326
)
validated_inputs = Inputs(gdf=gdf)
```
hi all, pinging this issue to point everyone to this PR: https://github.com/unionai-oss/pandera/pull/913
It's a WIP PR laying the groundwork for improving the extensibility of pandera's abstractions. I'd very much appreciate people's feedback on this; nothing is set in stone yet!
I'll be adding additional details to the PR description in the next few days, but for now it outlines the main changes at a high level. Please chime in with your thoughts/comments!