pandera
WhyLogs integration: support stateful pandera schemas backed by whylogs profiles
Is your feature request related to a problem? Please describe.
Currently, pandera schemas are stateless: the schema only validates based on rules that are fully defined in code.
This is great, but it closes off many use cases that rely on the data, or aggregates of the data, that pass through a particular checkpoint in a user's data processing pipeline.
Describe the solution you'd like
With whylogs, you can aggregate data into profiles in batch or streaming fashion (e.g. the mean value of a column in a dataframe). pandera can then apply validation rules both to the actual data flowing through the pipeline and to the data profile that whylogs produces, which could potentially span all of the data that has passed through a particular checkpoint.
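To make the idea concrete, here is a rough sketch of what "stateful" validation could look like. It deliberately uses plain Python only and does not call the pandera or whylogs APIs: `ColumnProfile` and `StatefulCheck` are hypothetical names made up for illustration, and the running mean stands in for whatever a real whylogs profile would track.

```python
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Hypothetical mergeable summary of one numeric column (not a whylogs class)."""

    count: int = 0
    total: float = 0.0

    def track(self, values):
        # Fold a new batch into the running aggregate.
        self.count += len(values)
        self.total += sum(values)

    @property
    def mean(self):
        return self.total / self.count if self.count else None


class StatefulCheck:
    """Hypothetical checkpoint: a stateless row rule plus a rule on the profile."""

    def __init__(self, row_rule, profile_rule):
        self.row_rule = row_rule
        self.profile_rule = profile_rule
        self.profile = ColumnProfile()

    def validate(self, values):
        # Stateless part: every value in this batch must satisfy the row rule.
        bad = [v for v in values if not self.row_rule(v)]
        if bad:
            raise ValueError(f"row-level check failed for values: {bad}")
        # Stateful part: build a candidate profile that includes this batch,
        # and only commit it if the profile-level rule still holds.
        candidate = ColumnProfile(self.profile.count, self.profile.total)
        candidate.track(values)
        if not self.profile_rule(candidate):
            raise ValueError(f"profile-level check failed: mean={candidate.mean}")
        self.profile = candidate
        return values


check = StatefulCheck(
    row_rule=lambda v: v >= 0,
    profile_rule=lambda p: p.mean is None or 10 <= p.mean <= 20,
)
check.validate([12.0, 14.0, 16.0])  # running mean is now 14.0
check.validate([18.0, 20.0])        # running mean is now 16.0
```

A third batch such as `[40.0, 40.0]` would pass the row-level rule but fail the profile-level one, because the mean over everything seen at the checkpoint would leave the 10 to 20 range. That is the kind of check a purely stateless schema cannot express.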