Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy
Is your feature request related to a problem? Please describe.
Currently, strategies are limited by the hypothesis.extra.pandas convention of how to define a dataframe. Namely, the strategies used to generate data values operate at the element level. This makes it hard to create strategies for a whole column or those that model the dependencies between columns.
For previous context on the problem with strategies, see #1605, #1220, #1275.
Describe the solution you'd like
We need a re-write! 🔥
As described in #1605, the requirements for a pandera pandas strategy rewrite are:
- Strategies that work for all pandera schemas (this is a really high bar, but I think possible), with reasonable escape hatches when pandera cannot automatically figure out how to generate a df.
- Generating entire columns instead of individual elements
- Incorporating cross-column dependencies
- A user-friendly way of overriding strategies (from pre-existing Checks) or defining custom strategies
- Columns with multiple checks should not chain strategies with filter; each additional check should perhaps override the data with the new constraint instead.
- ... (others?)
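To make the column-level and cross-column requirements concrete, here is a minimal sketch using only the standard library. It is not pandera or hypothesis code: `ColumnStrategy`, `generate_frame`, `start_day`, and `end_day` are hypothetical names, and a "strategy" is modelled as a plain function that returns a whole column and can read the columns generated before it.

```python
import random
from typing import Callable, Dict, List

# Hypothetical type: a column strategy receives the random source, the number
# of rows, and the columns generated so far, and returns a whole column.
ColumnStrategy = Callable[[random.Random, int, Dict[str, List]], List]


def generate_frame(
    strategies: Dict[str, ColumnStrategy], size: int, seed: int = 0
) -> Dict[str, List]:
    """Generate columns in declaration order so later ones can depend on earlier ones."""
    rng = random.Random(seed)
    frame: Dict[str, List] = {}
    for name, strategy in strategies.items():
        frame[name] = strategy(rng, size, frame)
    return frame


def start_day(rng: random.Random, size: int, frame: Dict[str, List]) -> List[int]:
    # Independent column: generated as a whole, not element by element.
    return [rng.randint(0, 100) for _ in range(size)]


def end_day(rng: random.Random, size: int, frame: Dict[str, List]) -> List[int]:
    # Dependent column: every end day is on or after the matching start day.
    return [start + rng.randint(0, 30) for start in frame["start_day"]]


frame = generate_frame({"start_day": start_day, "end_day": end_day}, size=5)
```

Because `end_day` is built from `start_day`, the cross-column constraint holds by construction and nothing has to be filtered out afterwards.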
More context on the current state
At a high level, this is how pandera currently translates a schema to a hypothesis strategy:
- For each column, index, obtain the following metadata:
- Column name, datatype, and checks
- If the column name is a regex expression, generate column names based on the regex
- Define a hypothesis column. This contains the datatypes, elements, and other properties of the column.
- Based on the pa.Column dtype, properties (e.g. unique), and first check in the list of checks, forward them to the hypothesis column. This creates an element strategy for a single value in that column.
- For any subsequent Check in the list, get their check stats (constraint values) and chain them to the element strategy with filter (this really sucks, i.e. slows down performance).
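The cost of chaining filters can be illustrated without hypothesis at all. The sketch below is a standard-library stand-in, not pandera's implementation: `draw_with_filters` mimics filter chaining as rejection sampling, while `merge_bounds` (a hypothetical helper) folds the constraint values of several checks into a single range so a value can be drawn directly.

```python
import random
from typing import Callable, List, Sequence, Tuple


def draw_with_filters(
    rng: random.Random,
    low: int,
    high: int,
    predicates: Sequence[Callable[[int], bool]],
    max_tries: int = 10_000,
) -> Tuple[int, int]:
    """Rejection sampling: redraw from the base range until every predicate passes."""
    for tries in range(1, max_tries + 1):
        value = rng.randint(low, high)
        if all(predicate(value) for predicate in predicates):
            return value, tries
    raise RuntimeError("no value satisfied all filters")


def merge_bounds(
    low: int, high: int, constraints: List[Tuple[str, int]]
) -> Tuple[int, int]:
    """Fold ("ge", n) / ("le", n) constraints into one inclusive integer range."""
    for kind, bound in constraints:
        if kind == "ge":
            low = max(low, bound)
        elif kind == "le":
            high = min(high, bound)
        else:
            raise ValueError(f"unsupported constraint: {kind}")
    if low > high:
        raise ValueError("constraints are unsatisfiable")
    return low, high


rng = random.Random(0)

# Filter chaining: most draws from [-1000, 1000] are thrown away.
value, tries = draw_with_filters(
    rng, -1000, 1000, [lambda x: x >= 0, lambda x: x <= 10]
)

# Constraint merging: the same two checks become one range, drawn in one try.
bounds = merge_bounds(-1000, 1000, [("ge", 0), ("le", 10)])
direct = rng.randint(*bounds)
```

In the filtered version only about 11 of 2001 base values are acceptable, so most draws are wasted; the merged version never rejects a draw.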
Following up on the discussion in https://github.com/pandera-dev/pandera/discussions/648.
There are often use cases where it would be useful to override the base strategy for a column. The following hypothesis strategies clearly express the shape of data, but cannot be easily represented using the pandera check API.
import string
import hypothesis.strategies as st

# uuids
st.uuids().map(str)
# dictionaries
st.fixed_dictionaries(
    {
        'symbol': st.text(string.ascii_uppercase),
        'cusip': st.text(string.ascii_uppercase + string.digits),
    },
)
A workaround described in https://github.com/pandera-dev/pandera/discussions/648 uses custom check methods to store a strategy override for later use and accesses it during strategy generation in a subclass of pandera.DataFrameSchema. This approach does not support column checks, as the entire column strategy is replaced by the strategy specified in the field.
As suggested by @cosmicBboy in https://github.com/pandera-dev/pandera/discussions/648, first-class support for this use case could be added by adding a strategy or base_strategy parameter to pandera.Field and passing this user-provided strategy to the field_element_strategy method. field_element_strategy would need to be updated to support passing a base strategy, rather than creating the base strategy by looking at the column's dtype.
This would allow for the following schema specification, while still supporting additional checks on the column (unlike the workaround described above).
class Schema(SchemaModel):
    uuids: Series[object] = pa.Field(strategy=st.uuids().map(str))
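The intended behaviour (a user-supplied base strategy that replaces the dtype-derived one, with checks still applied on top) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `element_strategy_sketch`, `dtype_default`, and `uuid_strings` are hypothetical names, not pandera's field_element_strategy or any hypothesis API, and a "strategy" is modelled as a function that draws one value from a `random.Random`.

```python
import random
import uuid
from typing import Callable, Optional, Sequence

Draw = Callable[[random.Random], object]


def dtype_default(dtype: type) -> Draw:
    """Fallback base strategy chosen by looking only at the dtype."""
    if dtype is int:
        return lambda rng: rng.randint(-100, 100)
    if dtype is str:
        return lambda rng: "".join(rng.choice("abc") for _ in range(rng.randint(0, 5)))
    raise TypeError(f"no default strategy for {dtype!r}")


def element_strategy_sketch(
    dtype: type,
    checks: Sequence[Callable[[object], bool]] = (),
    base_strategy: Optional[Draw] = None,
    max_tries: int = 1000,
) -> Draw:
    """Use the user-provided base strategy if given, then still apply the checks."""
    base = base_strategy if base_strategy is not None else dtype_default(dtype)

    def draw(rng: random.Random) -> object:
        for _ in range(max_tries):
            value = base(rng)
            if all(check(value) for check in checks):
                return value
        raise RuntimeError("base strategy never satisfied the checks")

    return draw


def uuid_strings(rng: random.Random) -> str:
    # Stand-in for st.uuids().map(str): a random version-4 UUID as a string.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


strategy = element_strategy_sketch(
    object, checks=[lambda s: len(s) == 36], base_strategy=uuid_strings
)
value = strategy(random.Random(0))
```

The dtype `object` has no default here, which is the point: the override makes the column generatable, and the length check is still enforced rather than being discarded as in the workaround.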
Following from https://github.com/unionai-oss/pandera/discussions/1088
Perhaps not exactly what you had in mind, but here is a rather simple brute-force approach: create a strategy with hypothesis that generates the whole dataframe, and feed it into the schema as the one to use to generate examples.
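A rough sketch of that idea, using only the standard library so it stays self-contained. Nothing here is pandera's API: `SchemaSketch`, its `frame_strategy` argument, and the `trades` generator are hypothetical, and the dataframe is modelled as a dict of columns.

```python
import random
from typing import Callable, Dict, List, Optional

Frame = Dict[str, List]
FrameStrategy = Callable[[random.Random, int], Frame]


class SchemaSketch:
    """Hypothetical stand-in for a schema that accepts a whole-frame strategy."""

    def __init__(
        self,
        checks: Dict[str, Callable[[object], bool]],
        frame_strategy: Optional[FrameStrategy] = None,
    ) -> None:
        self.checks = checks
        self.frame_strategy = frame_strategy

    def validate(self, frame: Frame) -> Frame:
        for column, check in self.checks.items():
            if not all(check(value) for value in frame[column]):
                raise ValueError(f"column {column!r} failed its check")
        return frame

    def example(self, size: int, seed: int = 0) -> Frame:
        # The override replaces automatic generation entirely; the generated
        # frame is still validated against the schema's checks.
        if self.frame_strategy is None:
            raise NotImplementedError("automatic generation is out of scope here")
        return self.validate(self.frame_strategy(random.Random(seed), size))


def trades(rng: random.Random, size: int) -> Frame:
    price = [round(rng.uniform(1, 100), 2) for _ in range(size)]
    quantity = [rng.randint(1, 10) for _ in range(size)]
    # Cross-column dependency that element-level strategies cannot express.
    notional = [p * q for p, q in zip(price, quantity)]
    return {"price": price, "quantity": quantity, "notional": notional}


schema = SchemaSketch(
    checks={"price": lambda v: v > 0, "quantity": lambda v: v >= 1},
    frame_strategy=trades,
)
example = schema.example(size=4)
```

Validating the generated frame keeps the override honest: a user-supplied strategy that violates the schema fails loudly instead of silently producing invalid examples.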
What do you think of this, @cosmicBboy? It could be something relatively simple to implement (if it fits your design choices). If so, I volunteer to create a PR for this.