Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy

Open cosmicBboy opened this issue 3 years ago • 4 comments

Is your feature request related to a problem? Please describe.

Currently, strategies are limited by the hypothesis.extra.pandas convention for defining a dataframe: the strategy used to generate data values operates at the element level. This makes it hard to create strategies for a whole column, or strategies that model dependencies between columns.

For previous context on the problem with strategies, see #1605, #1220, #1275.

Describe the solution you'd like

We need a re-write! 🔥

As described in #1605, the requirements for a pandera pandas strategy rewrite are:

  • Strategies that work for all pandera schemas (this is a really high bar, but I think possible), with reasonable escape hatches when pandera cannot automatically figure out how to generate a df.
  • Generating entire columns instead of individual elements
  • Incorporating cross-column dependencies
  • A user-friendly way of overriding strategies (from pre-existing Checks) or supplying custom strategies
  • Columns with multiple checks should not chain strategies with filter; instead, each additional check should maybe override the data with the new constraint.
  • ... (others?)

More context on the current state

At a high level, this is how pandera currently translates a schema to a hypothesis strategy:

  • For each column or index, obtain the following metadata:
    • Column name, datatype, and checks
  • If the column name is a regex pattern, generate column names based on the regex
  • Define a hypothesis column. This contains the datatype, elements, and other properties of the column.
  • Forward the pa.Column dtype, properties (e.g. unique), and the first check in the list of checks to the hypothesis column. This creates an element strategy for a single value in that column.
  • For any subsequent Check in the list, get its check stats (constraint values) and chain them to the element strategy with filter (this really sucks, i.e. it slows down performance).
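To make the filter problem concrete, here is a minimal sketch using only hypothesis and pandas (it is not pandera's actual implementation). The column names, bounds, and the `check_frames` helper are made up for illustration. The `chained` strategy mimics the current behaviour, where later constraints are applied by rejection, while `merged` folds the same constraints into the generator so nothing is discarded.

```python
import hypothesis.strategies as st
from hypothesis import HealthCheck, given, settings
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Element-level strategy: the first constraint sets the bounds, and every
# later constraint is chained on with filter (rejection sampling).
chained = (
    st.integers(min_value=0, max_value=120)
    .filter(lambda x: x <= 100)    # second constraint, applied by rejection
    .filter(lambda x: x % 2 == 0)  # third constraint, applied by rejection
)

# Alternative: express all three constraints in the generator itself,
# so no drawn value is ever thrown away.
merged = st.integers(min_value=0, max_value=50).map(lambda x: x * 2)

frames = data_frames(
    columns=[
        column("chained", elements=chained, dtype=int),
        column("merged", elements=merged, dtype=int),
    ],
    index=range_indexes(max_size=5),
)


# Health checks are suppressed because the chained filters reject many draws.
@settings(max_examples=25, deadline=None, suppress_health_check=list(HealthCheck))
@given(frames)
def check_frames(df):
    for name in ("chained", "merged"):
        assert df[name].between(0, 100).all()
        assert (df[name] % 2 == 0).all()


check_frames()
```

Both columns satisfy the same constraints, but the chained version needs the health checks silenced precisely because so many draws are rejected, which is the slowdown described above.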

cosmicBboy, Jul 15 '21 14:07

Following up on the discussion in https://github.com/pandera-dev/pandera/discussions/648.

There are often use cases where it would be useful to override the base strategy for a column. The following hypothesis strategies clearly express the shape of data, but cannot be easily represented using the pandera check API.

import string

import hypothesis.strategies as st

# UUIDs rendered as strings
st.uuids().map(str)

# dictionaries with a fixed set of keys
st.fixed_dictionaries(
    {
        'symbol': st.text(string.ascii_uppercase),
        'cusip': st.text(string.ascii_uppercase + string.digits),
    },
)

A workaround described in https://github.com/pandera-dev/pandera/discussions/648 uses custom check methods to store a strategy override for later use and accesses it during strategy generation in a subclass of pandera.DataFrameSchema. This approach does not support column checks as the entire column strategy is replaced by the strategy specified in the field.

As suggested by @cosmicBboy in https://github.com/pandera-dev/pandera/discussions/648, first-class support for this use case could be provided by adding a strategy or base_strategy parameter to pandera.Field and passing the user-provided strategy to the field_element_strategy method. field_element_strategy would need to be updated to accept a base strategy, rather than always creating the base strategy from the column's dtype.

This would allow for the following schema specification, while still supporting additional checks on the column (unlike the workaround described above).

class Schema(SchemaModel):
    uuids: Series[object] = pa.Field(strategy=st.uuids().map(str))
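To illustrate the idea of layering checks on top of a user-provided base strategy, here is a rough sketch in plain hypothesis. `element_strategy` is a hypothetical helper written for this example, not a pandera function, and the single check predicate is made up.

```python
import hypothesis.strategies as st
from hypothesis import given, settings


def element_strategy(default, base_strategy=None, checks=()):
    """Hypothetical helper, not pandera API.

    Start from the user-supplied base strategy if there is one, otherwise
    from the dtype-derived default, then layer each check predicate on top.
    """
    strategy = default if base_strategy is None else base_strategy
    for predicate in checks:
        strategy = strategy.filter(predicate)
    return strategy


uuid_strings = element_strategy(
    default=st.text(),                          # what a str dtype might default to
    base_strategy=st.uuids().map(str),          # the user override
    checks=[lambda s: not s.startswith("0")],   # an extra check still applies
)


@settings(max_examples=25, deadline=None)
@given(uuid_strings)
def check_uuid_strings(value):
    assert len(value) == 36
    assert value.count("-") == 4
    assert not value.startswith("0")


check_uuid_strings()
```

This sketch still applies the extra checks with filter, so it only addresses the override part of the request, not the filter-chaining performance issue.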

bphillips-exos, Oct 30 '21 14:10

Following from https://github.com/unionai-oss/pandera/discussions/1088

Perhaps not exactly what you had in mind, but here is a rather simple brute-force approach: create a hypothesis strategy that generates the whole dataframe, and pass it to the schema as the strategy to use when generating examples.
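As a sketch of what such a whole-dataframe strategy could look like, the example below uses `st.composite` to build a dataframe with a cross-column dependency. The column names `ordered` and `shipped`, the bounds, and the `order_frames` function are invented for illustration. How the strategy would be handed to a schema is left out, since that API does not exist yet.

```python
import hypothesis.strategies as st
import pandas as pd
from hypothesis import given, settings


@st.composite
def order_frames(draw, max_rows=10):
    """Generate a whole DataFrame in which `shipped` is never before `ordered`."""
    n_rows = draw(st.integers(min_value=0, max_value=max_rows))
    ordered = draw(
        st.lists(st.integers(min_value=0, max_value=1_000),
                 min_size=n_rows, max_size=n_rows)
    )
    delays = draw(
        st.lists(st.integers(min_value=0, max_value=30),
                 min_size=n_rows, max_size=n_rows)
    )
    # The second column is derived from the first, so the constraint
    # holds by construction and nothing has to be filtered out.
    shipped = [day + delay for day, delay in zip(ordered, delays)]
    return pd.DataFrame({"ordered": ordered, "shipped": shipped}, dtype="int64")


@settings(max_examples=25, deadline=None)
@given(order_frames())
def check_order_frames(df):
    assert list(df.columns) == ["ordered", "shipped"]
    assert (df["shipped"] >= df["ordered"]).all()


check_order_frames()
```

Because the dependent column is computed rather than filtered, the relationship between columns holds for every generated frame.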

What do you think of this, @cosmicBboy? It could be relatively simple to implement (if it fits your design choices). If so, I volunteer to create a PR for this.

francesco086, Feb 21 '23 12:02