
Idea: DataFrame Validation State Caching For Runtime Optimization - Only Validate What Needs To Be Validated!

Open lior5654 opened this issue 2 years ago • 2 comments

Motivation: Suppose you have a module with 20 functions that receive a dataframe of the same schema (for instance, each plots some graph based on the dataframe). Because you want to ensure correctness, you use @pa.check_types on each of them.
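A minimal sketch of that setup; PlotSchema and the plotting functions here are hypothetical stand-ins for the module's 20 functions:

```python
import pandera as pa
from pandera.typing import DataFrame, Series

class PlotSchema(pa.DataFrameModel):
    x: Series[float]
    y: Series[float]

@pa.check_types
def plot_scatter(df: DataFrame[PlotSchema]) -> None:
    ...  # each call validates df against PlotSchema

@pa.check_types
def plot_histogram(df: DataFrame[PlotSchema]) -> None:
    ...  # and so does this one, on the same dataframe
```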

Now suppose you are a user of the module described above, and you call 10 of these functions on the same dataframe. If I understand correctly, the dataframe will be validated 10 times in this situation.

The Idea: The proposal consists of two parts:

  1. Cache whether a dataframe has already been validated and has not been modified since (a rough sketch of this part follows the list below).
  2. Advanced: track which columns were modified, and skip re-validating checks that only involve unmodified columns.
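To make part 1 concrete, here is a rough, hypothetical sketch of the caching idea. The _pandera_validated flag and the use of df.attrs are illustrative choices, not pandera API, and in-place mutations would silently bypass the check:

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})

def cached_validate(schema: pa.DataFrameSchema, df: pd.DataFrame) -> pd.DataFrame:
    # "_pandera_validated" is a hypothetical marker; df.attrs propagation
    # across pandas operations is inconsistent, so this only illustrates the idea.
    if df.attrs.get("_pandera_validated") == id(schema):
        return df  # already validated against this schema, skip the work
    out = schema.validate(df)
    out.attrs["_pandera_validated"] = id(schema)
    return out
```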

lior5654 · Oct 03 '23 13:10

The above suggestion may have a significant impact when doing data-science research on big data, or when some validations are abnormally computationally heavy.

lior5654 · Oct 03 '23 13:10

A basic version of this is already done by the check_types decorator 😃. See here:

  • Every time schema.validate is called, DataFrame.pandera.add_schema is invoked to attach the schema to the underlying object, where it is accessible via df.pandera.schema.
  • If check_types encounters a df.pandera.schema that is the same as the one declared in the function's type hint, pandera will skip validation.
  • This works well for dataframes that are not mutated in place, since any copy-based transformation on the validated dataframe resets the df.pandera accessor, meaning that df.pandera.schema will be None.
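A quick way to observe this behavior (a minimal sketch; the example data and schema are illustrative, and exact accessor behavior may vary across pandera versions):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})

df = schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
print(df.pandera.schema)   # the schema attached by validate

df2 = df.assign(y=1.0)     # copy-based transformation
print(df2.pandera.schema)  # None: accessor state is not carried over
```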

I don't think there are actually unit tests for this, so feel free to try it out and add unit tests and documentation for this feature.
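As a starting point, a hypothetical unit test might look something like this (the schema, function, and test names are made up for illustration):

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.DataFrameModel):
    x: Series[int]

@pa.check_types
def passthrough(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df

def test_validated_schema_is_attached():
    df = Schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
    # the accessor should now carry the schema, which check_types can reuse
    assert df.pandera.schema is not None
    passthrough(df)  # should not raise, and ideally should skip re-validation
```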

Advanced: track which columns were modified, and skip re-validating checks that only involve unmodified columns.

Why not just validate the dataframe once, then use the validated dataframe in the downstream functions? That simpler pattern is sketched below.

This seems like a complicated feature, but happy to be proven wrong 😅. If you can create a small proof-of-concept gist or PR, I'd be happy to review and discuss.

FYI, there is a pandas_accessor module that you can use to add arbitrary metadata to the dataframe. The main issue is that any copy-based transformation (anything that involves method chaining) will blow away any state stored in the accessor.
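A minimal, hypothetical sketch of the validate-once pattern, assuming a shared schema class; the plotting helpers are undecorated and trust their caller:

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    x: Series[int]

def plot_a(df: pd.DataFrame) -> None:
    ...  # no @pa.check_types: assumes the caller validated

def plot_b(df: pd.DataFrame) -> None:
    ...

df = Schema.validate(pd.DataFrame({"x": [1, 2, 3]}))  # validate exactly once
plot_a(df)
plot_b(df)
```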

cosmicBboy · Oct 03 '23 16:10