Idea: DataFrame Validation State Caching For Runtime Optimization - Only Validate What Needs To Be Validated!
Motivation

Suppose you have a module with 20 functions that receive a dataframe of the same schema (for instance, each plots some graph based on the dataframe). Because you want to ensure correctness, you use `@pa.check_types` on each of them.
Now suppose you are a user of the module described above, and you call 10 of these functions on a dataframe. If I understand correctly, the dataframe will be validated 10 times in this situation.
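To make the scenario concrete, here is a minimal sketch of such a module (the schema and function names are illustrative, not from any real codebase):

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class PlotSchema(pa.DataFrameModel):
    x: Series[float]
    y: Series[float]

@pa.check_types
def plot_histogram(df: DataFrame[PlotSchema]) -> None:
    ...  # plotting logic elided

@pa.check_types
def plot_scatter(df: DataFrame[PlotSchema]) -> None:
    ...  # plotting logic elided

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
plot_histogram(df)  # validates df against PlotSchema
plot_scatter(df)    # the same, unchanged dataframe is validated again
```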
The Idea

The idea consists of two parts:
- Cache whether a dataframe has been validated, and has not been modified since.
- Advanced: track which columns were modified, and avoid re-validating unmodified columns with validations that only involve them.
The above suggestion may have a significant impact when doing data science research on big data, or when some validations are abnormally computationally heavy.
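A rough sketch of the first part, assuming a hypothetical `validate_cached` wrapper and a content-hash fingerprint (none of this is pandera API):

```python
import hashlib

import pandas as pd
import pandera as pa
from pandas.util import hash_pandas_object

# hypothetical cache: dataframe fingerprint -> ids of schemas it already passed
_validation_cache: dict[str, set[int]] = {}

def _fingerprint(df: pd.DataFrame) -> str:
    # content hash of the whole dataframe; O(n), but cheaper than heavy checks
    return hashlib.sha256(
        hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()

def validate_cached(schema: pa.DataFrameSchema, df: pd.DataFrame) -> pd.DataFrame:
    key = _fingerprint(df)
    if id(schema) in _validation_cache.get(key, set()):
        return df  # validated before, and the data hasn't changed since
    validated = schema.validate(df)
    _validation_cache.setdefault(key, set()).add(id(schema))
    return validated
```

Fingerprinting is itself O(n), so this only pays off when validation is substantially more expensive than a hash; a real implementation would also need a safer cache key than `id(schema)`.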
A basic version of this is already done by the `check_types` decorator 😃 see here:
- Every time `schema.validate` is called, `DataFrame.pandera.add_schema` is invoked to add a schema to the underlying object, which is accessible via `df.pandera.schema`.
- If `check_types` encounters a `df.pandera.schema` that's the same as the one defined in the function type hint, pandera will skip validation.
- This works well for dataframes that are not mutated in place, since transformations on the validated dataframe will reset the `df.pandera` accessor, meaning that `df.pandera.schema` will be `None`.
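A quick illustration of that behavior (assuming a pandera version where importing pandera registers the `.pandera` pandas accessor):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})

validated = schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
print(validated.pandera.schema)  # the schema attached by add_schema

# a copy-based transformation produces a new object with a fresh accessor
transformed = validated.assign(y=1.0)
print(transformed.pandera.schema)  # None: the validation state is gone
```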
I don't think there are actually unit tests for this, so feel free to try it out and add unit tests and documentation for this feature.
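A starting point for such a test might look like this (a sketch, not an existing test; whether the attached schema compares by identity or equality would need checking against the actual `add_schema` implementation):

```python
import pandas as pd
import pandera as pa

def test_validate_attaches_schema():
    schema = pa.DataFrameSchema({"x": pa.Column(int)})
    validated = schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
    assert validated.pandera.schema is schema

def test_copy_transform_resets_accessor():
    schema = pa.DataFrameSchema({"x": pa.Column(int)})
    validated = schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
    assert validated.assign(y=1.0).pandera.schema is None
```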
> Advanced: track which columns were modified, and avoid re-validating unmodified columns with validations that only involve them.
Why not just validate the dataframe once, then use the dataframe in the downstream functions? This seems like a complicated feature, but I'm happy to be proven wrong 😅. If you can create a small proof-of-concept gist or PR, I'd be happy to review and discuss. FYI, there is a `pandas_accessor` module that you can use to add arbitrary metadata to the dataframe -- the main issue is that any copy-based transformation (anything that involves method chaining) will blow away any state stored in the accessor.
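For anyone attempting that proof of concept, a column-level variant of the caching idea could look like the sketch below. All names are hypothetical, and the registry is keyed by column name only for brevity; a real version would also have to key on the dataframe itself and, as noted above, somehow survive copy-based transformations:

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

# hypothetical registry: column name -> content hash at last validation
_column_hashes: dict[str, str] = {}

def _column_fingerprint(df: pd.DataFrame, column: str) -> str:
    return hashlib.sha256(
        hash_pandas_object(df[column], index=True).values.tobytes()
    ).hexdigest()

def columns_needing_validation(df: pd.DataFrame) -> list[str]:
    # only columns whose content changed since the last validation
    return [
        column
        for column in df.columns
        if _column_hashes.get(column) != _column_fingerprint(df, column)
    ]

def mark_validated(df: pd.DataFrame) -> None:
    for column in df.columns:
        _column_hashes[column] = _column_fingerprint(df, column)
```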