
Idea: DataFrame Validation State Caching For Runtime Optimization - Only Validate What Needs To Be Validated!

Open lior5654 opened this issue 2 years ago • 2 comments

Motivation: Suppose you have a module with 20 functions that receive a dataframe of the same schema (for instance, each plots some graph based on the dataframe). Because you want to ensure correctness, you use @pa.check_types on each of them.
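A minimal sketch of that setup; PlotSchema and the plotting functions here are hypothetical stand-ins for the module's 20 functions:

```python
import pandera as pa
from pandera.typing import DataFrame, Series

class PlotSchema(pa.DataFrameModel):
    x: Series[float]
    y: Series[float]

@pa.check_types
def plot_scatter(df: DataFrame[PlotSchema]) -> None:
    ...  # each call validates df against PlotSchema

@pa.check_types
def plot_histogram(df: DataFrame[PlotSchema]) -> None:
    ...  # and so does this one, on the same dataframe
```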

Now suppose you are a user of the module described above, and you call 10 of these functions on the same dataframe. If I understand correctly, the dataframe will be validated 10 times in this situation.

The Idea: The proposal consists of two parts:

  1. Cache whether a dataframe has already been validated and has not been modified since (a rough sketch of this part follows the list below).
  2. Advanced: track which columns were modified, and skip re-validating checks that only involve unmodified columns.
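To make part 1 concrete, here is a rough, hypothetical sketch of the caching idea. The _pandera_validated flag and the use of df.attrs are illustrative choices, not pandera API, and in-place mutations would silently bypass the check:

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})

def cached_validate(schema: pa.DataFrameSchema, df: pd.DataFrame) -> pd.DataFrame:
    # "_pandera_validated" is a hypothetical marker; df.attrs propagation
    # across pandas operations is inconsistent, so this only illustrates the idea.
    if df.attrs.get("_pandera_validated") == id(schema):
        return df  # already validated against this schema, skip the work
    out = schema.validate(df)
    out.attrs["_pandera_validated"] = id(schema)
    return out
```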

lior5654 · Oct 03 '23 13:10

The above suggestion may have a significant impact when doing data-science research on big data, or when some validations are abnormally computationally heavy.

lior5654 · Oct 03 '23 13:10

A basic version of this is already done by the check_types decorator 😃. See here:

  • Every time schema.validate is called, DataFrame.pandera.add_schema is invoked to attach the schema to the underlying object, where it is accessible via df.pandera.schema.
  • If check_types encounters a df.pandera.schema that is the same as the one declared in the function's type hint, pandera will skip validation.
  • This works well for dataframes that are not mutated in place, since any copy-based transformation on the validated dataframe resets the df.pandera accessor, meaning that df.pandera.schema will be None.
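A quick way to observe this behavior (a minimal sketch; the example data and schema are illustrative, and exact accessor behavior may vary across pandera versions):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"x": pa.Column(int)})

df = schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
print(df.pandera.schema)   # the schema attached by validate

df2 = df.assign(y=1.0)     # copy-based transformation
print(df2.pandera.schema)  # None: accessor state is not carried over
```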

I don't think there are actually unit tests for this, so feel free to try it out and add unit tests and documentation for this feature.
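As a starting point, a hypothetical unit test might look something like this (the schema, function, and test names are made up for illustration):

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series

class Schema(pa.DataFrameModel):
    x: Series[int]

@pa.check_types
def passthrough(df: DataFrame[Schema]) -> DataFrame[Schema]:
    return df

def test_validated_schema_is_attached():
    df = Schema.validate(pd.DataFrame({"x": [1, 2, 3]}))
    # the accessor should now carry the schema, which check_types can reuse
    assert df.pandera.schema is not None
    passthrough(df)  # should not raise, and ideally should skip re-validation
```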

Advanced: track which columns were modified, and skip re-validating checks that only involve unmodified columns.

Why not just validate the dataframe once, then use the validated dataframe in the downstream functions? That simpler pattern is sketched below.

This seems like a complicated feature, but happy to be proven wrong 😅. If you can create a small proof-of-concept gist or PR, I'd be happy to review and discuss.

FYI, there is a pandas_accessor module that you can use to add arbitrary metadata to the dataframe. The main issue is that any copy-based transformation (anything that involves method chaining) will blow away any state stored in the accessor.
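A minimal, hypothetical sketch of the validate-once pattern, assuming a shared schema class; the plotting helpers are undecorated and trust their caller:

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):
    x: Series[int]

def plot_a(df: pd.DataFrame) -> None:
    ...  # no @pa.check_types: assumes the caller validated

def plot_b(df: pd.DataFrame) -> None:
    ...

df = Schema.validate(pd.DataFrame({"x": [1, 2, 3]}))  # validate exactly once
plot_a(df)
plot_b(df)
```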

cosmicBboy · Oct 03 '23 16:10