
How to properly validate a `polars.LazyFrame`?

Open csubhodeep opened this issue 1 year ago • 2 comments

Question about pandera

Hello pandera community, I am trying out pandera to validate a normal polars.LazyFrame as described in the first example in the docs.

Now if I understood the docs correctly, by design, calling the validate method on the LazyFrame would only check the schema. I have the following questions:

  1. What does the user gain by declaring a pandera.DataFrameSchema when they can just use the == operator to compare the schema with a pre-defined polars.Schema object?
  2. If we want in-depth data validation on the LazyFrame, we have to call the collect method on it first. But suppose the frame has, let's say, 50 columns and the pandera.DataFrameSchema defines only 3 of them: does it make sense to pull all 50 columns into memory?

Would it make more sense to control this behaviour inside the validate method? That way pandera could add a projection that selects only the columns defined in the pandera.DataFrameSchema, run the validation checks, and finally call collect internally, instead of asking the user to call collect before validating.

csubhodeep avatar Aug 04 '24 15:08 csubhodeep

For example (2), can't you just select the columns you want to validate before collecting?

butterlyn avatar Aug 06 '24 12:08 butterlyn

> For example (2), can't you just select the columns you want to validate before collecting?

@butterlyn do we do the same for pandas? If not, then I am not sure why we need to make an exception in the usage only for polars.

csubhodeep avatar Aug 06 '24 16:08 csubhodeep