How to properly validate a `polars.LazyFrame`?
Question about pandera
Hello pandera community, I am trying out pandera to validate a plain `polars.LazyFrame`, as described in the first example in the docs.
If I understood the docs correctly, calling the `validate` method on a `LazyFrame` only checks the schema, by design. I have the following questions:
- What is the extra benefit for the user of declaring a `pandera.DataFrameSchema` when they can just use the `==` operator to compare the schema with a pre-defined `polars.Schema` object?
- If we want in-depth data validation on the `LazyFrame`, we should call the `collect` method on it first. But if the frame has, let's say, 50 columns and the `pandera.DataFrameSchema` defines only 3 of them, does it make sense to pull the remaining columns into memory as well?
Would it make more sense to control this behaviour inside the `validate` method? That way pandera could add a projection that selects only the columns defined in the `pandera.DataFrameSchema`, run the validation checks, and then call `collect` internally, instead of asking the user to call `collect` before validating.
For example (2), can't you just select the columns you want to validate before collecting?
@butterlyn do we do the same for pandas? If not, then I am not sure why we need to make an exception wrt the usage only for polars