pandera Support conversion from Pandera DataFrameSchema or DataFrameModel to PySpark StructType

Open dom-mcloughlin opened this issue 1 year ago • 2 comments

When reading data into a PySpark DataFrame, you can pass an input schema (a StructType, which is a collection of StructField objects). If there is no input schema, Spark has to infer one, which is costly and less explicit.

To avoid duplication between defining a PySpark input schema and a pandera validation schema, it would be neat to be able to convert between the two.
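To make the idea concrete, here is a minimal sketch of the kind of mapping such a conversion would need. It is not pandera's or PySpark's API: the `SPARK_SQL_TYPES` table and the `to_spark_ddl` helper are hypothetical names, and the sketch assumes the validation schema's columns have already been reduced to a plain dict of pandas-style dtype names. It produces a DDL-formatted schema string, which the PySpark reader accepts as an alternative to a StructType.

```python
# Hypothetical mapping from pandas-style dtype names to Spark SQL type names.
# Not part of pandera or PySpark; extend it for the dtypes you actually use.
SPARK_SQL_TYPES = {
    "int32": "INT",
    "int64": "BIGINT",
    "float32": "FLOAT",
    "float64": "DOUBLE",
    "bool": "BOOLEAN",
    "str": "STRING",
    "datetime64[ns]": "TIMESTAMP",
}


def to_spark_ddl(columns: dict[str, str]) -> str:
    """Build a Spark DDL schema string from {column name: dtype name}."""
    parts = []
    for name, dtype in columns.items():
        try:
            spark_type = SPARK_SQL_TYPES[dtype]
        except KeyError:
            raise ValueError(f"no Spark type known for {name!r}: {dtype!r}") from None
        # Backtick-quote the column name so names with spaces stay valid DDL.
        parts.append(f"`{name}` {spark_type}")
    return ", ".join(parts)


# Column dtypes as they might be read off a validation schema (illustrative).
columns = {"id": "int64", "price": "float64", "name": "str"}
ddl = to_spark_ddl(columns)

# With a live SparkSession, this string could be passed to the reader:
#   spark.read.schema(ddl).csv("data.csv")
```

A DDL string keeps the sketch free of a Spark dependency. A real conversion would more likely build StructField objects directly so that nullability and nested types carry over.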

Another possible solution would be to update the PySpark reader so it also accepts Pandera types, but it feels like a pandera responsibility to match Spark's existing standard.

Sep 01 '23 08:09 dom-mcloughlin

My team is also looking at this and would really like it as a feature.

I have written a proposed solution that allows you to define a schema with pandera.pyspark and then create a PySpark DataFrame from any schema definition without having to explicitly define the StructType. It also works with inheritance.

Please see the issue here: Support conversion from DataFrameModel to PySpark StructType #1434. Hopefully you can use the proposed solution in the interim.

Nov 28 '23 05:11 Garett601

@dom-mcloughlin, I just opened PR #1570. Could you take a look at it, please?

Apr 12 '24 13:04 filipeo2-mck
