
Support conversion from Pandera DataFrameSchema or DataFrameModel to PySpark StructType

Open dom-mcloughlin opened this issue 1 year ago • 2 comments

When reading data into a PySpark DataFrame, you can pass an input schema (a StructType, which is a collection of StructField objects). Without an input schema, Spark has to infer one, which is costly and less explicit.

To avoid duplication between defining a PySpark input schema and a pandera validation schema, it would be neat to be able to convert between the two.
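As a rough illustration of the idea, here is a minimal sketch that derives a Spark schema from a single column definition. It does not use pandera or PySpark: `ColumnSpec`, `SPARK_DDL_TYPES`, and `to_spark_ddl` are hypothetical names invented for this example, and the type mapping is an assumption. The output is a DDL-formatted string, which `DataFrameReader.schema` accepts as an alternative to a StructType.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for one column of a validation schema."""
    dtype: type
    nullable: bool = True


# Assumed mapping from Python types to Spark SQL type names.
SPARK_DDL_TYPES = {int: "BIGINT", float: "DOUBLE", str: "STRING", bool: "BOOLEAN"}


def to_spark_ddl(columns: dict[str, ColumnSpec]) -> str:
    """Build a Spark DDL schema string from the column definitions."""
    parts = []
    for name, spec in columns.items():
        if spec.dtype not in SPARK_DDL_TYPES:
            raise TypeError(f"no Spark type mapping for column {name!r}")
        null_clause = "" if spec.nullable else " NOT NULL"
        # Backticks quote the column name so unusual names stay valid.
        parts.append(f"`{name}` {SPARK_DDL_TYPES[spec.dtype]}{null_clause}")
    return ", ".join(parts)


# One definition, reused for both validation and reading.
schema = {"id": ColumnSpec(int, nullable=False), "name": ColumnSpec(str)}
ddl = to_spark_ddl(schema)
```

With a Spark session available, the string could then be passed as `spark.read.schema(ddl).csv(path)`, so the reader and the validation step share one source of truth. A real implementation would map pandera's own data types rather than plain Python types.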

Another possible solution would be to update the PySpark reader so that it also accepts pandera types, but it feels like pandera's responsibility to match Spark's existing standard.

dom-mcloughlin avatar Sep 01 '23 08:09 dom-mcloughlin

My team is also looking at this and would really like it as a feature.

I have written a proposed solution that allows you to define a schema with pandera.pyspark and then create a PySpark DataFrame from any schema definition without having to explicitly define the StructType. It also works with inheritance.

Please see the issue here: Support conversion from DataFrameModel to PySpark StructType #1434. Hopefully you can use the proposed solution in the interim.

Garett601 avatar Nov 28 '23 05:11 Garett601

@dom-mcloughlin, I just opened PR #1570. Could you take a look at it, please?

filipeo2-mck avatar Apr 12 '24 13:04 filipeo2-mck