pandera
Support conversion from Pandera DataFrameSchema or DataFrameModel to PySpark StructType
When reading data into a PySpark DataFrame, you can pass an input schema (a StructType, which is a collection of StructField objects). Without an input schema, Spark has to infer one, which is costly and less explicit.
To avoid duplication between defining a PySpark input schema and a pandera validation schema, it would be neat to be able to convert between the two.
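As a rough illustration of the idea, here is a minimal sketch that derives a Spark schema from a single column definition. It deliberately uses only the standard library: `ColumnSpec` and `to_ddl` are hypothetical names standing in for a pandera schema and the requested conversion, not pandera's API. It relies on the fact that the PySpark reader accepts a DDL-formatted string as well as a StructType.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for one column of a validation schema."""
    name: str
    spark_type: str        # Spark SQL type name, e.g. "INT" or "STRING"
    nullable: bool = True


def to_ddl(columns: Iterable[ColumnSpec]) -> str:
    """Build a Spark DDL schema string from the column definitions.

    Assumes simple column names; names with special characters
    would need backtick quoting.
    """
    parts = []
    for col in columns:
        suffix = "" if col.nullable else " NOT NULL"
        parts.append(f"{col.name} {col.spark_type}{suffix}")
    return ", ".join(parts)


# Single source of truth for both validation and reading.
columns = [
    ColumnSpec("id", "INT", nullable=False),
    ColumnSpec("name", "STRING"),
]
ddl = to_ddl(columns)
print(ddl)

# With a SparkSession available, the string could be passed to the reader:
#   df = spark.read.schema(ddl).csv("path/to/data.csv")
```

The same definitions would then drive both the reader schema and the validation rules, which is the duplication this issue wants to remove.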
Another possible solution would be to update the PySpark reader so it also accepts Pandera types, but it feels like a pandera responsibility to match Spark's existing standard.
My team is also looking at this and would really like it as a feature.
I have written a proposed solution that allows you to define a schema with pandera.pyspark and then create a PySpark DataFrame from any schema definition without having to explicitly define the StructType. It also works with inheritance.
Please see the issue here: Support conversion from DataFrameModel to PySpark StructType #1434. Hopefully you can use the proposed solution in the interim.
@dom-mcloughlin, I just opened PR #1570. Could you take a look at it, please?