pandera
Column Order Validation using Pyspark SQL Data Validation is not Working.
Column order validation is not working in Pyspark SQL Data Validation.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
Create a Basic Model with column order checking enabled using the example shown in the documentation as a guide.
import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pandera.pyspark import DataFrameModel
spark = SparkSession.builder.getOrCreate()
class PanderaSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product: T.StringType() = pa.Field(str_startswith="B")

    class Config:
        name = "BaseSchema"
        # My understanding is that this should tell Pandera to validate the column order
        ordered = True
# Create a Spark Dataframe with Columns in the Wrong Order
data = [
    ("Bread", 6),
    ("Butter", 15),
]
wrong_order_schema = T.StructType(
    [
        # Columns are out of order
        T.StructField("product", T.StringType(), False),
        T.StructField("id", T.IntegerType(), False),
    ],
)
df = spark.createDataFrame(data, wrong_order_schema)
# Validate the Dataframe
validation_results = PanderaSchema.validate(df)
validation_results.pandera.errors
Resulting Output
{}
Expected behavior
The out-of-order columns should be detected, and there should be an error within the validation_results.pandera.errors object.
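Until the ordering check reports this, a manual comparison of column positions could serve as a stopgap. The following is only a minimal sketch: `find_out_of_order` is a hypothetical helper, not part of pandera, and it assumes the expected order is simply the field order of the model above.

```python
from typing import List


def find_out_of_order(expected: List[str], actual: List[str]) -> List[str]:
    """Return the expected columns that sit at a different position in `actual`.

    Hypothetical helper, not part of pandera. Columns missing from
    `actual` are skipped; only positional mismatches are reported.
    """
    mismatched = []
    for position, name in enumerate(expected):
        if name in actual and actual.index(name) != position:
            mismatched.append(name)
    return mismatched


# Expected order follows the model fields; actual order mirrors the
# wrong-order DataFrame from the code sample.
expected_columns = ["id", "product"]
actual_columns = ["product", "id"]
out_of_order = find_out_of_order(expected_columns, actual_columns)
print(out_of_order)
```

With a real Spark DataFrame, `df.columns` would take the place of `actual_columns`.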
Desktop (please complete the following information):
- Executed in a Databricks Notebook, Runtime: 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12)
- pandera[pyspark]==0.16.1 installed as an additional Library on the Spark Notebook cluster
- Python version: 3.8.10
Hi @juskaiser, can you try installing the development version of pandera? This issue should be addressed on the main branch:
# Validate the Dataframe
validation_results = PanderaSchema.validate(df)
print(validation_results.pandera.errors)
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x7f9a6a6e0d30>, {'SCHEMA': defaultdict(<class 'list'>, {'COLUMN_NOT_ORDERED': [{'schema': 'BaseSchema', 'column': 'BaseSchema', 'check': 'column_ordered', 'error': "column 'product' out-of-order"}, {'schema': 'BaseSchema', 'column': 'BaseSchema', 'check': 'column_ordered', 'error': "column 'id' out-of-order"}]})})
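The printed report is a nested mapping: error category, then error type, then a list of entries. A rough sketch of flattening it into readable rows follows. `iter_error_messages` is a hypothetical helper, not a pandera function, and `sample_errors` is a hand-built stand-in copied from the output above; it assumes other error types use the same nesting.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def iter_error_messages(
    errors: Dict[str, Dict[str, List[dict]]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (category, error_type, message) for every entry in the report.

    Hypothetical helper; assumes the category -> type -> list-of-dicts
    shape seen in the printed output.
    """
    for category, by_type in errors.items():
        for error_type, entries in by_type.items():
            for entry in entries:
                yield category, error_type, entry["error"]


# Stand-in for validation_results.pandera.errors, built by hand from the
# printed output rather than produced by pandera.
sample_errors = defaultdict(lambda: defaultdict(list))
sample_errors["SCHEMA"]["COLUMN_NOT_ORDERED"].extend(
    [
        {"schema": "BaseSchema", "column": "BaseSchema",
         "check": "column_ordered", "error": "column 'product' out-of-order"},
        {"schema": "BaseSchema", "column": "BaseSchema",
         "check": "column_ordered", "error": "column 'id' out-of-order"},
    ]
)

for category, error_type, message in iter_error_messages(sample_errors):
    print(f"{category} / {error_type}: {message}")
```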
Will be cutting a 0.17.0 release tomorrow!
@cosmicBboy, I'm having trouble getting the requirements-dev to install on my spark cluster, but will test the new release when it is available. Thanks!