Column Order Validation using Pyspark SQL Data Validation is not Working.

Open juskaiser opened this issue 1 year ago • 3 comments

Setting ordered = True in a DataFrameModel Config does not flag out-of-order columns when validating a PySpark SQL DataFrame.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

Create a Basic Model with column order checking enabled using the example shown in the documentation as a guide.

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pandera.pyspark import DataFrameModel

spark = SparkSession.builder.getOrCreate()

class PanderaSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product: T.StringType() = pa.Field(str_startswith="B")

    class Config:
        name = "BaseSchema"
        # My understanding is that this should tell pandera to validate the column order
        ordered = True

# Create a Spark Dataframe with Columns in the Wrong Order
data = [
    ("Bread", 6),
    ("Butter", 15),
]

wrong_order_schema = T.StructType(
    [
        # Columns are out of order
        T.StructField("product", T.StringType(), False),
        T.StructField("id", T.IntegerType(), False),
    ],
)

df = spark.createDataFrame(data, wrong_order_schema)

# Validate the Dataframe
validation_results = PanderaSchema.validate(df)
validation_results.pandera.errors

Resulting Output

{}

Expected behavior

The out-of-order columns should be detected, and validation_results.pandera.errors should contain a corresponding error.
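
For reference, the check expected here can be sketched in plain Python, without Spark. This is only an illustration of a positional comparison and is not pandera's implementation; the helper name out_of_order_columns is hypothetical.

```python
from typing import List


def out_of_order_columns(expected: List[str], actual: List[str]) -> List[str]:
    """Return the columns in `actual` whose position differs from `expected`."""
    return [
        name
        for position, name in enumerate(actual)
        if position >= len(expected) or expected[position] != name
    ]


schema_columns = ["id", "product"]  # order declared in PanderaSchema
df_columns = ["product", "id"]      # order used in wrong_order_schema

print(out_of_order_columns(schema_columns, df_columns))
```

With the two lists above, both columns are reported, since neither sits at its declared position.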

Desktop (please complete the following information):

  • Executed in a Databricks Notebook, Runtime: 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12)
  • pandera[pyspark]==0.16.1 installed as an additional Library on the Spark Notebook cluster
  • Python version: 3.8.10

juskaiser, Sep 21 '23 17:09

Hi @juskaiser, can you try installing the development version of pandera? This issue should be addressed on the main branch:

# Validate the Dataframe
validation_results = PanderaSchema.validate(df)
print(validation_results.pandera.errors)
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x7f9a6a6e0d30>, {'SCHEMA': defaultdict(<class 'list'>, {'COLUMN_NOT_ORDERED': [{'schema': 'BaseSchema', 'column': 'BaseSchema', 'check': 'column_ordered', 'error': "column 'product' out-of-order"}, {'schema': 'BaseSchema', 'column': 'BaseSchema', 'check': 'column_ordered', 'error': "column 'id' out-of-order"}]})})
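
The errors object printed above is a nested mapping: error category, then reason code, then a list of entries. A minimal sketch for flattening it into readable rows follows. It uses a plain dict copied from that output rather than a live validation result, and the helper name flatten_errors is hypothetical.

```python
from typing import Dict, List, Tuple

# Plain-dict copy of the structure shown in the printed output.
errors = {
    "SCHEMA": {
        "COLUMN_NOT_ORDERED": [
            {
                "schema": "BaseSchema",
                "column": "BaseSchema",
                "check": "column_ordered",
                "error": "column 'product' out-of-order",
            },
            {
                "schema": "BaseSchema",
                "column": "BaseSchema",
                "check": "column_ordered",
                "error": "column 'id' out-of-order",
            },
        ]
    }
}


def flatten_errors(
    nested: Dict[str, Dict[str, List[dict]]]
) -> List[Tuple[str, str, str]]:
    """Flatten category -> reason code -> entries into (category, code, message) rows."""
    rows = []
    for category, by_code in nested.items():
        for code, entries in by_code.items():
            for entry in entries:
                rows.append((category, code, entry["error"]))
    return rows


for row in flatten_errors(errors):
    print(row)
```

An empty result, as in the original report where errors was {}, yields no rows at all.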

cosmicBboy, Sep 21 '23 17:09

Will be cutting a 0.17.0 release tomorrow!

cosmicBboy, Sep 21 '23 17:09

@cosmicBboy, I'm having trouble getting requirements-dev to install on my Spark cluster, but I will test the new release when it is available. Thanks!

juskaiser, Sep 21 '23 18:09