pandera Missing `reason_code` when using custom checks with PySpark dataframes

Open • melvinkokxw opened this issue 9 months ago • 1 comment

Describe the bug

Using a custom check with a PySpark dataframe raises the exception `AttributeError: 'NoneType' object has no attribute 'name'`.

The cause is that `reason_code` is not provided when raising `SchemaError` after a failed custom check, specifically here: https://github.com/unionai-oss/pandera/blob/d2bfed03e107358d60266108478711cdbe704e9c/pandera/backends/pyspark/base.py#L99-L107

The failure then surfaces when the errors are collected here: https://github.com/unionai-oss/pandera/blob/d2bfed03e107358d60266108478711cdbe704e9c/pandera/api/base/error_handler.py#L127

Trying to access .name on the non-existent reason_code (i.e. None) causes an AttributeError.
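
To illustrate the failure mode without needing Spark or pandera, here is a minimal sketch of the pattern described above. It is not pandera's actual code: `ReasonCode`, `FakeSchemaError`, `collect`, and `collect_guarded` are hypothetical stand-ins, and the guarded variant is only one possible way to avoid the crash, not a proposed patch.

```python
from enum import Enum
from typing import Optional


class ReasonCode(Enum):
    """Hypothetical stand-in for an error reason-code enum."""
    DATAFRAME_CHECK = "dataframe_check"


class FakeSchemaError(Exception):
    """Hypothetical stand-in for a schema error carrying an optional reason code."""

    def __init__(self, message: str, reason_code: Optional[ReasonCode] = None):
        super().__init__(message)
        self.reason_code = reason_code


def collect(err: FakeSchemaError) -> dict:
    # Reads .name unconditionally, so a missing reason code (None) raises
    # AttributeError instead of reporting the validation failure.
    return {"reason_code": err.reason_code.name, "error": str(err)}


def collect_guarded(err: FakeSchemaError) -> dict:
    # Assumed alternative: fall back to a placeholder when no code was set.
    code = err.reason_code
    return {
        "reason_code": code.name if code is not None else "UNKNOWN",
        "error": str(err),
    }


if __name__ == "__main__":
    failed_check = FakeSchemaError("custom_check failed")  # no reason_code given
    try:
        collect(failed_check)
    except AttributeError as exc:
        print(f"AttributeError: {exc}")
    print(collect_guarded(failed_check))
```

In this sketch the unguarded `collect` hides the real validation failure behind an unrelated `AttributeError`, which matches the symptom reported here; supplying a reason code at the raise site would avoid the problem at its source.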

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandera.
[ ] (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandera.pyspark as psa
import pyspark.sql as ps
from pandera.extensions import register_check_method
from pyspark.sql import types as T

@register_check_method
def custom_check(pyspark_df: ps.DataFrame):
    return False

class Schema(psa.DataFrameModel):
    field1: T.IntegerType() = psa.Field()
    field2: T.IntegerType() = psa.Field()

    class Config:
        custom_check = ()

spark = ps.SparkSession.builder.appName("example").getOrCreate()

schema = T.StructType([
    T.StructField("field1", T.IntegerType(), True),
    T.StructField("field2", T.IntegerType(), True),
])

data = [(1, 2)]

df = spark.createDataFrame(data, schema)
Schema.validate(df)

Expected behavior

Validation should fail and raise a SchemaError (or SchemaErrors?), not an AttributeError.

Desktop (please complete the following information):

OS: macOS
Browser: NA
Version: pandera 0.19.3 (also exists on 0.19.0)

May 15 '24 05:05 melvinkokxw

I just stumbled upon the same issue. Glad someone already reported it and hopefully, it can be fixed soon...

May 30 '24 06:05 MatthiasRoels
