Missing `reason_code` when using custom checks with PySpark dataframes
Describe the bug
Using a custom check with a PySpark dataframe raises the exception `AttributeError: 'NoneType' object has no attribute 'name'`.
The cause is that `reason_code` is not provided when raising `SchemaError` after a failed custom check, specifically here: https://github.com/unionai-oss/pandera/blob/d2bfed03e107358d60266108478711cdbe704e9c/pandera/backends/pyspark/base.py#L99-L107
The failure then surfaces when errors are collected here: https://github.com/unionai-oss/pandera/blob/d2bfed03e107358d60266108478711cdbe704e9c/pandera/api/base/error_handler.py#L127
Accessing `.name` on the missing `reason_code` (i.e. `None`) causes the `AttributeError`.
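The failure mode described above can be reproduced without pandera or Spark. The sketch below is only an illustration of the pattern, not pandera's actual code: `ReasonCode`, `FakeSchemaError`, `collect`, and `demo` are hypothetical stand-ins, and the assumption is simply that the error is created without a `reason_code` and the collector reads `.name` unconditionally.

```python
from enum import Enum
from typing import Optional


class ReasonCode(Enum):
    """Hypothetical stand-in for a reason-code enum."""

    DATAFRAME_CHECK = "dataframe_check"


class FakeSchemaError(Exception):
    """Hypothetical stand-in for SchemaError; reason_code defaults to None."""

    def __init__(self, message: str, reason_code: Optional[ReasonCode] = None):
        super().__init__(message)
        self.reason_code = reason_code


def collect(error: FakeSchemaError) -> dict:
    # Same pattern as the failing line: .name is read without a None check.
    return {"reason_code": error.reason_code.name, "error": str(error)}


def demo() -> tuple:
    # Error raised without a reason_code, as reported for failed custom checks.
    try:
        collect(FakeSchemaError("custom_check failed"))
        missing = "no error"
    except AttributeError as exc:
        missing = str(exc)
    # Error raised with a reason_code, so collection succeeds.
    ok = collect(FakeSchemaError("custom_check failed", ReasonCode.DATAFRAME_CHECK))
    return missing, ok
```

Calling `demo()` returns the `AttributeError` message for the first case and a populated dictionary for the second, which suggests that passing a reason code at the raise site would be enough to avoid the crash in this simplified model.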
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pandera.pyspark as psa
import pyspark.sql as ps
from pandera.extensions import register_check_method
from pyspark.sql import types as T


@register_check_method
def custom_check(pyspark_df: ps.DataFrame):
    return False


class Schema(psa.DataFrameModel):
    field1: T.IntegerType() = psa.Field()
    field2: T.IntegerType() = psa.Field()

    class Config:
        custom_check = ()


spark = ps.SparkSession.builder.appName("example").getOrCreate()
schema = T.StructType([
    T.StructField("field1", T.IntegerType(), True),
    T.StructField("field2", T.IntegerType(), True),
])
data = [(1, 2)]
df = spark.createDataFrame(data, schema)
Schema.validate(df)
Expected behavior
Validation should fail and raise a `SchemaError` (or `SchemaErrors`?), not an `AttributeError`.
Desktop (please complete the following information):
- OS: macOS
- Browser: NA
- Version: pandera 0.19.3 (also exists on 0.19.0)
I just stumbled upon the same issue. Glad someone already reported it and hopefully, it can be fixed soon...