pandera

Pyspark validation doesn't validate joint uniqueness

Open andy-baldwin-kc opened this issue 1 year ago • 3 comments

The PySpark DataFrameSchema accepts the unique parameter but doesn't actually validate uniqueness.

  • [ x ] I have checked that this issue has not already been reported.
  • [ x ] I have confirmed this bug exists on the latest version of pandera.
  • [ x ] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample

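import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session so the sample runs on its own.
spark = SparkSession.builder.getOrCreate()

# Columns "a" and "c" are declared jointly unique.
schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
)

# Two identical rows, so the (a, c) pair (1, 3) appears twice.
data = [
    (1, 2, 3),
    (1, 2, 3),
]

spark_schema = T.StructType(
    [
        T.StructField("a", T.IntegerType(), False),
        T.StructField("b", T.IntegerType(), False),
        T.StructField("c", T.IntegerType(), False),
    ],
)
df = spark.createDataFrame(data, spark_schema)
df_out = schema.validate(check_obj=df)

# Expected a joint uniqueness error here, but none is reported.
df_out.pandera.errors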

Expected behavior

The errors object within the pandera attribute should contain an error for the failed joint uniqueness validation, mirroring the error thrown by the equivalent pandas code as documented here.
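To make the expected check concrete, here is a minimal pure-Python sketch of what joint uniqueness over ["a", "c"] means for the sample data. The helper name joint_duplicates is hypothetical and is not part of pandera; it only illustrates the semantics, not how any backend implements them.

```python
from collections import Counter


def joint_duplicates(rows, columns, subset):
    """Map each repeated combination of the `subset` columns to its count."""
    # Positions of the subset columns within each row tuple.
    idx = [columns.index(name) for name in subset]
    # Count how often each combination of subset values occurs.
    counts = Counter(tuple(row[i] for i in idx) for row in rows)
    # Keep only the combinations that violate joint uniqueness.
    return {key: n for key, n in counts.items() if n > 1}


columns = ["a", "b", "c"]
data = [(1, 2, 3), (1, 2, 3)]

duplicates = joint_duplicates(data, columns, ["a", "c"])
print(duplicates)  # {(1, 3): 2}
```

For the data in the code sample, the pair (a, c) = (1, 3) occurs twice, which is the violation the errors object would be expected to report.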

Additional context

I'm also open to temporary workarounds. Note that registering custom checks at the dataframe level (rather than the single-column level) also generates errors with the pandera pyspark backend.

andy-baldwin-kc, Aug 02 '23 21:08

@NeerajMalhotra-QB @jaskaransinghsidana FYI

cosmicBboy, Aug 03 '23 13:08

If I recall correctly, we disabled it to avoid performance issues on large datasets. It can certainly be added if anyone wants it, but be mindful that it will require a full item-by-item scan.

NeerajMalhotra-QB, Aug 03 '23 15:08

It would be a good feature to have, especially for smaller dataframes!

Neele22, Oct 05 '23 11:10