pandera
Pyspark validation doesn't validate joint uniqueness
The PySpark `DataFrameSchema` accepts a `unique` parameter but does not actually validate uniqueness.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [x] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample
```python
import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession  # added so the sample runs standalone

spark = SparkSession.builder.getOrCreate()

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
)

data = [
    (1, 2, 3),
    (1, 2, 3),
]

spark_schema = T.StructType(
    [
        T.StructField("a", T.IntegerType(), False),
        T.StructField("b", T.IntegerType(), False),
        T.StructField("c", T.IntegerType(), False),
    ],
)

df = spark.createDataFrame(data, spark_schema)
df_out = schema.validate(check_obj=df)
df_out.pandera.errors  # empty, even though rows duplicate the ("a", "c") combination
```
Expected behavior
The errors object within the pandera attribute should contain an error for the failed joint uniqueness validation, mirroring the error thrown by the equivalent pandas code as documented here.
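To make the expected semantics concrete: joint uniqueness over `["a", "c"]` means the *combination* of values in those columns must occur at most once across rows. A minimal pure-Python sketch of that check (the helper name and signature are illustrative, not pandera's API):

```python
from collections import Counter

def joint_duplicates(rows, schema_cols, subset):
    """Return (value-tuple, count) pairs that violate joint uniqueness.

    rows: list of row tuples; schema_cols: column names in row order;
    subset: columns whose combined values must be unique.
    """
    idx = [schema_cols.index(c) for c in subset]
    counts = Counter(tuple(row[i] for i in idx) for row in rows)
    return [(vals, n) for vals, n in counts.items() if n > 1]

# The sample data above has two identical rows, so ("a", "c") = (1, 3) repeats:
print(joint_duplicates([(1, 2, 3), (1, 2, 3)], ["a", "b", "c"], ["a", "c"]))
# → [((1, 3), 2)]
```

This is what the pandas backend reports as a `SchemaError`; the pyspark backend currently stays silent.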
Additional context
I'm also open to temporary workarounds; registering custom checks at the dataframe (rather than single-column) level also generates errors for the pyspark pandera backend.
@NeerajMalhotra-QB @jaskaransinghsidana FYI
If I recall correctly, we disabled it to avoid performance issues on large datasets. It can certainly be added if anyone wants it, but be mindful that it will require a full item-by-item scan.
It would be a good feature to have, especially for smaller dataframes!