deequ
deequ copied to clipboard
Usage of window functions inside checks
Hello,
I would like to test a constraint on my dataset where an item id is associated with multiple timestamp ranges (valid_from, valid_to).
The purpose of this test is to check if there are time gaps within subsequent ranges, so I am doing the following:
val noTimeGapsConstraint = "valid_to IS NULL OR valid_to = lead(valid_from, 1) over (partition by item_id order by valid_from)"
Check(CheckLevel.Error, "").satisfies(noTimeGapsConstraint, "No Time Gaps Constraint")
But Deequ is complaining about the fact that I am using a window function inside an aggregation function.
org.apache.spark.sql.AnalysisException: It is not allowed to use a window function inside an aggregate function. Please use the inner window function in a sub-query.;
What would be the proper way to test such constraint?
Hi,
This error message is coming from Spark not Deequ. I am afraid that you cannot use windowing functions with deequ at the moment, unfortunately.
Thanks @sscdotopen,
I know the error comes from Spark, presumably because Deequ runs these checks on an aggregated version of the data instead of the full dataset?
Is this something that could be added in the future? Despite such checks being quite strict/expensive, I can see several use cases where they could be really useful.
Deequ generates efficient aggregation queries to compute the statistics required for the checks. We don't have immediate plans to integrate windowing functions, but we would be happy to receive a contribution that adds such functionality.