deequ icon indicating copy to clipboard operation
deequ copied to clipboard

Usage of window functions inside checks

Open mng1dev opened this issue 4 years ago • 3 comments

Hello,

I would like to test a constraint on my dataset where an item id is associated with multiple timestamp ranges (valid_from, valid_to).

The purpose of this test is to check if there are time gaps within subsequent ranges, so I am doing the following:

val noTimeGapsConstraint = "valid_to IS NULL OR valid_to = lead(valid_from, 1) over (partition by item_id order by valid_from)"

Check(CheckLevel.Error, "").satisfies(noTimeGapsConstraint, "No Time Gaps Constraint")

But Deequ is complaining about the fact that I am using a window function inside an aggregation function.

org.apache.spark.sql.AnalysisException: It is not allowed to use a window function inside an aggregate function. Please use the inner window function in a sub-query.;

What would be the proper way to test such constraint?

mng1dev avatar Aug 03 '20 09:08 mng1dev

Hi,

This error message is coming from Spark not Deequ. I am afraid that you cannot use windowing functions with deequ at the moment, unfortunately.

sscdotopen avatar Aug 04 '20 05:08 sscdotopen

Thanks @sscdotopen,

I know the error comes from Spark, presumably because Deequ runs these checks on an aggregated version of the data instead of the full dataset?

Is this something that could be added in the future? Despite such checks being quite strict/expensive, I can see several use cases where they could be really useful.

mng1dev avatar Aug 04 '20 07:08 mng1dev

Deequ generates efficient aggregation queries to compute the statistics required for the checks. We don't have immediate plans to integrate windowing functions, but we would be happy to receive a contribution that adds such functionality.

sscdotopen avatar Aug 04 '20 07:08 sscdotopen