datajudge `AggregateNumericRangeEquality`: Mismatch between implementation and documentation

The doc string of the corresponding add_*_constraint method claims the following:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence if we have, for a given key, n rows (in other words, the cardinality of the multiset) and a start_value of k, I would expect a range to be complete if exactly the following rows exist:

(key, k)
(key, k+1)
...
(key, k+n-1)

Yet, the implementation checks the following:

https://github.com/Quantco/datajudge/blob/035031836b20c30e4afac27067000cec83692d95/src/datajudge/constraints/groupby.py#L36-L37

On the one hand, it revolves around the maximal encountered value instead of the cardinality of the set. On the other hand, the start value is added to said maximum.

It is easy to come up with an example where both outlined behaviours diverge. Assume start_value to equal k and the observed rows to correspond to this:

(key, k)
(key, k+1)
(key, k+2)

According to the former definition - as described in the doc string - this would be a legitimate key. According to the latter definition, we would expect

(key, k)
(key, k+1)
(key, k+2)
...
(key, k+2+k)

which would flag the current key as a failure for some k.

We do not notice this diverging behaviour in our tests since our tests only use start_value=1.

What is intended behaviour?

Jul 25 '22 16:07 kklein

From what I remember, and by reading the code, the idea was to test for continuous keys of the aggregation when such keys are integers and having an offset.

Unfortunately, I didn't get exactly the issues you present here.

What you get from the aggregation are pairs (k, n_k). for example (1, 12), (2, 1), (3,100). In this case, a test would be fine if start_value=1 but it'd failt for start_value=0.

Jul 31 '22 22:07 ivergara

Unfortunately, I didn't get exactly the issues you present here.

In the example above we observe

(key, k)
(key, k+1)
(key, k+2)

The docstring says:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence, if start_value = k, this should be an accepted input since the cardinality is 3 and we observe values k, k + 1 and k + 2.

Yet, according to the current implementation we would not accept the input. The current implementation accepts the input only if range(start, max(values) + start) is fully covered.

Note that in the general case it doesn't hold that range(start, max(values) + start) = range(start, start + cardinality)

In the example above we have max(values) = k + 2 and start = k, implying that

max(values) + start = 2*k + 2
start + cardinality = k + 3

Nov 09 '22 11:11 kklein