
Enhancement Request: Histogram with aggregate on a second column

Open · rpasupat opened this issue on Sep 07, 2021 · 4 comments

We have a requirement to store an aggregate (for example, a sum) of a "second" column alongside the histogram counts. We would like to know whether such a feature is already being considered.

I want to analyse the trends based on the second column's aggregate (while picking the top N entries from the histogram).

I imagine the implementation as the addition of a new metric case class and a new analyser that computes an aggregate of the second column (in addition to the count) and fills the metric as part of compute.
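For illustration, the kind of per-bin metric I have in mind might look roughly like this (a hypothetical sketch only; these classes do not exist in deequ, and the names are made up):

```scala
// Hypothetical sketch only -- none of these classes exist in deequ today.
// Each histogram bin would carry the usual count and ratio plus an
// aggregate (e.g. a sum) computed over a second column.
case class AggregatedDistributionValue(
  absolute: Long,          // number of rows falling into this bin
  ratio: Double,           // absolute divided by the total number of rows
  secondColumnSum: Double  // sum of the second column within this bin
)

case class AggregatedDistribution(
  values: Map[String, AggregatedDistributionValue], // bin value -> per-bin stats
  numberOfBins: Long
)
```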

Suggestions on how to achieve this with current features are much appreciated.

rpasupat avatar Sep 07 '21 10:09 rpasupat

Just to clarify: you want to track an aggregate metric for each of the histogram bins, is that right? So this would be logically similar to e.g. data.groupBy("firstColumn").agg(count("*"), sum("secondColumn")).

I am not sure how this could be achieved with the current API. @lange-labs @tdhd @TammoR Can you weigh in?

twollnik avatar Sep 07 '21 15:09 twollnik

Yes exactly. Thanks very much @twollnik

I eagerly look forward to input from the deequ experts here: either a solution or workaround using the existing APIs, or confirmation that this is a worthwhile requirement that cannot be achieved with the existing features. That will help me confirm and decide how to proceed with my requirement.

rpasupat avatar Sep 08 '21 12:09 rpasupat

One idea for a workaround would be to calculate the aggregates using regular spark, e.g. df.groupBy("firstColumn").agg(sum("secondColumn")). Then, you could associate the output of this aggregation with the deequ results based on the histogram bin. Would this workaround satisfy your use-case?
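A rough sketch of what that could look like (assuming a recent deequ version; the exact flattened metric names and the join logic are illustrative, and spark/df are assumed to be in scope):

```scala
import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import org.apache.spark.sql.functions.{col, regexp_extract, sum}

// 1. Compute the histogram with deequ as usual.
val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Histogram("firstColumn"))
  .run()

// Flatten the deequ metrics into a DataFrame. For a Histogram, the per-bin
// counts appear as rows whose name looks like "Histogram.abs.<binValue>"
// (the exact naming may differ between deequ versions).
val deequMetrics = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)

// 2. Compute the second-column aggregate with plain Spark.
val sumsByBin = df
  .groupBy("firstColumn")
  .agg(sum("secondColumn").as("secondColumnSum"))

// 3. Associate the two results on the bin value, e.g. by extracting the bin
//    from the metric name and joining on it.
val histogramCounts = deequMetrics
  .where(col("name").startsWith("Histogram.abs."))
  .withColumn("bin", regexp_extract(col("name"), "^Histogram\\.abs\\.(.*)$", 1))

val combined = histogramCounts
  .join(sumsByBin, histogramCounts("bin") === sumsByBin("firstColumn"))
```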

twollnik avatar Oct 19 '21 08:10 twollnik

@twollnik Thanks for your response. Appreciate it.

If I understand it right, you are suggesting that I store the aggregate (e.g. the sum) separately and then match it against the histogram bins. Is that right?

We store a set of histogram metrics in the metrics repository and later, for new data, compare them with what is in the history. Doing this ourselves would mean storing similar entries for the sum in a separate file or database and then comparing them with what we retrieve from the metrics repository. If we go the custom route (I mean the direct Spark route), I would rather store both "count" and "sum" side by side, which would make my job a lot easier.
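For example, the side-by-side version in plain Spark would roughly be (column names and the top-N cutoff are illustrative):

```scala
import org.apache.spark.sql.functions.{count, desc, sum}

// Count and second-column sum side by side, one row per histogram bin.
val countAndSumByBin = df
  .groupBy("firstColumn")
  .agg(
    count("*").as("binCount"),
    sum("secondColumn").as("secondColumnSum")
  )

// Keep only the top N bins by count (per the trend analysis described above)
// and persist the result so it can be compared against future runs.
val topN = 50
val topBinsWithSums = countAndSumByBin
  .orderBy(desc("binCount"))
  .limit(topN)
```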

I will read through the other tickets mentioned here.

rpasupat avatar Nov 02 '21 05:11 rpasupat