
Enhancement Request: Histogram with aggregate on a second column

Open · rpasupat opened this issue on Sep 07, 2021 · 4 comments

We have a requirement to store an aggregate (for example, a sum) of a "second" column alongside the histogram counts. We would like to know whether such a feature is already being considered.

I want to analyse the trends based on the second column's aggregate (while picking the top N entries from the histogram).

I imagine the implementation as the addition of a new metric case class and a new analyser that computes an aggregate of the second column (in addition to the count) and fills the metric as part of compute.
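For illustration, the kind of per-bin metric I have in mind might look roughly like this (a hypothetical sketch only; these classes do not exist in deequ, and the names are made up):

```scala
// Hypothetical sketch only -- none of these classes exist in deequ today.
// Each histogram bin would carry the usual count and ratio plus an
// aggregate (e.g. a sum) computed over a second column.
case class AggregatedDistributionValue(
  absolute: Long,          // number of rows falling into this bin
  ratio: Double,           // absolute divided by the total number of rows
  secondColumnSum: Double  // sum of the second column within this bin
)

case class AggregatedDistribution(
  values: Map[String, AggregatedDistributionValue], // bin value -> per-bin stats
  numberOfBins: Long
)
```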

Suggestions on how to achieve this with current features are much appreciated.

rpasupat avatar Sep 07 '21 10:09 rpasupat

Just to clarify: you want to track an aggregate metric for each of the histogram bins, is that right? So this would be logically similar to e.g. data.groupBy("firstColumn").agg(count("*"), sum("secondColumn")).

I am not sure how this could be achieved with the current API. @lange-labs @tdhd @TammoR Can you weigh in?

twollnik avatar Sep 07 '21 15:09 twollnik

Yes exactly. Thanks very much @twollnik

I eagerly look forward to input from the deequ experts here: either a solution or workaround using the existing APIs, or confirmation that this is a worthwhile requirement that cannot be achieved with the existing features. That will help me confirm and decide how to proceed with my requirement.

rpasupat avatar Sep 08 '21 12:09 rpasupat

One idea for a workaround would be to calculate the aggregates using regular spark, e.g. df.groupBy("firstColumn").agg(sum("secondColumn")). Then, you could associate the output of this aggregation with the deequ results based on the histogram bin. Would this workaround satisfy your use-case?
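A rough sketch of what that could look like (assuming a recent deequ version; the exact flattened metric names and the join logic are illustrative, and spark/df are assumed to be in scope):

```scala
import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import org.apache.spark.sql.functions.{col, regexp_extract, sum}

// 1. Compute the histogram with deequ as usual.
val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Histogram("firstColumn"))
  .run()

// Flatten the deequ metrics into a DataFrame. For a Histogram, the per-bin
// counts appear as rows whose name looks like "Histogram.abs.<binValue>"
// (the exact naming may differ between deequ versions).
val deequMetrics = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)

// 2. Compute the second-column aggregate with plain Spark.
val sumsByBin = df
  .groupBy("firstColumn")
  .agg(sum("secondColumn").as("secondColumnSum"))

// 3. Associate the two results on the bin value, e.g. by extracting the bin
//    from the metric name and joining on it.
val histogramCounts = deequMetrics
  .where(col("name").startsWith("Histogram.abs."))
  .withColumn("bin", regexp_extract(col("name"), "^Histogram\\.abs\\.(.*)$", 1))

val combined = histogramCounts
  .join(sumsByBin, histogramCounts("bin") === sumsByBin("firstColumn"))
```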

twollnik avatar Oct 19 '21 08:10 twollnik

@twollnik Thanks for your response. Appreciate it.

If I understand it right, you are suggesting that I store the aggregate (e.g. the sum) separately and then match it against the histogram bins. Is that right?

We store a set of histogram metrics in the metrics repository and later, for new data, compare them with what is in the history. Doing this ourselves would mean storing similar entries for the sum in a separate file or database and then comparing them with what we retrieve from the metrics repository. If we go the custom route (I mean the direct Spark route), I would rather store both "count" and "sum" side by side, which would make my job a lot easier.
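For example, the side-by-side version in plain Spark would roughly be (column names and the top-N cutoff are illustrative):

```scala
import org.apache.spark.sql.functions.{count, desc, sum}

// Count and second-column sum side by side, one row per histogram bin.
val countAndSumByBin = df
  .groupBy("firstColumn")
  .agg(
    count("*").as("binCount"),
    sum("secondColumn").as("secondColumnSum")
  )

// Keep only the top N bins by count (per the trend analysis described above)
// and persist the result so it can be compared against future runs.
val topN = 50
val topBinsWithSums = countAndSumByBin
  .orderBy(desc("binCount"))
  .limit(topN)
```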

I will read through the other tickets mentioned here.

rpasupat avatar Nov 02 '21 05:11 rpasupat