Enhancement Request: Histogram with aggregate on a second column
We have a requirement to store an aggregate (for example, a sum) of a "second" column along with the histogram counts. We would like to know whether such a feature is already being considered.
I want to analyse the trends based on the second column's aggregate (while picking the top N entries from the histogram).
I imagine the implementation would involve adding a new metric case class and a new analyser that computes an aggregate of the second column (in addition to the count) and fills the metric as part of compute.
Suggestions on how to achieve this with the current features would be much appreciated.
Just to clarify: You want to track an aggregate metric for each of the histogram bins, is that right? So, this would be logically similar to e.g. data.groupBy("firstColumn").agg(count("*"), sum("secondColumn")).
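For concreteness, a runnable sketch of that aggregation in plain Spark (the DataFrame data, the input path, and the top-N limit are placeholder assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, sum}

val spark = SparkSession.builder().appName("histogram-with-aggregate").getOrCreate()

// assumption: a categorical "firstColumn" and a numeric "secondColumn"
val data = spark.read.parquet("/path/to/input")

// count per bin (what the histogram reports today) plus the sum of the second column
val binsWithAggregate = data
  .groupBy("firstColumn")
  .agg(count("*").as("binCount"), sum("secondColumn").as("secondColumnSum"))
  .orderBy(col("binCount").desc)  // pick the top N bins by count
  .limit(10)
```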
I am not sure how this could be achieved with the current API. @lange-labs @tdhd @TammoR Can you weigh in?
Yes exactly. Thanks very much @twollnik
I eagerly look forward to a solution from the deequ experts here: either a solution/workaround using existing APIs, or confirmation that this is a good requirement that cannot be achieved with existing features. That will help me make a decision on my requirement.
One idea for a workaround would be to calculate the aggregates using regular spark, e.g. df.groupBy("firstColumn").agg(sum("secondColumn")). Then, you could associate the output of this aggregation with the deequ results based on the histogram bin. Would this workaround satisfy your use-case?
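A sketch of that workaround, assuming the standard AnalysisRunner/Histogram API and that the flattened histogram metric rows are named like Histogram.abs.<binValue> (worth double-checking against your deequ version; df and spark are assumed to exist):

```scala
import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import org.apache.spark.sql.functions.{col, regexp_extract, sum}

// histogram counts from deequ, one metric row per bin
val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Histogram("firstColumn"))
  .run()

val histogramMetrics = successMetricsAsDataFrame(spark, analysisResult)
  // assumption: absolute bin counts are reported under names like "Histogram.abs.<binValue>"
  .filter(col("name").startsWith("Histogram.abs."))
  .withColumn("bin", regexp_extract(col("name"), "Histogram\\.abs\\.(.*)", 1))

// aggregate of the second column per bin, computed with regular Spark
val secondColumnSums = df
  .groupBy("firstColumn")
  .agg(sum("secondColumn").as("secondColumnSum"))
  .withColumnRenamed("firstColumn", "bin")

// associate the deequ histogram counts with the per-bin sums
val combined = histogramMetrics.join(secondColumnSums, Seq("bin"))
```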
@twollnik Thanks for your response. Appreciate it.
If I understand it right, you are suggesting that we store the aggregate (e.g. the sum) separately and then compare it with the histogram bins. Is that right?
We store a set of histogram metrics in the metrics repository and later, for new data, compare them with what is in history. Doing it custom would mean storing similar entries for the sum in a different file/database and then comparing them with what we retrieve from the metrics repository. If we go the custom route (the direct Spark route, I mean), I would rather store both "count" and "sum" side by side, which would make my job a lot easier.
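If we did go the direct Spark route, a minimal sketch of what I mean by storing count and sum side by side and comparing against history (the output path and date column are placeholders) could be:

```scala
import org.apache.spark.sql.functions.{count, current_date, sum}

// current snapshot: count and sum per bin, tagged with a snapshot date
val snapshot = df
  .groupBy("firstColumn")
  .agg(count("*").as("binCount"), sum("secondColumn").as("secondColumnSum"))
  .withColumn("snapshotDate", current_date())

// append next to earlier snapshots instead of keeping the sums in a separate store (placeholder path)
snapshot.write.mode("append").partitionBy("snapshotDate").parquet("/metrics/histogram_with_sum")

// later runs read the history back and compare the new counts/sums per bin against it
val history = spark.read.parquet("/metrics/histogram_with_sum")
```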
I will read through the other tickets mentioned here.