Optimize summaryStatistics for primitive Bags
Currently, summaryStatistics uses the default implementation, which does not optimize for duplicates in Bags. It iterates over every element in the Bag.
The method summaryStatistics should be overridden and optimized to use forEachWithOccurrences, as sum already does.
Hi @donraab, this code-generator module idea is fantastic and appealing to me. May I try to solve this issue?
Hi @donraab
I found it's easy to calculate the sum, max, min and count using the Bag's properties, but hard to assemble them into a <primitive>SummaryStatistics.
It seems impossible to change its behavior, which is tightly bound to the forEach loop. Maybe this is why sum(), min(), max() and count() have already been optimized while summaryStatistics was left behind.
I'm not sure whether there is another way to solve it. If there is, could you please give me some hints?
Hi @1ightDance, I was thinking something like this would work. The following test code works.
@Test
public void combineIntSummaryStatistics()
{
    IntSummaryStatistics summary = new IntSummaryStatistics();
    for (int i = 1; i < 6; i++)
    {
        IntSummaryStatistics occurrences = new IntSummaryStatistics(i, i, i, i);
        summary.combine(occurrences);
    }
    Assertions.assertEquals(15L, summary.getSum());
    Assertions.assertEquals(15L, summary.getCount());
    Assertions.assertEquals(1, summary.getMin());
    Assertions.assertEquals(5, summary.getMax());
}
Take a look at the specialized IntSummaryStatistics constructor which takes count, min, max, sum and the combine method which allows instances of stats classes to be merged. Using forEachWithOccurrences on a Bag, the count should be occurrences, the sum should be value * occurrences, min is the value, max is the value.
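To make the mapping above concrete, here is a minimal, self-contained sketch. It is not the Eclipse Collections implementation: a plain Map from value to occurrences stands in for the Bag, and the class name BagSummarySketch and the method summarize are hypothetical names used only for illustration. It relies on the four-argument IntSummaryStatistics constructor (count, min, max, sum).

```java
import java.util.IntSummaryStatistics;
import java.util.LinkedHashMap;
import java.util.Map;

public class BagSummarySketch
{
    // Hypothetical helper: the map (value -> occurrences) stands in for a Bag
    // being visited with forEachWithOccurrences.
    public static IntSummaryStatistics summarize(Map<Integer, Integer> occurrencesByValue)
    {
        IntSummaryStatistics summary = new IntSummaryStatistics();
        for (Map.Entry<Integer, Integer> entry : occurrencesByValue.entrySet())
        {
            int value = entry.getKey();
            int occurrences = entry.getValue();
            // count = occurrences, min = max = value, sum = value * occurrences.
            // The multiplication is done in long to avoid int overflow.
            long sum = (long) value * occurrences;
            summary.combine(new IntSummaryStatistics(occurrences, value, value, sum));
        }
        return summary;
    }

    public static void main(String[] args)
    {
        Map<Integer, Integer> bag = new LinkedHashMap<>();
        bag.put(1, 2); // the value 1 occurs twice
        bag.put(5, 3); // the value 5 occurs three times
        System.out.println(summarize(bag));
    }
}
```

Each distinct value costs one constructor call and one combine, regardless of how many times it occurs.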
Hope this helps. Thanks for volunteering. I have assigned the issue to you.
Thanks for your help, @donraab. My code is loosely based on binary exponentiation, like this:
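@Override
public DoubleSummaryStatistics summaryStatistics()
{
    DoubleSummaryStatistics result = new DoubleSummaryStatistics();
    this.forEachWithOccurrences((double each, int occurrences) ->
    {
        // total accumulates the statistics for `occurrences` copies of `each`
        DoubleSummaryStatistics total = new DoubleSummaryStatistics();
        // temp holds the statistics for 1, 2, 4, ... copies of `each`
        DoubleSummaryStatistics temp = new DoubleSummaryStatistics();
        temp.accept(each);
        for (int i = occurrences; i > 0; i /= 2)
        {
            // If the current low bit is set, merge the current power-of-two block
            if (i % 2 == 1)
            {
                total.combine(temp);
            }
            // Double the block by combining temp with itself
            temp.combine(temp);
        }
        result.combine(total);
    });
    return result;
}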
and I ran some tests to compare the optimized version with the previous code:
previous DoubleHashBag of 5000000 elements cost 111ms. (1.0, 10.0)
optimized DoubleHashBag of 5000000 elements cost 171ms. (1.0, 10.0)
previous IntHashBag of 30000000 elements cost 27ms. (1, 1000)
optimized IntHashBag of 30000000 elements cost 2ms. (1, 1000)
It does improve a lot for short, int, long, etc. But for double and float, where every element occurred only once, my code instead slowed things down by adding extra invocations of the DoubleSummaryStatistics constructor.
For performance reasons, it seems we should treat them differently and optimize only Bags of non-floating-point primitives, but I'm afraid this will hurt the consistency of Bags.
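One possible way to avoid the per-entry allocations, sketched here only as an assumption and not benchmarked: accumulate count, min, max and sum in local variables and build a single DoubleSummaryStatistics at the end with its four-argument constructor (count, min, max, sum), which requires JDK 10 or later. A plain Map from value to occurrences stands in for the Bag, and the names SingleAllocationSketch and summarize are hypothetical. Note that the plain `sum +=` below does not reproduce the compensated summation that DoubleSummaryStatistics.accept performs, so the sum may differ slightly in the last bits.

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;

public class SingleAllocationSketch
{
    // Hypothetical helper: the map (value -> occurrences) stands in for a Bag
    // being visited with forEachWithOccurrences.
    public static DoubleSummaryStatistics summarize(Map<Double, Integer> occurrencesByValue)
    {
        long count = 0L;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (Map.Entry<Double, Integer> entry : occurrencesByValue.entrySet())
        {
            double value = entry.getKey();
            int occurrences = entry.getValue();
            count += occurrences;
            min = Math.min(min, value);
            max = Math.max(max, value);
            // Plain (uncompensated) summation; see the caveat in the lead-in.
            sum += value * occurrences;
        }
        if (count == 0L)
        {
            // An empty bag yields the same result as the no-argument constructor.
            return new DoubleSummaryStatistics();
        }
        // Single allocation, independent of the number of distinct values.
        return new DoubleSummaryStatistics(count, min, max, sum);
    }
}
```

With this shape, a bag in which every element occurs once does one multiplication and a few comparisons per element instead of allocating statistics objects per element.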
Hi @1ightDance, taking a look at IntSummaryStatistics, I didn't realize the new additional constructors for Int/Long/DoubleSummaryStatistics were added in JDK 10. Let's hold off on implementing this until we upgrade Eclipse Collections to JDK 11.
@nikhilnanivadekar @prathasirisha @motlin @mohrezaei I can create the ticket for upgrading EC build to Java 11. This issue will not be able to be completed until we upgrade.