eclipse-collections
eclipse-collections copied to clipboard
min and similar operations are inefficient on a lazy view of a Bag
Hi,
I'm using this kind of code:
Bags.mutable.of("a", "b", "b", "c", "c").asLazy().select(s -> !s.isEmpty()).minBy(String::length);
And the problem is that when the Bag is viewed as a LazyIterable, then minBy relies on each which itself relies of forEachKeyValue of the encapsulated ObjectIntHashMap to execute the procedure for every duplicates of every values present in the bag.
In the case of min and other operations that are side-effect free, this is very inefficient.
I do need to use asLazy because of the select to avoid allocating a whole new collection just to iterate on it.
edit: clarified things
It is really all the iteration patterns on the Bag which delegates to forEachKeyValue(). There are open issues which are raised for that, as the iteration should be once per key and not per keyValue. https://github.com/eclipse/eclipse-collections/issues/420
@nikhilnanivadekar #420 seems to say that HashBag is not concerned by the problem, while the present issue is about HashBag. I couldn't find any issue about this :/
Sorry my description didn't well explain the problem, I've edited it to clarify things :)
@nikhilnanivadekar what do you think about my clarification and above comment? If it's relevant, would you have an idea as how to tackle this problem in term of design? I could try to solve it :)
The red pill option would be to have asLazy() return a LazyBagIterable for Bag. LazyBagIterable would extend LazyIterable. LazyBagIterable select method would also have to return LazyBagIterable.
Warning: Not sure how much work this would result in.
Specialized lazy iterables are a great idea and we should add them, but they aren’t necessary for this optimization. We’re probably just missing an override of minBy. On Mon, Mar 18, 2019 at 7:57 PM Donald Raab [email protected] wrote:
The red pill option would be to have asLazy() return a LazyBagIterable for Bag. LazyBagIterable would extend LazyIterable. LazyBagIterable select method would also have to return LazyBagIterable.
Warning: Not sure how much work this would result in.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/eclipse/eclipse-collections/issues/686#issuecomment-474146160, or mute the thread https://github.com/notifications/unsubscribe-auth/AAO6IkCdSiNT9Vbr2EbWar7HdB0RiR-mks5vYCfZgaJpZM4Z15Gt .
@motlin I'm not clear how overriding minBy (and others) in LazyIterableAdapter (or superclass) is going to solve the problem. It is not aware of the particularity of Bag and only knows an Iterable so it have each (which delegates to Iterate.forEach) available to do its work...
This issue is tricky. I'll share what I've learned so far. First, you're right of course, bag.asLazy().someMethod() iterates over each occurrence for most methods, which is inefficient.
AbstractBag has lots of manual optimizations like if (target instanceof MutableBagIterable<?>). I thought we had similar optimizations for lazy iterables but I guess I remembered wrong.
We have some iteration order tests too, but all using lists. I put together the test below to show that the RichIterable_iterationOrder test does in fact fail, and to debug it. I traced it to CollectIterable.each() which just delegates to forEach(). I think that for every implementation of LazyIterable, we may need to add the manual optimization for Bags.
@RunWith(Java8Runner.class)
public class CollectIterableBagNoIteratorTest implements NoIteratorTestCase, RichIterableWithDuplicatesTestCase, TransformsToBagTrait, UnorderedIterableTestCase
{
public MutableCollection<Integer> expectedIterationOrder()
{
MutableCollection<Integer> forEach = this.newMutableForFilter();
this.getInstanceUnderTest().asLazy().distinct().forEach(forEach::add);
return forEach;
}
@Override
public <T> LazyIterable<T> newWith(T... elements)
{
MutableBag<T> bag = new HashBagNoIteratorTest.HashBagNoIterator<>();
IterableTestCase.addAllTo(elements, bag);
return new CollectIterable<>(bag, Functions.identity());
}
@Override
public <T> Bag<T> getExpectedFiltered(T... elements)
{
return Bags.immutable.with(elements);
}
@Override
public <T> MutableBag<T> newMutableForFilter(T... elements)
{
return Bags.mutable.with(elements);
}
@Override
public void Iterable_remove()
{
throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".Iterable_remove() not implemented yet");
}
@Override
public void Iterable_next()
{
throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".Iterable_next() not implemented yet");
}
@Override
public void Iterable_hasNext()
{
throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".Iterable_hasNext() not implemented yet");
}
@Override
public void RichIterable_getFirst()
{
throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".RichIterable_getFirst() not implemented yet");
}
@Override
public void RichIterable_getLast()
{
throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".RichIterable_getLast() not implemented yet");
}
}
@motlin I was thinking: if there was a method distinct (or distinctView maybe) available on RichIterable, it could make sense that all operations that are expected to be side-effect free (such as min, but also why not the short-circuiting one like anySatisfy and co) should rely on it.
In the case of Bag, this would be implemented naturally, and for other collections, depending on the implementation, there could be more or less optimized versions. I thought of this when I opened #726.
This is not a bad idea since LazyIterable itself do have a distinct method for example.
An alternative could be that calling distinct on a LazyIterable of a Bag would result in this behaviour instead (with the help of a manual instanceof optimization of course). The same could be done for Set, etc.
That being said, I still feel like having distinct directly on the collection and not the LazyIterable seems more elegant…
WDYT?
If you haven't yet, take a look at RichIterableTestCase#RichIterable_iterationOrder and the subclasses HashBagNoIteratorTest.
I mention it because distinct views are fine, but they shouldn't be necessary. You identified a performance bug that we ought to just fix.
On Wed, May 29, 2019 at 12:24 PM Victor Noël [email protected] wrote:
@motlin https://github.com/motlin I was thinking: if there was a method distinct (or distinctView maybe) available on RichIterable, it could make sense that all operations that are expected to be side-effect free (such as min, but also why not the short-circuiting one like anySatisfy and co) should rely on it.
In the case of Bag, this would be implemented naturally, and for other collections, depending on the implementation, there could be more or less optimized versions. I thought of this when I opened #726 https://github.com/eclipse/eclipse-collections/issues/726.
This is not a bad idea since LazyIterable itself do have a distinct method for example.
An alternative could be that calling distinct on a LazyIterable of a Bag would result in this behaviour instead (with the help of a manual instanceof optimization of course). The same could be done for Set, etc. That being said, I still feel like having distinct directly on the collection and not the LazyIterable seems more elegant…
WDYT?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/eclipse/eclipse-collections/issues/686?email_source=notifications&email_token=AAB3UIQVP7CZEZTRDIDEMLTPX2U5PA5CNFSM4GOXSGW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWP4GLA#issuecomment-497009452, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB3UIUPY56AEVSQLFDHKITPX2U5PANCNFSM4GOXSGWQ .
@motlin I'm not sure what I should see in those classes? Do you mean the commented code that refer to the same bug as the one of this issue?
I understand it is a bug, my proposal about distinct views was a way to elegantly implement a solution to this bug, not as a workaround: if there was a way to always get a distinct view from a collection, then you could have generic-defined-once implementations for the side-effect free operations that relies on it.
I made this proposal because, from what I understand, with the current design, it's not going to be straightforward to fix this bug except by introducing a bunch of classes for lazy bags.
I haven't had time to look into it but I feel like there's a middle ground that shouldn't be too hard. And the distinct() view kind of already exists... it's asLazy().distinct(). It may be fine to delegate to that method inside a lazy iterable's min/max method as a short-term fix.
On Mon, Jun 3, 2019 at 2:52 AM Victor Noël [email protected] wrote:
@motlin https://github.com/motlin I'm not sure what I should see in those classes? Do you mean the commented code that refer to the same bug as the one of this issue?
I understand it is a bug, my proposal about distinct views was a way to elegantly implement a solution to this bug, not as a workaround: if there was a way to always get a distinct view from a collection, then you could have generic-defined-once implementations for the side-effect free operations that relies on it.
I made this proposal because, from what I understand, with the current design, it's not going to be straightforward to fix this bug except by introducing a bunch of classes for lazy bags.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/eclipse/eclipse-collections/issues/686?email_source=notifications&email_token=AAB3UIUMA6WLPSP3KE3ZHA3PYS5RVA5CNFSM4GOXSGW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWYO4OY#issuecomment-498134587, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB3UIUM2ZYAUNDMQYWTINLPYS5RVANCNFSM4GOXSGWQ .
@motlin for asLazy().distinct(), the only way to make it be useful, would be to have specific implementation of DistinctIterable for Sets and Bags for example. If not you won't solve the performance problem since DistinctIterable would be highly non efficient.