eclipse-collections icon indicating copy to clipboard operation
eclipse-collections copied to clipboard

min and similar operations are inefficient on a lazy view of a Bag

Open victornoel opened this issue 6 years ago • 13 comments

Hi,

I'm using this kind of code:

Bags.mutable.of("a", "b", "b", "c", "c").asLazy().select(s -> !s.isEmpty()).minBy(String::length);

And the problem is that when the Bag is viewed as a LazyIterable, then minBy relies on each which itself relies of forEachKeyValue of the encapsulated ObjectIntHashMap to execute the procedure for every duplicates of every values present in the bag.

In the case of min and other operations that are side-effect free, this is very inefficient.

I do need to use asLazy because of the select to avoid allocating a whole new collection just to iterate on it.

edit: clarified things

victornoel avatar Jan 08 '19 17:01 victornoel

It is really all the iteration patterns on the Bag which delegates to forEachKeyValue(). There are open issues which are raised for that, as the iteration should be once per key and not per keyValue. https://github.com/eclipse/eclipse-collections/issues/420

nikhilnanivadekar avatar Jan 08 '19 17:01 nikhilnanivadekar

@nikhilnanivadekar #420 seems to say that HashBag is not concerned by the problem, while the present issue is about HashBag. I couldn't find any issue about this :/

victornoel avatar Jan 08 '19 17:01 victornoel

Sorry my description didn't well explain the problem, I've edited it to clarify things :)

victornoel avatar Jan 08 '19 17:01 victornoel

@nikhilnanivadekar what do you think about my clarification and above comment? If it's relevant, would you have an idea as how to tackle this problem in term of design? I could try to solve it :)

victornoel avatar Mar 11 '19 20:03 victornoel

The red pill option would be to have asLazy() return a LazyBagIterable for Bag. LazyBagIterable would extend LazyIterable. LazyBagIterable select method would also have to return LazyBagIterable.

Warning: Not sure how much work this would result in.

donraab avatar Mar 18 '19 23:03 donraab

Specialized lazy iterables are a great idea and we should add them, but they aren’t necessary for this optimization. We’re probably just missing an override of minBy. On Mon, Mar 18, 2019 at 7:57 PM Donald Raab [email protected] wrote:

The red pill option would be to have asLazy() return a LazyBagIterable for Bag. LazyBagIterable would extend LazyIterable. LazyBagIterable select method would also have to return LazyBagIterable.

Warning: Not sure how much work this would result in.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/eclipse/eclipse-collections/issues/686#issuecomment-474146160, or mute the thread https://github.com/notifications/unsubscribe-auth/AAO6IkCdSiNT9Vbr2EbWar7HdB0RiR-mks5vYCfZgaJpZM4Z15Gt .

motlin avatar Mar 19 '19 22:03 motlin

@motlin I'm not clear how overriding minBy (and others) in LazyIterableAdapter (or superclass) is going to solve the problem. It is not aware of the particularity of Bag and only knows an Iterable so it have each (which delegates to Iterate.forEach) available to do its work...

victornoel avatar Mar 24 '19 13:03 victornoel

This issue is tricky. I'll share what I've learned so far. First, you're right of course, bag.asLazy().someMethod() iterates over each occurrence for most methods, which is inefficient.

AbstractBag has lots of manual optimizations like if (target instanceof MutableBagIterable<?>). I thought we had similar optimizations for lazy iterables but I guess I remembered wrong.

We have some iteration order tests too, but all using lists. I put together the test below to show that the RichIterable_iterationOrder test does in fact fail, and to debug it. I traced it to CollectIterable.each() which just delegates to forEach(). I think that for every implementation of LazyIterable, we may need to add the manual optimization for Bags.

@RunWith(Java8Runner.class)
public class CollectIterableBagNoIteratorTest implements NoIteratorTestCase, RichIterableWithDuplicatesTestCase, TransformsToBagTrait, UnorderedIterableTestCase
{
    public MutableCollection<Integer> expectedIterationOrder()
    {
        MutableCollection<Integer> forEach = this.newMutableForFilter();
        this.getInstanceUnderTest().asLazy().distinct().forEach(forEach::add);
        return forEach;
    }

    @Override
    public <T> LazyIterable<T> newWith(T... elements)
    {
        MutableBag<T> bag = new HashBagNoIteratorTest.HashBagNoIterator<>();
        IterableTestCase.addAllTo(elements, bag);
        return new CollectIterable<>(bag, Functions.identity());
    }

    @Override
    public <T> Bag<T> getExpectedFiltered(T... elements)
    {
        return Bags.immutable.with(elements);
    }

    @Override
    public <T> MutableBag<T> newMutableForFilter(T... elements)
    {
        return Bags.mutable.with(elements);
    }

    @Override
    public void Iterable_remove()
    {
        throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".Iterable_remove() not implemented yet");
    }

    @Override
    public void Iterable_next()
    {
        throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".Iterable_next() not implemented yet");
    }

    @Override
    public void Iterable_hasNext()
    {
        throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".Iterable_hasNext() not implemented yet");
    }

    @Override
    public void RichIterable_getFirst()
    {
        throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".RichIterable_getFirst() not implemented yet");
    }

    @Override
    public void RichIterable_getLast()
    {
        throw new UnsupportedOperationException(this.getClass().getSimpleName() + ".RichIterable_getLast() not implemented yet");
    }
}

motlin avatar Mar 27 '19 12:03 motlin

@motlin I was thinking: if there was a method distinct (or distinctView maybe) available on RichIterable, it could make sense that all operations that are expected to be side-effect free (such as min, but also why not the short-circuiting one like anySatisfy and co) should rely on it.

In the case of Bag, this would be implemented naturally, and for other collections, depending on the implementation, there could be more or less optimized versions. I thought of this when I opened #726.

This is not a bad idea since LazyIterable itself do have a distinct method for example.

An alternative could be that calling distinct on a LazyIterable of a Bag would result in this behaviour instead (with the help of a manual instanceof optimization of course). The same could be done for Set, etc. That being said, I still feel like having distinct directly on the collection and not the LazyIterable seems more elegant…

WDYT?

victornoel avatar May 29 '19 16:05 victornoel

If you haven't yet, take a look at RichIterableTestCase#RichIterable_iterationOrder and the subclasses HashBagNoIteratorTest.

I mention it because distinct views are fine, but they shouldn't be necessary. You identified a performance bug that we ought to just fix.

On Wed, May 29, 2019 at 12:24 PM Victor Noël [email protected] wrote:

@motlin https://github.com/motlin I was thinking: if there was a method distinct (or distinctView maybe) available on RichIterable, it could make sense that all operations that are expected to be side-effect free (such as min, but also why not the short-circuiting one like anySatisfy and co) should rely on it.

In the case of Bag, this would be implemented naturally, and for other collections, depending on the implementation, there could be more or less optimized versions. I thought of this when I opened #726 https://github.com/eclipse/eclipse-collections/issues/726.

This is not a bad idea since LazyIterable itself do have a distinct method for example.

An alternative could be that calling distinct on a LazyIterable of a Bag would result in this behaviour instead (with the help of a manual instanceof optimization of course). The same could be done for Set, etc. That being said, I still feel like having distinct directly on the collection and not the LazyIterable seems more elegant…

WDYT?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/eclipse/eclipse-collections/issues/686?email_source=notifications&email_token=AAB3UIQVP7CZEZTRDIDEMLTPX2U5PA5CNFSM4GOXSGW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWP4GLA#issuecomment-497009452, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB3UIUPY56AEVSQLFDHKITPX2U5PANCNFSM4GOXSGWQ .

motlin avatar Jun 02 '19 23:06 motlin

@motlin I'm not sure what I should see in those classes? Do you mean the commented code that refer to the same bug as the one of this issue?

I understand it is a bug, my proposal about distinct views was a way to elegantly implement a solution to this bug, not as a workaround: if there was a way to always get a distinct view from a collection, then you could have generic-defined-once implementations for the side-effect free operations that relies on it.

I made this proposal because, from what I understand, with the current design, it's not going to be straightforward to fix this bug except by introducing a bunch of classes for lazy bags.

victornoel avatar Jun 03 '19 06:06 victornoel

I haven't had time to look into it but I feel like there's a middle ground that shouldn't be too hard. And the distinct() view kind of already exists... it's asLazy().distinct(). It may be fine to delegate to that method inside a lazy iterable's min/max method as a short-term fix.

On Mon, Jun 3, 2019 at 2:52 AM Victor Noël [email protected] wrote:

@motlin https://github.com/motlin I'm not sure what I should see in those classes? Do you mean the commented code that refer to the same bug as the one of this issue?

I understand it is a bug, my proposal about distinct views was a way to elegantly implement a solution to this bug, not as a workaround: if there was a way to always get a distinct view from a collection, then you could have generic-defined-once implementations for the side-effect free operations that relies on it.

I made this proposal because, from what I understand, with the current design, it's not going to be straightforward to fix this bug except by introducing a bunch of classes for lazy bags.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/eclipse/eclipse-collections/issues/686?email_source=notifications&email_token=AAB3UIUMA6WLPSP3KE3ZHA3PYS5RVA5CNFSM4GOXSGW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWYO4OY#issuecomment-498134587, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB3UIUM2ZYAUNDMQYWTINLPYS5RVANCNFSM4GOXSGWQ .

motlin avatar Jun 05 '19 13:06 motlin

@motlin for asLazy().distinct(), the only way to make it be useful, would be to have specific implementation of DistinctIterable for Sets and Bags for example. If not you won't solve the performance problem since DistinctIterable would be highly non efficient.

victornoel avatar Jun 05 '19 13:06 victornoel