
Initial working example of lending iterator for combinations.

Open Easyoakland opened this issue 1 year ago • 5 comments

I was working on another project and realized that 30% of the runtime was the itertools .combinations() adapter allocating vectors on the heap.
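For reference, here is a minimal illustration (using only the published itertools API, not the new adapter) of where that allocation comes from: .combinations() builds a fresh Vec for every combination it yields.

use itertools::Itertools;

fn main() {
    let data = vec![1, 2, 3, 4];
    // `combinations` allocates a new Vec for each yielded combination,
    // which is the heap traffic showing up in the profile.
    for combo in data.iter().combinations(2) {
        println!("{combo:?}"); // combo: Vec<&i32>, freshly allocated each step
    }
}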

Added an implementation of a lending iterator that avoids heap allocations and completely eliminates the heap-allocation speed penalty. I called the adapter combinations_lending. I also added an optional feature that can be enabled to use this and, maybe in the future, other lending iterators.

Added benchmarks comparing .combinations() and .combinations_lending(). They show the lending implementation is roughly 1x-7x the speed of the existing one, depending on the use case. I did not measure a noticeable speed penalty in any of the benchmarks, so a future change could theoretically implement the normal .combinations() iterator in terms of the lending iterator's .into_iter() method.

The feature allows for conditional compilation, so those who don't want a lending iterator don't need to compile the related dependencies.
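To illustrate the conditional-compilation point, a minimal sketch of a cfg-gated module (the feature name here is just a placeholder, not necessarily the one used in the PR):

// Hypothetical feature name, for illustration only.
#[cfg(feature = "lending_iters")]
mod lending {
    pub fn available() {
        println!("lending adapters compiled in");
    }
}

fn main() {
    #[cfg(feature = "lending_iters")]
    lending::available();

    #[cfg(not(feature = "lending_iters"))]
    println!("lending adapters not compiled in");
}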

More info in the source code documentation.

Let me know what you think!

Easyoakland avatar Mar 10 '23 03:03 Easyoakland

I saw that the PR contains quite a few formatting/documentation changes. You can surely include them, but it will simplify the final reviewer's life if they're in a separate commit.

I'm not sure I understand this point. Are you saying I should have separated out the comments and documentation additions to the new code from the new code being commented on?

I am always wary when it comes to including another crate - even if it is behind a feature flag, it is code to properly maintain and test.

That is why there are automated tests, no? If, in the worst case, all the code relying on the dependency broke, the remaining features would still work. It seems like there is no cost beyond the bug risk inherent in any code written.

If you are looking for maximum performance, I am not sure whether the "manipulate-indices and collect indexed items"-approach is the best.

I also do not know if the implementation is the best. I did not change the logic of how combinations are computed. Though it seems worth noting that the indexed items are not being collected as a group in the LendingIterator implementation. That's the point of the new lending implementation. An iterator is returned which returns individual copies instead of a collected Vec.

It may be, but it may also be faster to manipulate items (or references to items) instead. (E.g. if you really just want combinations from the Vec containing 0,1,2,3 you could simply work on the numbers directly.) Did you happen to research that? Does anybody reading this have relevant experience?

My thoughts on this are:

  • The combinations adapter appears to be generic over the items it iterates, with the only restriction being that the items implement Clone. The vec![0,1,2,3] situation of iterating over the items directly would only work if this were limited to Copy items, and I'm not sure it would be faster.
  • Is the current implementation of manipulating indices (usize numbers) really different from manipulating references somehow? How would references be manipulated? Aren't references just pointers which are themselves usize?
  • I don't fully understand LazyBuffer so I don't know if this could be changed to contain references instead of actual items. I also don't know if you would want this extra layer of indirection.
  • If the items of a collection are too unwieldy to clone directly (or don't implement Clone at all), one can do collection.iter().combinations_lending(k) so that the items of iteration are references, and cloning references is as cheap as it gets. If the collection's elements can be easily cloned, then collection.into_iter().combinations_lending(k) can be used instead to avoid an unnecessary level of indirection. Best of both worlds.

The code below was a check I used to confirm that .iter() will result in clones of references. Notice how the function that requires a cloneable item receives a reference to a non-cloneable struct and still compiles.

fn need_cloneable<T: Clone>(y: T) -> T {
    y.clone()
}

#[derive(Debug)]
struct NotCloneableStruct(usize, Vec<usize>);

fn main() {
    let c = vec![
        NotCloneableStruct(1, vec![1, 1]),
        NotCloneableStruct(2, vec![2, 2]),
        NotCloneableStruct(3, vec![3, 3]),
        NotCloneableStruct(4, vec![4, 4]),
    ];

    c.iter()
        .map(|x: &NotCloneableStruct| {
            dbg!(&x);
            dbg!(x.clone());
            dbg!(x);
            dbg!(need_cloneable(x)); // All references implement Clone because of core's impl<T> Clone for &T.
        })
        .for_each(drop);
}

As per the above, merging this PR might be a good idea, but before that we should really find out what we want to offer. Right now the raison d'être of combinations seems to be convenience (at the price of some performance penalty). If we want to have its efficient counterpart, I would strive to make it as efficient as possible (possibly even sacrificing some convenience). I would vote against having something in between.

If I saw a way to further sacrifice convenience for performance I would have done so. I don't see how this would be done.

Easyoakland avatar Mar 22 '23 00:03 Easyoakland

I'm not sure I understand this point. Are you saying I should have separated out the comments and documentation additions to the new code from the new code being commented on?

No, I'm talking about things as https://github.com/rust-itertools/itertools/pull/682/files#diff-01a3ceaed8205eb8f50a269c77b7888890da04e982c9c793943743a1ee4471c3L18-R27 or https://github.com/rust-itertools/itertools/pull/682/files#diff-01a3ceaed8205eb8f50a269c77b7888890da04e982c9c793943743a1ee4471c3L48-R62 or https://github.com/rust-itertools/itertools/pull/682/files#diff-01a3ceaed8205eb8f50a269c77b7888890da04e982c9c793943743a1ee4471c3L53-R108.

I also do not know if the implementation is the best. I did not change the logic of how combinations are computed. Though it seems worth noting that the indexed items are not being collected as a group in the LendingIterator implementation. That's the point of the new lending implementation. An iterator is returned which returns individual copies instead of a collected Vec.

I saw that you do not collect into a Vec - I like that.

  • Is the current implementation of manipulating indices (usize numbers) really different from manipulating references somehow? How would references be manipulated? Aren't references just pointers which are themselves usize?

With indices you'll have to look up the item from the buffer. That is, you essentially have to add the index to the begin-pointer of the buffer. Handling references directly could skip this addition, thus possibly save one indirection. (As always, compilers may be clever enough to optimize some things. And - on top of that - I did not benchmark; it just came to my mind.)
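A rough sketch of the distinction (illustrative only, not code from the PR): with indices every access re-derives the address from the buffer, while stored references are already resolved.

// Index-based: each access adds `indices[i]` to the buffer's base pointer
// (plus a bounds check, unless the optimizer elides it).
fn get_via_index<'a, T>(buffer: &'a [T], indices: &[usize], i: usize) -> &'a T {
    &buffer[indices[i]]
}

// Reference-based: the stored pointer already holds the final address.
fn get_via_ref<'a, T>(refs: &[&'a T], i: usize) -> &'a T {
    refs[i]
}

fn main() {
    let buffer = [10, 20, 30, 40];
    let indices = [0, 2];
    let refs = [&buffer[0], &buffer[2]];
    assert_eq!(get_via_index(&buffer, &indices, 1), &30);
    assert_eq!(get_via_ref(&refs, 1), &30);
}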

If I saw a way to further sacrifice convenience for performance I would have done so. I don't see how this would be done.

I could imagine something like fn combinations_inplace(items: &mut [Item]) -> CombinationsInplace where CombinationsInplace works directly on the slice items - without using indices. I.e. CombinationsInplace would repeatedly call the combinations-counterpart of something like C++'s std::next_permutation, and yield a slice of the first k elements of items. This way, you'd force the caller to first create a mutable slice of items (read: less convenience), but could possibly exploit clever techniques to make it more efficient.
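For concreteness, a rough, unoptimized sketch of that shape (names and design are illustrative, not from the PR): the current combination always lives in the first k slots of the slice, and the caller borrows it. For brevity this uses a callback instead of a lending iterator, and a simple recursive swap scheme rather than a cleverer next_combination step.

// Visit every k-combination of `items`, keeping the current combination in
// the first `k` positions of the slice. The slice is restored to its original
// order when the function returns.
fn for_each_combination<T>(items: &mut [T], k: usize, mut f: impl FnMut(&[T])) {
    fn rec<T>(items: &mut [T], k: usize, depth: usize, start: usize, f: &mut impl FnMut(&[T])) {
        if depth == k {
            f(&items[..k]); // the first `k` slots hold the current combination
            return;
        }
        for i in start..items.len() {
            items.swap(depth, i); // move a candidate into slot `depth`
            rec(items, k, depth + 1, i + 1, f);
            items.swap(depth, i); // undo before trying the next candidate
        }
    }
    rec(items, k, 0, 0, &mut f);
}

fn main() {
    let mut v = [1, 2, 3, 4];
    // Prints [1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4].
    for_each_combination(&mut v, 2, |combo| println!("{combo:?}"));
}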

phimuemue avatar Mar 22 '23 10:03 phimuemue

I think this needs a strong reason that these adapters shouldn't just be in the lending-iterator crate directly.

My instinct is that itertools itself shouldn't provide any lending iterators until there's a lending iterator trait in core.

scottmcm avatar Mar 22 '23 19:03 scottmcm

@scottmcm @jswrenn The reason I thought to implement this functionality here is that the actual logic of the lending iterators is almost identical. The new feature uses the same logic, and the state is held in the same struct. It would be a shame if this had to be implemented in @danielhenrymantilla 's lending-iterator crate, because that would require copy-pasting most of itertools into lending-iterator for all the iterators from itertools that have an effective LendingIterator implementation. Multiple iterators in itertools can benefit from the same lending re-implementation without changing the actual logic of the iterators.

Additionally, the .into_iter() definition can only be as simple as it is because the Iterator and LendingIterator implementations use the same struct and logic, with a flag to indicate which is in use. If this weren't implemented in this crate, the PhantomData trick that makes usage of either implementation simple and unambiguous would not work.

It seems unlikely to me that the best solution to this issue would be to have two of every iterator in two different crates with the exact same struct and nearly identical methods, but I don't see how else to implement this if it is merged in a different crate.

Easyoakland avatar Mar 23 '23 00:03 Easyoakland

@phimuemue

I'm not sure I understand this point. Are you saying I should have separated out the comments and documentation additions to the new code from the new code being commented on?

No, I'm talking about things as https://github.com/rust-itertools/itertools/pull/682/files#diff-01a3ceaed8205eb8f50a269c77b7888890da04e982c9c793943743a1ee4471c3L18-R27 or https://github.com/rust-itertools/itertools/pull/682/files#diff-01a3ceaed8205eb8f50a269c77b7888890da04e982c9c793943743a1ee4471c3L48-R62 or https://github.com/rust-itertools/itertools/pull/682/files#diff-01a3ceaed8205eb8f50a269c77b7888890da04e982c9c793943743a1ee4471c3L53-R108.

Oh, I understand now. That looks like rustfmt. I forgot I had it set to auto-format on save. I will endeavor to put formatting in a separate commit in the future. Thank you.

I also do not know if the implementation is the best. I did not change the logic of how combinations are computed. Though it seems worth noting that the indexed items are not being collected as a group in the LendingIterator implementation. That's the point of the new lending implementation. An iterator is returned which returns individual copies instead of a collected Vec.

I saw that you do not collect into a Vec - I like that.

  • Is the current implementation of manipulating indices (usize numbers) really different from manipulating references somehow? How would references be manipulated? Aren't references just pointers which are themselves usize?

With indices you'll have to look up the item from the buffer. That is, you essentially have to add the index to the begin-pointer of the buffer. Handling references directly could skip this addition, thus possibly save one indirection. (As always, compilers may be clever enough to optimize some things. And - on top of that - I did not benchmark; it just came to my mind.)

How would one handle references directly? Wouldn't you need to store the references being manipulated in an array/buffer/vec anyway? I could very well be misunderstanding, but it seems like the only difference between what you describe and what is implemented is that the current implementation handles references if the iterator is over references and actual values if the iterator is over values, rather than always handling references to whatever is being iterated over. Is the current implementation not preferable?

If I saw a way to further sacrifice convenience for performance I would have done so. I don't see how this would be done.

I could imagine something like fn combinations_inplace(items: &mut [Item]) -> CombinationsInplace where CombinationsInplace works directly on the slice items - without using indices. I.e. CombinationsInplace would repeatedly call the combinations-counterpart of something like C++'s std::next_permutation, and yield a slice of the first k elements of items. This way, you'd force the caller to first create a mutable slice of items (read: less convenience), but could possibly exploit clever techniques to make it more efficient.

The C++ implementation for permutations can exploit its defined ordering (the "set of all permutations is ordered lexicographically with respect to operator<") to determine the next permutation from the current one. What would the definition of the ordering for combinations be? Also, extrapolating the next combination by inspecting the current one seems like it would be slower than keeping track of which combination is next in the iterator's state, as is currently the case.

Easyoakland avatar Mar 23 '23 01:03 Easyoakland