itertools
itertools copied to clipboard
Added `set_size_hint` method
Sometimes there is not enough information available for adapters to correctly deduce a useful size hint. However, a developer might have more information and this new iterator method allows them to express it.
A motivating example from the code:
let data = vec![1, 2, 6, 7, 2];
let result: Vec<i32> = data.iter()
// The `FlatMap` adapter is not able to deduce the size hint itself
// so `size_hint()` would return `(0, None)`.
.flat_map(|&x| {
let repeats = if x % 2 == 0 { 1 } else { 3 };
itertools::repeat_n(x, repeats)
})
// But we know the bounds on the max and min number of items in the
// resulting iterator, so we can provide that information:
.set_size_hint(data.len(), Some(3 * data.len()))
// The `Vec` should not excessively re-allocate, while collecting
// since it now knows the minimum and maximum number of items.
.collect();
I am not 100% happy with the method name, set_size_hint, and there are alternatives in which only one of the bounds is set at a time.
Thank you for your contribution! I'm convinced this is a compelling use-case, but have a few concerns:
- Using internal iteration methods on this iterator will be a significant performance footgun when wrapping iterators that specialize those methods (like
Chain, orIntersperse). We should override the 'root' internal iteration method that other methods tend to call with one that directly calls that method on the inner iterator. Currently, that method isIterator::try_fold. Unfortunately, it was stabilized in 1.27, and Itertools supports 1.24.Iterator::foldis the next best choice. - The name
set_size_hintimplies the size hint of the wrapped iterator is being mutated. An alternative name. Perhapswith_size_hint? Unfortunately, in the standard library, the namewithtends to imply the function consumes a function as its argument. - This feature is a possible candidate for inclusion in the standard library. Per the inclusion policy, could you file a PR to rust-lang/rust?
This seems like a useful functionality.
- Performance-wise, it should call many of the original iterator's implementations in order to preserve the possible optimizations (and
countshould have a custom one based on thesize_hintand maybeskip). - I also feel like calls that consume part of the iterator should update the size hint accordingly. (its size changes as you consume elements).
- Regarding the previous note about the name, I understand the point but I can't think of a legitimate iterator where the size hint would be set-able and where it wouldn't consist in wrapping it into another one so I don't find it too confusing. I have another idea though : how about
overriding_size_hint()?
In any case, +1
I agree that this looks useful.
Could/should we debug_assert that the improved size_hint does not contradict (resp. worsen) the size_hint of the underlying iterator?
Just to note that try_fold is unfortunately out of bounds for itertools until it can be implemented in stable Rust (we can't name the Try trait in stable Rust yet).
@phimuemue I'd skip that check, it risks disrupting the program as much as it helps
@Ten0 I also feel like calls that consume part of the iterator should update the size hint accordingly. (its size changes as you consume elements).
I agree! @peterjoel, what do you think of this approach?
pub struct SetSizeHint<I> {
iter: I,
lower: usize,
upper: Option<usize>,
}
impl<I> Iterator for SetSizeHint<I>
where
I: Iterator,
{
type Item = I::Item;
#[inline(always)]
fn next(&mut self) -> Option<Self::Item> {
let next = self.iter.next();
if next.is_some() {
self.lower = self.lower.saturating_sub(1);
if let Some(ref mut upper) = self.upper {
*upper = upper.saturating_sub(1);
}
}
next
}
#[inline(always)]
fn fold<B, F>(mut self, init: B, mut f: F) -> B
where
Self: Sized,
F: FnMut(B, Self::Item) -> B,
{
let mut consumed: usize = 0;
let result = self.iter.fold(init, |acc, x| {
consumed = consumed.saturating_add(1);
f(acc, x)
});
self.lower = self.lower.saturating_sub(consumed);
if let Some(ref mut upper) = self.upper {
*upper = upper.saturating_sub(consumed);
}
result
}
#[inline(always)]
fn size_hint(&self) -> (usize, Option<usize>) {
(self.lower, self.upper)
}
}
There are three notable changes:
- The lower and upper bounds of the set size hint are decremented with each consumed item to ensure that the user's estimate of the size bounds are adjusted accordingly as they consume elements from the iterator.
foldis specialized to reduce the internal-iteration performance footgun of this adapter.- The
size_hinttuple is flattened into distinct lower and upper bounds fields. This gives rust a little more flexibility in how it lays outSetSizeHint(and potentially saves bytes).
- The lower and upper bounds of the set size hint are decremented with each consumed item to ensure that the user's estimate of the size bounds are adjusted accordingly as they consume elements from the iterator.
I'd want to make sure that the decrement is mostly optimised out in cases where the size hint ends up not being used.
I'd want to make sure that the decrement is mostly optimised out in cases where the size hint ends up not being used.
In these cases probably fold/for_each could be used, and also maybe one could avoid calling set_size_hint/override_size_hint in the first place.
That's true. There is no reason to use the method if you don't need it.
@jswrenn The methods you refer to, which use with to indicate that an argument is a function, always use _with as a suffix, so I think with_size_hint is ok. It also has precedence in with_hasher and with_capacity.