libs-team icon indicating copy to clipboard operation
libs-team copied to clipboard

Sugar for special-casing last iteration

Open asmello opened this issue 8 months ago • 11 comments

Proposal

Problem statement

A common pattern that comes up when writing iterative logic over sequences is to special-case the first or last iterations. There are some existing APIs that help deal with this, like Iterator::fold(), which accepts an initial value for its accumulator, but there isn't a built-in general solution for implementing arbitrary logic that only applies at one extreme of iteration or another.

Special casing the first iteration is generally straightforward, as the first element can simply be handled before the loop. With that in mind, within this issue, let's focus on the mirror problem, of special casing the last iteration.

The most straightforward way to implement this in Rust looks like some variation over this:

for (i, elem) in elements.iter().enumerate() {
    do_something(elem);
    if i != elements.len() - 1 {
        between_elements(elem);
    } else {
        last_element(elem);
    }
}

A functional variant using .map(), .filter() or reduce() would use a similar conditional expression. There is nothing intrinsically wrong with this approach. With this, we assume the iterator iterates over a finite collection with known length, but there is no way around that, since we must be able to detect when to apply the special case. We could, instead, move the special case to outside the loop (and there would be performance benefits from doing so, at the cost of some potential code repetition), but even then we'd need to know the length in advance, so we can stop at the right iteration. While arguably not very elegant, this is a common pattern with little room for improvement.

There is, however, one aspect of this code that is particularly error-prone, and could use some language support. That is the condition i != elements.len() - 1. There are three "issues" with it:

  1. If the programmer forgets to subtract one from the length (to account for 0-indexing), the condition will be off-by-one, which may be one of the most common kind of bugs in all of computer software (sometimes with security implications).
  2. In the consuming variant of the loop (e.g., when using .into_iter()), then calling elements.len() is not possible inside of the loop. This can be easily fixed by assigning the value to a separate variable prior to initiating the loop, but introducing a new variable to the outer scope just for this one purpose makes for messier code.
  3. Checking for not equals puts the common case first and the exception in the else clause (if needed), but it also increases the likelihood that someone will get the logic backwards.

To be clear, these aren't critical issues; we have lived with versions of them since the first programming languages, and in all likelihood, after writing loops like this a few dozen times, most people just become used to it. However, with Rust being a language designed with the explicit intent to minimise bugs, I figured if we can do something to help avoid the pitfalls, we should.

Motivating examples or use cases

A good illustrative example is when implementing tree . Each element in a level must be preceded by ├──, except the last element, which should be preceded by └── to produce the correct output:

path/to/folder/
├── a-first.html
├── b-second.html
├── subfolder
│   ├── readme.html
│   ├── code.cpp
│   └── code.h
└── z-last-file.html

This also often comes up when writing delimiter-separated lists (the .join() method helps with many cases, but only if your intent is to collect, and not yield items in a stream-like API, or when recursing - both of which are common, for example, when implementing Display or sending to some external consumer).

For example, when serialising a map-like data structure to a json-like format:

write!(&mut w, "{")?;
for (i, (key, value)) in map.iter().enumerate() {
    write!(&mut w, "\"{key}\":\"{value}\"")?;
    if i != map.len() - 1 {
        write!(&mut w, ",")?;
    }
}
write!(&mut w, "}")?;

When interactively reading emails:

let mut idx = 0;
while idx < emails.len() {
    show(&emails[idx]);
    if idx == emails.len() - 1 && prompt("End of emails. Do you wish to start over? y/N") {
        idx = 0;
    } else {
        idx += 1;
    }
}

When flattening out a binary tree in-order into a comma-separated list:

fn flatten(node: &Node, out: &mut String, cnt: &mut usize, total: usize) {
    if let Some(left) = node.left() {
        flatten(left, out, cnt, total);
    }
    
    out.push_str(&node.to_string());
    *cnt += 1;
    if *cnt != total - 1 {
        out.push(',');
    }
    
    if let Some(right) = node.right() {
        flatten(right, out, cnt, total);
    }
}

There are many other cases where some special treatment is needed for the last iteration.

Solution sketch

A possible solution to issue (1) is to just add [T]::is_last_index(i: usize) -> bool. This would enable code like the following:

for (i, elem) in elements.iter().enumerate() {
    do_something(elem);
    if !elements.is_last_index(i) {
        between_elements(elem);
    } else {
        finalize(elem);
    }
}

impl<T> [T] {
    // valid for i in 0..usize::MAX
    pub fn is_last_index(&self, i: usize) -> bool {
        i + 1 == self.len()
    }
}

While this would desugar to equivalent code, it's more descriptive, reads better and abstracts away the one subtraction, thus eliminating the possibility of an off-by-one mistake.

What about consuming/mutable iterators?

We wouldn't be able to use this in conjunction with some kinds of iterators (e.g. .into_iter()), because we wouldn't be able to access elements from within the loop. So we'd need to revert to the original code.

A possible way to get around that would be to also introduce a [T]::is_last_index_fn() -> impl Fn(usize) -> bool. The returned closure would outlive the original slice, so it could be used within the loop. Still requires an extra variable to hold this closure, but at least it avoids the manual index check.

Why slice?

Slice is probably the most abstract representation of a finite ordered collection. It already implements many convenience methods like is_ascii() and starts_with(), and spiritually related ones like join(), first() and last(). Arrays and Vecs can be used as slices.

Alternatives

I had a look at ExactSizedIterator, but I think using that wouldn't be very practical:

let mut it = [1,2,3].into_iter();
while let Some(elem) = it.next() {
    do_something(elem);
    if !it.is_ended() {
        between_elements(elem);
    }
}

Arguably this is already supported using .peekable() + .peek().is_none(), though that's even more verbose.

Relatedly, the itertools crate provides an API for special casing the first and last positions: Iterator::with_position(). Instead of pairing the elements with the index, like .enumerate(), it pairs it with an enum that indicates what position the element is in (first, middle, last or single). I think this is also a great solution, though less useful when the index is also needed explicitly (put another way, this is awkward to use simultaneously with .enumerate()). Another arguable disadvantage is that the enum contains 4 variants, and handling all of them may not be convenient.

For the sake of completeness, another possible solution would be to implement a special syntax to capture last-loop semantics. As one concrete example:

for elem in elements {
    do_something();
    when last {
        finalize(elem);
    } else {
        between_elements(elem);
    }
}

Under the hood, the compiler would inject an .enumerate() into the iterator, and replace the when last expression with an if condition. The advantage of this approach is that it avoids the need for the programmer to explicitly track the index just for the sake of special casing the last loop, helping keep the namespace cleaner and the code more concise.

It could also be extended to support a when first variation (and possibly negative variants too, like when not last).

I genuinely don't believe this is a good solution - it's nice from a UX perspective but would introduce a lot of compiler complexity for a small gain. Still, it's worth mentioning.

Links and related work

I haven't seen this in other languages. Closest relative I can think of is Python's negative indices, which allow you to avoid including the list length into the index calculation, so that one can do if elem == elements[-1], but that is not exactly equivalent, as it's testing the value, not the index.

Iterator::with_position() from the itertools crate is also worth mentioning as a different approach to solving the same problem.

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

asmello avatar Nov 19 '23 18:11 asmello

Having the conditional inside the loop is problematic for performance. It's better to split the last element off before the loop:

let most = &elements[..elements.len()-1];
let last = &elements[elements.len()-1];

for elem in most.iter() {
    do_something(elem);
}
do_something(last);
finalize(last);

Or save the last element of the loop:

let mut last = None;
for elem in elements.iter() {
    do_something(elem);
    last = Some(elem);
}
if let Some(elem) = last {
    finalize(elem);
}

Because the inner conditional doesn't necessarily optimize well, resulting in the check actually running for every iteration.

So it might be better to have a helper function akin to for_each instead:

fn for_each_final(mut self, for_each: FnMut(&Self::Item), for_last: FnOnce(Self::Item)) {
    let Some(first) = self.next() else { return };
    for_each(&first);
    let mut last = first;
    while let Some(elem) = self.next() {
        for_each(elem);
        last = elem;
    }
    for_last(last);
}

pitaj avatar Nov 19 '23 19:11 pitaj

Having the conditional inside the loop is problematic for performance.

That's very true, but how often does it actually matter? If you're iterating over something with known length, odds are it's a small vector or array, and performance is not an issue.

I think introducing something like for_each_final() might be a good idea, too. But I also don't think that if that existed people would stop doing conditionals inside of loops. So I think it's worth exploring what can be done to improve that case, too.

Also, consider one might want to use map(), filter()or reduce(). Would we add _final()variants for all of those, too?

let most = &elements[..elements.len()-1];
let last = &elements[elements.len()-1];

for elem in most.iter() {
    do_something(elem);
}
do_something(last);
finalize(last);

Or save the last element of the loop:

let mut last = None;
for elem in elements.iter() {
    do_something(elem);
    last = Some(elem);
}
if let Some(elem) = last {
    finalize(elem);
}

These are great ways to avoid the error-prone conditional, but the first one involves some amount of code duplication (not too bad if the logic is indeed in a separate function, but how often is that the case?), and the second one feels a bit over-engineered. I wouldn't expect anyone to naturally come up with that one, unless they were trying really hard to avoid indices. And it's arguably a bit inefficient too.

Anyway, if there's a good way to make the more efficient thing easier to write I'm all for it. But lacking that, making conditionals-in-loops less error-prone is the goal.

asmello avatar Nov 19 '23 20:11 asmello

I'm generally not a fan of anything like this

for (i, elem) in elements.iter().enumerate() {
    do_something(elem);
    if elements.is_last_index(i) {
        finalize(elem);
    }
}

because it can't be extended to iter_mut because of the &mut exclusivity.

Anything on the iterator I think might as well just be peekable, because "this is the last one" doesn't seem special enough compared to other checks on the next item to be worth something else.

That leaves only an adapter to add something to the element type. WithPosition feels like too many types for core, so probably wouldn't happen.

So what if you did this as .rev().enumerate().rev()? Then you can check for i == 0 for the last element.

Why slice?

TBH, I really feel like split_last is the answer on slices. If you need to share code, you can make a closure or inner function.

scottmcm avatar Nov 20 '23 00:11 scottmcm

I actually think the more compelling case for this is when you want to not do something for the last item on the iterator. String joining being a classic example:

let mut out = String::new();
for (i, e) in str_slice.iter().enumerate() {
    out.push_str(e);
    if i < str_slice.len() - 1 {
        out.push(' ');
    }
}

I think we can all agree that < str_slice.len() - 1 isn't exactly a nice conditional. It could be !=, which makes it a little nicer at the cost of also making spidey-senses tingle around i >= str_slice.len(). I think the suggestion based on the comments above would be to write:

let mut out = String::new();
let [most @ .., last] = str_slice;
for e in most {
    out.push_str(e);
    out.push(' ');
}
out.push_str(last);

Though that is a) non-obvious and likely not what anyone would write at first try if you asked them to do this, and b) not even complete since you need to handle the case when the slice is empty (the single element case is fine since most will just be [], though that isn't obvious either). The full version probably looks more like:

let mut out = String::new();
if let [most @ .., last] = str_slice {
    for e in most {
        out.push_str(e);
        out.push(' ');
    }
    out.push_str(last);
}

If we're not operating on a slice, but instead a general-purpose Iterator, then this becomes even worse (though I think the proposal also doesn't fix that).

It's not a whole lot shorter to use the proposed is_last_index, though I do think it's easier to read and understand:

let mut out = String::new();
for (i, e) in str_slice.iter().enumerate() {
    out.push_str(e);
    if !str_slice.is_last_index(i) {
        out.push(' ');
    }
}

Admittedly, the .iter().enumerate() bit is pretty unfortunate, though in practice that's what people already have in their code to do this in the majority of cases (I assume — at least that's true for me).

jonhoo avatar Nov 20 '23 08:11 jonhoo

I actually think the more compelling case for this is when you want to not do something for the last item on the iterator.

That's a good point, I even had this shown in two out of my three examples (even though I got the first one backwards...). I've updated the snippets to highlight it.

let mut out = String::new();
if let [most @ .., last] = str_slice {
    for e in most {
        out.push_str(e);
        out.push(' ');
    }
    out.push_str(last);
}

To me this reinforces how tricky doing the "right" thing can be for this - both pitaj and I missed the special case for an empty slice... With the conditional inside the loop that edge case doesn't exist.

asmello avatar Nov 20 '23 08:11 asmello

I'm generally not a fan of anything like this

for (i, elem) in elements.iter().enumerate() {
    do_something(elem);
    if elements.is_last_index(i) {
        finalize(elem);
    }
}

because it can't be extended to iter_mut because of the &mut exclusivity.

Yeah, I agree that's a limitation, though I wonder how much generalisation matters? We could do the closure thing I mentioned, too.

Anything on the iterator I think might as well just be peekable, because "this is the last one" doesn't seem special enough compared to other checks on the next item to be worth something else.

I kinda agree.

That leaves only an adapter to add something to the element type. WithPosition feels like too many types for core, so probably wouldn't happen.

I'm on the fence about WithPosition. It's elegant, but it does introduce many variants, which has the potential to be annoying to handle. I probably need to write some code with it to get a feel for whether I like it or not. But at the same time, I don't think we need the solution here to be either/or.

So what if you did this as .rev().enumerate().rev()? Then you can check for i == 0 for the last element.

My 2c on this one is that it's doing a bit too much work. Also, not sure how intuitive this approach is in practice?

Why slice?

TBH, I really feel like split_last is the answer on slices. If you need to share code, you can make a closure or inner function.

That's a fair argument, but like Jon pointed out, this can be a bit tricky to get right in its own way, because you have to special case empty slices.

asmello avatar Nov 20 '23 09:11 asmello

I actually think the more compelling case for this is when you want to not do something for the last item on the iterator. String joining being a classic example:

let mut out = String::new();
for (i, e) in str_slice.iter().enumerate() {
    out.push_str(e);
    if i < str_slice.len() - 1 {
        out.push(' ');
    }
}

I don't find string joining like this compelling, because the easiest fix for this is to rotate the loop slightly: skip the space on the first iteration instead of on the last iteration

let mut out = String::new();
for (i, e) in str_slice.iter().enumerate() {
    if i != 0 {
        out.push(' ');
    }
    out.push_str(e);
}

or more directly as

let mut out = String::new();
let mut first = true;
for e in str_slice {
    if first {
        first = false;
    } else {
        out.push(' ');
    }
    out.push_str(e);
}

And indeed, Itertools::join uses that "different the first time" approach rather than trying to detect the last item: https://docs.rs/itertools/0.12.0/src/itertools/lib.rs.html#2270-2288

EDIT: Oh, and [&str]::join also uses that approach: https://doc.rust-lang.org/1.74.0/src/alloc/str.rs.html#138-142

scottmcm avatar Nov 20 '23 19:11 scottmcm

I don't find string joining like this compelling, because the easiest fix for this is to rotate the loop slightly: skip the space on the first iteration instead of on the last iteration (...)

let mut out = String::new();
let mut first = true;
for e in str_slice {
    if first {
        first = false;
    } else {
        out.push(' ');
    }
    out.push_str(e);
}

Nice! I really like this refactor, the flag variable is a bit sad, but overall it's quite nice and concise, and I don't see much room for error.

But it also begs the question, is there anything we can do to nudge people to that style?

I should've checked the std implementation of join, but I'd never come across this rotation approach, and from the other comments I'll bet it's not too common, since you're the first to propose it.

EDIT: the std implementation has quite a bit more to it, as it's got an advanced combination of high generalisation and unsafe optimisation, so I'm not sure if it's a great reference for discussing general practices, but I'll still take the point.

asmello avatar Nov 20 '23 21:11 asmello

I want to hear more opinions still, but perhaps to proactively keep the discussion useful, I think there's a couple of fundamental things we should also keep in mind:

  • Do we agree on what the common pattern is?
  • If so, do we agree that it is error prone?

I think if we can agree on those two points, we can figure out together what the right solution should be. There are clearly many ways to approach writing this kind of code, each with its own trade-offs, but the important thing is not figuring out the right style, it's figuring out how to make Rust code better in average [or replace with whatever measure of commonality].

I don't think anyone disagrees the style that I opened with is NOT the best choice for performance, or readability. But is it worth offering programmers an easy-to-use and easy-to-remember path towards doing something better? And if so, how can we do that?

Anyway, I'm fascinated by all the suggestions and I think different approaches to writing this code may reveal a better dialect we should encourage, just want to make sure this doesn't become a styling discussion.

asmello avatar Nov 20 '23 21:11 asmello

So what if you did this as .rev().enumerate().rev()? Then you can check for i == 0 for the last element.

My 2c on this one is that it's doing a bit too much work. Also, not sure how intuitive this approach is in practice?

I'd expect it to be essentially the same amount of work as anything using is_last_index, and thus needing the .enumerate() anyway.

Enumerate is double-ended when the inner iterator is DoubleEnded+ExactSize. And .rev() is still lazy -- it just makes calling next call next_back and vice versa -- so .rev().enumerate().rev() is still going to end up using the same methods for forward iteration on the original iterator, but it'll count down n-1, n-2, …, 1, 0 for the enumerating as it does so.

It would also be possible to have a new .etaremune() that does this without needing the DoubleEndedIterator bound -- just an ExactSizeIterator bound -- though I don't know if the double-endedness would ever be a real problem for people. Slices are of course double-ended.

Basically, using .rev().enumerate().rev() to know when it's the second-last element should perform essentially the same as using .enumerate() to know when it's the second element.

scottmcm avatar Nov 21 '23 16:11 scottmcm

Enumerate is double-ended when the inner iterator is DoubleEnded+ExactSize. And .rev() is still lazy -- it just makes calling next call next_back and vice versa -- so .rev().enumerate().rev() is still going to end up using the same methods for forward iteration on the original iterator, but it'll count down n-1, n-2, …, 1, 0 for the enumerating as it does so.

That's awesome, I keep being surprised by how smartly these iterators are implemented. Thank you for correcting my mental model.

It would also be possible to have a new .etaremune() that does this without needing the DoubleEndedIterator bound -- just an ExactSizeIterator bound -- though I don't know if the double-endedness would ever be a real problem for people. Slices are of course double-ended.

I think there might be something to explore here, not so much because of the improved bounds, and not even because it's more legible, but because it's going to be more discoverable. Perhaps adding a .rev_enumerate()or similar would be worth the trouble?

asmello avatar Nov 21 '23 18:11 asmello