`slice::Iter` + `max` on array: Regression since Rust 1.65 (LLVM 15)
Code
I tried this code:
```rust
pub fn fna(x: [u8; 1024]) -> u8 {
    x.into_iter().max().unwrap()
}

pub fn fnb(x: [u8; 1024]) -> u8 {
    *x.iter().max().unwrap()
}
```
I expected to see this happen:
- Both functions should optimize away the `unwrap`
- (Bonus) Both functions should inline the respective iterator functions and auto-vectorize
Instead, this happened:
- `fna` has no `unwrap` in the assembly output, but `fnb` has.
- (Bonus) Both inline their respective iterators (according to the MIR output on the playground). `fna` auto-vectorizes, but `fnb` doesn't.
`fnb` generates a byte-wise max implementation. Interestingly, it keeps some information about the array length around, which influences the codegen. A byte-wise max implementation for `[u8; 1024]` needs to do 1023 comparisons (since the first value of the array serves as the initial max result value), and 1023 is divisible by 3, so the compiler emits a loop that runs 341 times, doing 3 max computations per iteration. If I change the length of the array to 1025, it does 4 max computations per iteration.
Even though the generated code relies on the number of array entries, it still doesn't elide the `Option::unwrap` check for the max result. In Rust 1.45-1.64, `fnb` also generated the 3-bytes-at-a-time loop, but did remove the `Option::unwrap`.
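For illustration, the scalar reduction described above can be sketched by hand (`fnb_scalar` is my name for it; this is not the compiler's actual output, just a sketch of the 1023-comparison structure):

```rust
// Hand-written sketch of the byte-wise reduction `fnb` lowers to:
// take x[0] as the initial max, then do 1023 comparisons over the
// remaining bytes. (Illustrative only, not the actual codegen.)
pub fn fnb_scalar(x: &[u8; 1024]) -> u8 {
    let mut max = x[0];
    for &b in &x[1..] {
        if b > max {
            max = b;
        }
    }
    max
}
```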
Version it worked on
It most recently worked on: Rust 1.64. (For context: Rust 1.65 added LLVM 15.)
It (initially?) started working in 1.45 (that version added LLVM 10).
Version with regression
Any version since Rust 1.65.
Since I thought this behavior was unexpected, I polled it on Mastodon before looking further into it. At the time of writing, about half of the respondents thought, like me, that both should be optimized (at least the `unwrap` part):
`fnb` is a by-reference iterator. If you use `.copied().max()` it vectorizes. See #106539
@the8472 ah, that explains the vectorization difference (bonus). just to clarify - that one is not the regression, just another thing I've noticed ^^
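For anyone hitting this in the meantime, the `.copied()` workaround mentioned above looks like this (a sketch; `fnb_copied` is my naming):

```rust
// Workaround sketch: iterating by value via `.copied()` lets LLVM
// vectorize the reduction, unlike the by-reference `x.iter().max()`.
pub fn fnb_copied(x: &[u8; 1024]) -> u8 {
    x.iter().copied().max().unwrap()
}
```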
The slice by-reference iterator fully unrolls and optimizes out the `unwrap` for len <= 10: https://rust.godbolt.org/z/1cT4z91eP (at length 11 LLVM still unrolls but can't figure out the `unwrap`)
That's not a power of 2 so I feel like the distinction between 10 and 11 is a magic number in some LLVM optimization.
It looks like the 10 -> 11 difference is in InstCombine: https://rust.godbolt.org/z/hKof6449W Looks like we're getting some kind of special treatment until the loop exceeds 10 iterations.
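A minimal pair for the 10-vs-11 boundary discussed above might look like this (function names are mine):

```rust
// At length 10 the by-reference iterator fully unrolls and the
// unwrap check disappears; at length 11 it still unrolls but the
// check remains (on the affected compiler versions).
pub fn max10(x: &[u8; 10]) -> u8 {
    *x.iter().max().unwrap()
}

pub fn max11(x: &[u8; 11]) -> u8 {
    *x.iter().max().unwrap()
}
```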
@saethlin Yeah, the problem there is that we need to look through a very deep sequence of selects to determine that the final pointer is actually guaranteed to be non-null. This is going to hit a cutoff sooner or later.
I think we could support this without cutoffs inside SCCP, by starting from a NotConstant(Null) value lattice root for the nonnull argument and then propagating that across GEP inbounds. I'll try that when I have a chance.
https://github.com/llvm/llvm-project/pull/106090 should fix both the unrolled and not unrolled cases.
Looks like with the LLVM bump this is fixed in nightly 🎉
Seems like a plausible thing to have a test for, since we should be able to write a fairly focused one:
- Confirm that doing this on `&[u8]` does have the `call{{.+}}unwrap_failed`.
- Confirm that doing this on `&[u8; 1024]` does not have it.
(Notably: don't check for "no calls" or similar. There are loads of calls in the LLVM IR, like, unsurprisingly, to `llvm.umax`.)
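A rough sketch of what such a codegen test could look like (the exact FileCheck spelling and test-header details are guesses on my part; a real test in `tests/codegen` would also mark the functions `#[no_mangle]` so FileCheck can find the symbols):

```rust
//@ compile-flags: -Copt-level=3

// For a slice of unknown length, the panic path should still be present.
// CHECK-LABEL: @slice_max
// CHECK: call{{.+}}unwrap_failed
pub fn slice_max(x: &[u8]) -> u8 {
    *x.iter().max().unwrap()
}

// For a fixed-size array, the unwrap check should be optimized out.
// CHECK-NOT applies from here to the end of the function body.
// CHECK-LABEL: @array_max
// CHECK-NOT: call{{.+}}unwrap_failed
pub fn array_max(x: &[u8; 1024]) -> u8 {
    *x.iter().max().unwrap()
}
```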