Not using the byval attribute loses memcpy optimizations
Consider this code (Godbolt link):
```rust
pub struct S([i32; 16]);

#[inline(never)]
fn g(x: S) {
    println!("{}", x.0[3]);
}

#[inline(never)]
fn f(x: S) {
    g(x)
}

pub fn main() {
    f(S([0; 16]))
}
```
LLVM can't eliminate the memcpy between f and g, even though it should legally be able to do so. This is because memcpyopt can only forward to parameters marked byval. We don't use the byval attribute, and this seems to be by design. But we lose this optimization, which I've observed hurting codegen in several places even in hello world (core::fmt::Write::write_fmt, core::panicking::assert_failed(), <core::fmt::Arguments as core::fmt::Display>::fmt). I suspect losing this optimization hurts us all over the place.
There are two solutions I can see:
(1) Use byval for indirect arguments in the Rust ABI.
(2) Change LLVM to allow the optimization to happen for at least nocapture noalias readonly parameters. Since nocapture implies that the behavior of the callee doesn't depend on the address and noalias readonly implies that the memory is strongly immutable, this should work. We mark all indirect by-value arguments as nocapture noalias already.
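For concreteness, "indirect by-value argument" here means a type too large to pass in registers, which rustc lowers to a hidden pointer. A minimal sketch (the names `Big` and `consume` are mine, not from the thread):

```rust
// A type large enough that the Rust ABI passes it indirectly, i.e. the
// caller materializes it in memory and passes a pointer.
pub struct Big([u64; 16]);

// At the LLVM level this lowers to roughly `consume(ptr)`, and rustc
// already marks that pointer nocapture + noalias. Option (2) would
// additionally mark it readonly when the callee never writes through it,
// so memcpyopt could forward the caller's copy.
#[inline(never)]
pub fn consume(x: Big) -> u64 {
    x.0[0] + x.0[15]
}

fn main() {
    let b = Big([3; 16]);
    println!("{}", consume(b)); // prints "6"
}
```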
I'm working on a patch for (2), but I was wondering why we can't just do (1).
Note that if we choose option (2) then there will need to be a little more work to tell LLVM that an indirectly-passed function argument was marked immutable by the programmer.
Actually, (2) is more complicated than I thought. There are actually two sub-options:
(2a) At the Rust ABI level, specify that callees are not allowed to write to indirectly-passed arguments. Currently, they can if they declare a by-value parameter mut (or its type contains an UnsafeCell). With this change, a callee that writes to a by-value parameter would have to codegen a memcpy into a temporary.
(2b) Keep the ABI the same as it is today. In that case we would make LLVM responsible for adding readonly attributes itself. Generally this can only be done when building with LTO.
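The defensive copy that (2a) would require of callees can be sketched in source form (the function `takes_mut` is a hypothetical illustration, not code from the thread):

```rust
pub struct S([i32; 16]);

fn takes_mut(x: S) -> i32 {
    // Under (2a), a callee that mutates a by-value parameter would
    // codegen the equivalent of this explicit local copy, instead of
    // writing through the incoming pointer as `fn takes_mut(mut x: S)`
    // may do today:
    let mut x = x; // defensive copy inserted on the callee side
    x.0[0] = 42;
    x.0[0]
}

fn main() {
    println!("{}", takes_mut(S([0; 16]))); // prints "42"
}
```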
Here's the LLVM change. Coupled with the suggested change in Rust ABI this should let us eliminate quite a few memcpys. https://github.com/pcwalton/llvm-project/commit/854b7c4a942e82b99f5f8fba47edd3ed63d258cb
Also, I tried out (2b) and the results were disappointing: LLVM can't even tell that println is readonly in fat LTO. I think if we want to get rid of these memcpys (and we probably do, as they're a regression over C++) then we should modify the Rust ABI either to use byval or to forbid callees writing to indirectly-passed args.
> We don't use the byval attribute, and this seems to be by design.
I'm curious about why; any chance you have a link handy to a discussion about it?
> I'm curious about why; any chance you have a link handy to a discussion about it?
I don't, but a comment from @bjorn3 seemed to indicate to me that it was deliberate.
byval means that the argument is passed at a specific stack location. This requires a copy to that stack location and is as such incompatible with unsized arguments. To solve the issue of these redundant copies, I have long been advocating using move arguments (which pass the original location in the by-ref case) instead of copy arguments (which copy the argument to the stack and then pass the copy's location in the by-ref case) during MIR building where possible, but @RalfJung would like to instead remove move arguments entirely and handle unsized arguments some other way.
@RalfJung Can we get some forward progress on what to do here? These copies are hurting codegen across the board. I don't think the status quo is going to be tenable long-term; the pressure for a solution will be too great.
If I understand correctly, my proposal would be for all arguments to continue to be by-copy, but the copy moves from the caller side to the callee side and as such can be optimized away if known to be immutable. (Callers can't know this because Rust function signatures don't specify mutability of by-value arguments, only definitions do.)
Your proposal of using byval would still do the copy at call site if the argument couldn't be constructed in place at the right stack location. For example when there are multiple calls with different byval arguments.
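The multiple-calls case described above can be sketched as follows (a hypothetical illustration; the names are mine):

```rust
pub struct S([i32; 16]);

#[inline(never)]
fn g(x: S) -> i32 {
    x.0[3]
}

pub fn main() {
    let a = S([1; 16]);
    let b = S([2; 16]);
    // With byval, each call needs its argument materialized in a
    // specific outgoing stack slot. `a` and `b` can't both be
    // constructed in place in that slot, so at least one of these calls
    // still requires a copy.
    println!("{}", g(a) + g(b)); // prints "3"
}
```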
Your proposal of making arguments immutable is incompatible with unsized arguments: those can be mutated in place, and since their size isn't statically known, the callee has no way to make the defensive copy on its side.
So per the unsized arguments RFC, unsized arguments are passed as &move. This implies that we're already using a different calling convention for unsized arguments. So my immutable-arguments proposal could be a replacement for the cases in which we're using by-copy; the ABI for unsized arguments would remain the same as today.
by-ref sized arguments are currently also passed &move. We just insert the extra copy at MIR building time as a precaution, due to the unclear semantics of move arguments. Note that I think we need both to be ABI-compatible for trait object dispatch, which is already stable through Box<dyn FnOnce()>.
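The Box<dyn FnOnce()> case mentioned above, where the sized and unsized by-value paths must agree at the ABI level, looks like this (a minimal sketch):

```rust
// `Box<dyn FnOnce() -> i32>` consumes the closure by value through a
// vtable call: the callee receives an unsized `dyn FnOnce` by value,
// which only works because sized and unsized by-value arguments share an
// ABI-compatible, pointer-based passing convention.
fn call_boxed(f: Box<dyn FnOnce() -> i32>) -> i32 {
    f()
}

fn main() {
    let x = 40;
    let f: Box<dyn FnOnce() -> i32> = Box::new(move || x + 2);
    println!("{}", call_boxed(f)); // prints "42"
}
```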
What a mess. Maybe we could just mark by-ref arguments as readonly in the callee if they're declared immutable and freeze, and then rely on LTO to propagate that attribute to callers. Then memcpyopt can detect the readonly nocapture noalias combo and forward the memcpy.
The right solution here for the unsized case is to stop representing &move arguments in MIR with an Operand::Move. Operand::Move is also very clear about the fact that it makes a copy. This would require someone to go and actually come up with a design for a decent story for unsized locals in MIR though...
I still think it'd make sense to mark semantically-by-value but passed-by-ref-at-the-ABI-level immutable freeze arguments as readonly, since we know that, and LLVM can use it for LTO more generally than just memcpyopt.
@JakobDegen We can easily use a separate type for call arguments to solve the problem of Operand::Move being specified as doing a copy, but that won't solve this perf issue. The problem is not that we use Operand::Move too often, but that we use it too little, thus causing many needless copies.
I wonder what the best way to propagate deduced readonly around across crates would be. Either it would have to be a per-parameter bit in the Rust metadata, or it'd have to be added to the ThinLTO summaries.
I verified that with my change to LLVM, plus the soundness fix suggested by @nikic on Zulip, it can do the optimization for the cases I'm looking at, as long as Rust codegen can either inspect the MIR of direct callees for readonly safety or pull that information from the crate metadata. I think we have a way forward for most cases.
It would still be better to not generate the copy at the MIR level, but at least this will fix the worst offenders in the generated code.
I'm sorry, I am not familiar with LLVM byval. What new requirements do we need to impose on the MIR level to justify this attribute?
From the LangRef it sounds like using byval for Operand::Copy arguments is a no-brainer, and what is correct for Operand::Copy is certainly also correct for Operand::Move. When we use pointers to implement functions that, on the MIR level, perform a copy, why would byval ever be wrong?
I was pointed to this GH issue by @lqd; I don't think opening a separate issue is warranted since I believe the codegen quality problem I observed is this memcpy issue. So maybe this is useful as a test case for your work, @pcwalton: https://godbolt.org/z/bTn7bP574
With the "move-through" pattern it seems very common to do sparse updates with mut self like in with_size1, so it's surprising and not ideal that it produces worse code than with_size2. You can see that rather than writing the size once to the width/height fields, it does a full store/load/store and the last load/store pair comes from the memcpy intrinsic. (In this and similar examples inlining can usually save the day, but that could be said for most cross-function issues.)
It would be even better if this kind of move-through pattern could be lowered to the copy-free code that just does
```
mov [rdi], rsi
mov [rdi+8], rsi
ret
```
so that codegen would be no different from a &mut-based implementation. But I'm guessing there are good reasons that can't be done (and in any case is a separate issue).
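The move-through pattern discussed above looks roughly like this (a hedged reconstruction of the Godbolt example: the method name with_size1 is from the comment, but the struct name, field set, and padding are assumptions of mine):

```rust
pub struct Size2D {
    pub width: u64,
    pub height: u64,
    // Extra payload so the type is passed indirectly (by pointer) at
    // the ABI level; the real example's fields are unknown.
    pub other: [u64; 6],
}

impl Size2D {
    // "Move-through" sparse update: takes `mut self` by value and
    // returns it. Semantically only two fields change, but without
    // inlining, codegen currently emits a full store/load/store with a
    // memcpy rather than two field stores.
    #[inline(never)]
    pub fn with_size1(mut self, size: u64) -> Size2D {
        self.width = size;
        self.height = size;
        self
    }
}

fn main() {
    let r = Size2D { width: 0, height: 0, other: [0; 6] }.with_size1(7);
    println!("{} {}", r.width, r.height); // prints "7 7"
}
```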