
`Result` that uses niches resulting in a final size of 16 bytes emits poor LLVM IR due to ABI

Open asquared31415 opened this issue 3 years ago • 21 comments

The following code:

pub fn bad() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

#[allow(improper_ctypes_definitions)]
pub extern "C" fn good() -> Result<u64, core::ptr::NonNull<()>> {
    Ok(0)
}

godbolt links: LLVM IR / ASM

Both functions should be identical, returning the 128-bit value in RAX and RDX per the sysv-64 ABI, which is the default for the environment godbolt uses.

Instead, when returning the value without specifying any ABI, the compiler chooses to return the value using an out pointer argument.

Note: This affects Result<usize, std::io::Error>, which is a particularly useful type, and thus affects a significant portion of std::io, such as Read and Write.

Meta

This has improved and regressed since 1.43.0, the first version I could find that uses 128 bits for the type. All versions tested used the expected code generation for the extern "C" function. All tests were done using https://godbolt.org with no target overrides, which in practice is an x86_64 Linux machine using the sysv-64 ABI.

  • ..= 1.42.0: niche not used, cannot be compared
  • 1.43.0 ..= 1.47.0: codegen like today, uses out ptr for return
  • 1.48.0 ..= 1.60.0: returns using an LLVM i128, which produces the most desirable code on sysv-64
  • 1.61.0 .. (current 2022-05-28 nightly): poor codegen that uses an out ptr for return

@rustbot label +regression-from-stable-to-stable +A-codegen

asquared31415 avatar May 30 '22 01:05 asquared31415

To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of extern "Rust".

That said, this likely applies to all 128-bit Result types, so it could have a non-negligible impact on overall performance.

thomcc avatar May 30 '22 01:05 thomcc

This is #26494 / #77434, reverted by #85265 / #94570

erikdesjardins avatar May 30 '22 02:05 erikdesjardins

This is interesting, because tuples and structs seem to have an exception for two usize-sized elements specifically. (u64, u64) and (u64, *mut ()) return by value, while (u32, u32, u32, u32), (u64, u32), and [u32; 4] (and all arrays) return via a pointer. Additionally, structs containing exactly two usize values are returned by value (both with and without #[repr(C)]).
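
For reference, a set of probe functions matching the cases just listed (an illustrative sketch for inspecting codegen on godbolt; the by-value/out-ptr comments restate the observations above):

pub fn two_u64() -> (u64, u64) {
    (0, 0) // returned by value, per the above
}

pub fn int_ptr() -> (u64, *mut ()) {
    (0, core::ptr::null_mut()) // returned by value, per the above
}

pub fn four_u32() -> (u32, u32, u32, u32) {
    (0, 0, 0, 0) // returned via out ptr, per the above
}

pub fn arr() -> [u32; 4] {
    [0; 4] // returned via out ptr (like all arrays), per the above
}

pub struct TwoUsize { pub a: usize, pub b: usize }

pub fn two_usize() -> TwoUsize {
    TwoUsize { a: 0, b: 0 } // returned by value, with or without #[repr(C)]
}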

The Result niche usage is specifically inhibiting this optimization, since otherwise the data returned would look like a (u64, *mut ()) in the Result<u64, core::ptr::NonNull<()>> or std::io::Result<usize> cases.

asquared31415 avatar May 30 '22 03:05 asquared31415

> To be clear, it's debatable if this is a regression (if it is, it's just a performance issue), since we make no guarantee about the ABI of extern "Rust".

Performance regression is a regression. And I think this is a very bad one.

If I replace the Ok(0) with Ok(1), it even starts loading from memory.

I don't think this is related to the niche, though; Result<u64, core::num::NonZeroUsize> still has the good old codegen. Somehow Result<u64, core::ptr::NonNull<()>> ceased to be passed as a scalar pair, while Result<u64, core::num::NonZeroUsize> still is.

nbdd0121 avatar May 30 '22 14:05 nbdd0121

Ah, NonZeroUsize and NonZeroU64 don't reproduce it, but NonZeroU{32, 16, 8} do return via an out ptr.
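
For reference, a side-by-side sketch of the cases just described (same shape as the original repro):

use core::num::{NonZeroU32, NonZeroU64, NonZeroUsize};
use core::ptr::NonNull;

// Reported good codegen: the error primitive matches u64, so the value
// is returned in registers.
pub fn nz_usize() -> Result<u64, NonZeroUsize> { Ok(0) }
pub fn nz_u64() -> Result<u64, NonZeroU64> { Ok(0) }

// Reported bad codegen: a smaller integer (or, below, a pointer), so
// the value is returned via an out ptr.
pub fn nz_u32() -> Result<u64, NonZeroU32> { Ok(0) }
pub fn non_null() -> Result<u64, NonNull<()>> { Ok(0) }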

asquared31415 avatar May 30 '22 16:05 asquared31415

Okay, I located the reason. Abi::ScalarPair is only produced if all variants' data parts are scalars with the same abi::Primitive. NonZeroUsize and NonZeroU64 are both Primitive::Int(Integer::I64, false), which matches that of u64, so the type is treated as a scalar pair. However, NonZeroU{32,16,8} have a different integer size, and NonNull has Primitive::Pointer, which never matches any integer, so ScalarPair is not produced and the type is treated as an Aggregate.
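
A rough pseudocode rendering of that rule (hypothetical names, not the actual rustc source):

// Sketch of the layout rule described above (names are made up).
#[derive(Clone, Copy, PartialEq)]
enum Primitive {
    Int { size_bits: u16, signed: bool }, // u64 -> Int { size_bits: 64, signed: false }
    Pointer,                              // never equal to any Int
}

// ScalarPair survives only if every variant's payload is a scalar whose
// primitive equals the common one; otherwise the layout degrades to an
// Aggregate.
fn keeps_scalar_pair(common: Primitive, payloads: &[Option<Primitive>]) -> bool {
    payloads.iter().all(|p| p.map_or(true, |prim| prim == common))
}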

Before #94570, something this small was passed in registers, so it was optimized to be the same as if it were passed as a scalar pair -- but it wasn't one; it was actually just a small aggregate passed directly. #94570 forces indirect passing, causing the perf regression.

Given that the motivation of #94570 was to work around LLVM's inability to autovectorize things passed in registers, I think a fix would be to use some heuristics to determine whether a 2-usize type can be a target of autovectorization -- if so, pass it indirectly; otherwise, pass it by value.

nbdd0121 avatar May 30 '22 17:05 nbdd0121

@nbdd0121 Look at https://github.com/rust-lang/rust/pull/93564, which was a more targeted fix. Maybe it could be reopened?

EDIT: Looking into it right now.

Urgau avatar May 30 '22 17:05 Urgau

Thanks for the pointer. That PR looks like a better approach to me. I think the compile-time perf regression in that PR results from excessive homogeneous_aggregate calls when the type is smaller than or equal to pointer size (which includes almost all primitives!). I suspect that if you change the nesting of the ifs, so that it's only called on types with a size between 1 and 2 usizes, the perf regression should be mitigated.
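
Something along these lines, perhaps (a pseudocode sketch with made-up names, not the actual PR code):

// Sketch of the suggested restructuring: only types strictly larger
// than one pointer and at most two pointers wide ever pay for the
// homogeneous_aggregate check.
enum Mode { ByValue, Indirect }

fn return_mode(size: u64, ptr_size: u64, is_homogeneous_aggregate: impl Fn() -> bool) -> Mode {
    if size <= ptr_size {
        Mode::ByValue // fits in one register; skip the expensive check
    } else if size <= 2 * ptr_size && !is_homogeneous_aggregate() {
        Mode::ByValue // the 2-usize case: keep it in a register pair
    } else {
        Mode::Indirect // larger, or vectorization-friendly: use sret
    }
}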

nbdd0121 avatar May 30 '22 18:05 nbdd0121

@nbdd0121 I've reopened https://github.com/rust-lang/rust/pull/93564 as https://github.com/rust-lang/rust/pull/97559. I've confirmed that the PR fixes this regression, and hopefully the PR's own compile-time perf regression is also fixed.

Urgau avatar May 30 '22 19:05 Urgau

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high +T-compiler

apiraino avatar May 31 '22 13:05 apiraino

This issue only dubiously shows a codegen regression (it is Not Great, to be clear, but from this example and my knowledge of how common x86 microarchitectures work, it's not clearly Actively Awful). Even if it is one, the example does not show a performance impact in practical cases, as it is still very few instructions followed by a simple ret. This needs a benchmark showing what we are actually trying to improve in terms of runtime, or a comparison across multiple values rather than just the zero case, or even just something showing that the impact is more significant than a xorps and a mov where a xor would do.

workingjubilee avatar Jul 01 '22 05:07 workingjubilee

An alternative would be experimenting more with this in the context of other code we expect it to live in, because LLVM eagerly inlining can make such problems go away, as cited here: https://github.com/rust-lang/rust/issues/91447#issuecomment-1059840995

Small functions like this, in particular, will tend to be inlined by any backend, precisely because Rust functions are not made external by default. Thus they do not have to respect the System V ABI, or any ABI for that matter, and they can make things up as they go along; constant folding becomes quite practical. So, many functions show seemingly terrible codegen in isolation that doesn't actually come up as a problem in practice. This matters most when we're mostly touching how they do parameter passing, which is precisely what inlining can make disappear, and is why I bring it up.

workingjubilee avatar Jul 01 '22 05:07 workingjubilee

BTW, in our project, we also noticed performance degradation because of this regression (small 16B Results returned via the stack instead of registers), so I vote for a quick fix. Ideally, all return values of up to 16B size should be returned in registers (at least on x64 and arm64 and at least for integer/pointer values, not sure about floats and the like). Thanks.

schreter avatar Jul 07 '22 09:07 schreter

> BTW, in our project, we also noticed performance degradation because of this regression (small 16B Results returned via the stack instead of registers), so I vote for a quick fix. Ideally, all return values of up to 16B size should be returned in registers (at least on x64 and arm64 and at least for integer/pointer values, not sure about floats and the like). Thanks.

I would sincerely appreciate it if this was quantified in some manner or if you could give a more complex example. It's not that I don't think you are correct, it's just that unfortunately the quickest fix is to simply revert the changes that regressed this outright. But reverting the changes will simply regress the performance in other areas, and the losses there were quite significant.

That's not acceptable, but this can still be fixed. We need to find a different solution, and more benchmarks, or even a nuanced array of codegen examples, mean we are less likely to perform an overly sophomoric "fix" that makes things even worse for real code at the expense of improving academic cases. It will be fixed, but not with the quickest fix, except in the sense of making things faster.

Honestly, the question I have is why it isn't e.g. returned in a single xmm register instead. It doesn't make sense to me that LLVM would drop a value that can clearly already fit in a register on the stack, ever.

workingjubilee avatar Jul 11 '22 17:07 workingjubilee

It's rustc that makes the return value indirect, not LLVM.

nbdd0121 avatar Jul 11 '22 18:07 nbdd0121

> Honestly, the question I have is why it isn't e.g. returned in a single xmm register instead. It doesn't make sense to me that LLVM would drop a value that can clearly already fit in a register on the stack, ever.

The ABI is decided by rustc here, not LLVM. If rustc tells LLVM to use an sret argument, then it must do so.

Returning an i64 pair in an XMM register is also a bad idea because it requires GPR to XMM moves, even if we ignore issues related to the target-feature dependence of XMM registers.

nikic avatar Jul 11 '22 18:07 nikic

> I would sincerely appreciate it if this was quantified in some manner or if you could give a more complex example.

@workingjubilee Yes, we need to quantify this, but it's not that trivial. Our project is fairly big, and the benchmark directly or indirectly touches 200K+ LOC of our code and likely millions of LOC from other crates, so it's not easy to isolate. I'd estimate the average performance loss at 5% across the board from 1.60 to 1.62.

We do use quite a few functions returning Results in tight loops (fallible value matchers, pushing results to bitmap sinks). I can try to create a manageable repro case with this, if it would help.

schreter avatar Jul 11 '22 19:07 schreter

> We do use quite a few functions returning Results in tight loops (fallible value matchers, pushing results to bitmap sinks). I can try to create a manageable repro case with this, if it would help.

Yes, a smaller loop like the ones you use would be excellent, if it can show even a somewhat similar regression (it may need generous application of std::hint::black_box).
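
Something like this might serve as a starting point (a hypothetical micro-benchmark sketch; #[inline(never)] stands in for a function too large to inline, and the black_box placement may need tuning for the reasons discussed later in this thread):

use core::ptr::NonNull;
use std::hint::black_box;
use std::time::Instant;

// Hypothetical fallible "matcher", kept out of line so the return ABI
// actually matters.
#[inline(never)]
fn step(x: u64) -> Result<u64, NonNull<()>> {
    if x == u64::MAX {
        Err(NonNull::dangling())
    } else {
        Ok(x.wrapping_mul(3) ^ 1)
    }
}

fn main() {
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..100_000_000u64 {
        match step(black_box(i)) {
            Ok(v) => acc = acc.wrapping_add(v),
            Err(_) => break,
        }
    }
    println!("acc={acc} elapsed={:?}", start.elapsed());
}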

> The ABI is decided by rustc here, not LLVM. If rustc tells LLVM to use an sret argument, then it must do so.

Hm. Honestly, I guess that after the amount of noodling around I have done with LLVMIR and the way the LangRef phrases byval's documentation, implying it is only a polite suggestion, I didn't fully grok that LLVM really interprets sret as a true unconditional directive.

workingjubilee avatar Jul 18 '22 06:07 workingjubilee

@workingjubilee I've tried to implement something with black_box; unfortunately, Rust happily optimizes it out, so I have to find a different approach (I didn't have much time yet, sorry).

What we also see, and what is probably an at least equally big problem, is the handling of larger Futures, where a lot of unnecessary memory copying is in play. After doing some Result-based optimizations, these memcpy() calls are the hottest function in the profile. What happens:

  • Calling an async function in fact calls a stub that produces a Future (which is in fact a generator, later polled). This Future is stored in the parent async function's "stack", which is simply an aggregated Future of the parent function and the called function (with some layout optimizations).
  • Unfortunately, instead of telling the Future generator to materialize the Future directly into the proper place, the Future is materialized on the regular stack and then copied from the stack into the caller's Future.
  • Now, we have a loop calling a request handler which supports various request types (dispatched by a match statement), one of which produces a large Future. The call to the request handler materializes that Future by writing a few words to the stack, and then this (practically empty) space is copied in full from the stack into the parent Future.

This wastes cycles like there is no tomorrow - instead of materializing the callee future on the stack, async functions should materialize the callee future directly in place in the caller's future. That would save the copying (and, especially, the copying of still-uninitialized space!).
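
A minimal shape of the pattern being described (hypothetical sizes; the large array only exists to inflate the generator state):

// `inner`'s future is ~4 KiB because `buf` is held across an await.
// The claim above is that it gets built on the regular stack and then
// memcpy'd into `outer`'s state rather than materialized there directly.
async fn tick() {} // stand-in suspension point

async fn inner() -> usize {
    let buf = [0u8; 4096]; // held across the await, so part of the future's state
    tick().await;
    buf.len()
}

async fn outer() -> usize {
    inner().await // inner's large future is embedded in outer's state
}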

I'm wondering if that has something to do with the Result issue as well or not. In any case, I couldn't find a specific issue for this problem in rust-lang issues. Are you aware of this problem already or not?

I'm definitely no expert in compiler construction, but if I understand it correctly, the LLVM sret attribute takes a location at which to materialize the return value. So it should be possible to specify a location inside the caller's future instead of on the stack (and thus greatly optimize async function startup). If not, there is possibly a different LLVM feature that could be used instead (or the ABI could be modeled differently, by passing a hidden parameter telling the callee where to materialize the generator).

schreter avatar Jul 19 '22 13:07 schreter

Note that this also happens for Result<SomePointer, SomeInteger>. I saw a suggestion that single-pointer error types (std::io::Error, anyhow::Error, ThinBox<dyn Error>, ...) could work around this bug by storing their pointers in a NonZeroUsize when not running under Miri. Ignoring the fact that this is gross, it would just move the problem: you get the bad codegen in cases where the Ok payload is a pointer/ref/etc. instead (stuff like Result<&T, anyhow::Error>). So, just so nobody else is confused, this does not seem to be a case where you can abandon your morals in the name of performance to work around it; you'll just hit it in different cases.
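
For concreteness, a sketch of the discouraged workaround being described (illustrative only; as noted, it merely shifts which Results hit the slow path):

use core::num::NonZeroUsize;
use core::ptr::NonNull;

// Smuggle the error pointer through NonZeroUsize so that both Result
// payloads are integer primitives and the scalar-pair ABI applies.
// Gross, hostile to Miri's provenance tracking, and it only moves the
// problem to pointer-like Ok payloads.
pub struct PtrError(NonZeroUsize);

impl PtrError {
    pub fn new(p: NonNull<()>) -> Self {
        // NonNull guarantees a non-zero address, so this cannot fail.
        PtrError(NonZeroUsize::new(p.as_ptr() as usize).unwrap())
    }

    pub fn get(&self) -> NonNull<()> {
        NonNull::new(self.0.get() as *mut ()).unwrap()
    }
}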


Anyway, I do think it's kinda bad[^small]. Most Rust code is very Result-heavy. I think quantifying this will be very hard, though -- for example, Rust code often looks like this: https://godbolt.org/z/ETKfod3r5. The ptrs code is clearly worse than the ints code -- those reg/reg movs are free (on most modern-ish machines), but the stack ops are reads/writes[^4] to memory. Sadly, I'm not sure this is something that will be easy to measure...

[^4]: Well, the writes are implied, since they'd be in the callee.
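
The linked godbolt isn't reproduced here, but a hypothetical pair in the same spirit:

use core::num::NonZeroUsize;
use core::ptr::NonNull;

#[inline(never)]
fn step_int(x: usize) -> Result<usize, NonZeroUsize> {
    NonZeroUsize::new(x)
        .map(|nz| nz.get().wrapping_mul(2))
        .ok_or_else(|| NonZeroUsize::new(1).unwrap())
}

#[inline(never)]
fn step_ptr(x: usize) -> Result<usize, NonNull<()>> {
    if x == 0 { Err(NonNull::dangling()) } else { Ok(x.wrapping_mul(2)) }
}

// Scalar-pair return: propagating the error is reg/reg moves.
pub fn ints(x: usize) -> Result<usize, NonZeroUsize> {
    Ok(step_int(x)? + 1)
}

// Indirect return: propagating the error goes through a stack slot.
pub fn ptrs(x: usize) -> Result<usize, NonNull<()>> {
    Ok(step_ptr(x)? + 1)
}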

Notably, black_box in particular would be a problem if used too aggressively -- it currently basically forces stack spilling of the argument (see #99899), so you'd need to use it in such a way that this overhead didn't overshadow the ABI wonkiness. Constructing a benchmark out of something like the linked godbolt, but with more cases, might be an okay thing to measure, with some caveats like measuring min time rather than avg (but I'd suspect it'd still kinda suck, for all the normal reasons benchmarks suck[^1]).

For codegen issues, something that models the CPU resources like llvm-mca may be the best bet, but it tends to be terrible at things with calls (so, as always, real measurements > artificial benchmarks). For this case, because they both have basically equivalent calls it might be fine, and MCA does seem to suggest that the ints code will perform decidedly better than ptrs, at least on the CPU it's simulating. (But yeah, don't read into how much less, due to the calls)

[^small]: Well, I'd expect this to be pretty small in terms of absolute impact, it just may be a small overhead to a fairly large amount of code.

[^1]: TBH, I've all but given up[^3] on artificial benchmarks at this point unless carefully constructed. Processors are getting very smart at making running the same code in a loop repeatedly (the way you do in a benchmark) an inaccurate way to measure time that that code will normally take to execute.

[^3]: Which is to say I still use them, they just make me sad.

thomcc avatar Aug 02 '22 17:08 thomcc

Hmm. It kinda sucks that we don't use higher-level information from MIR or thereabouts to annotate types with "wants vectorization" and "doesn't want vectorization". MIR is about the point where we know which cases of (isize, isize) would prefer to go into a vector register and which would be just as happy being passed in two general registers.

workingjubilee avatar Aug 06 '22 07:08 workingjubilee