compiler-builtins
compiler-builtins copied to clipboard
memcpy and co. are really unoptimized
I've seen the assembly of the memcpy here for both powerpc and wasm (where the one here seems to be used unless you use the WASI target) and in both cases really simplistic byte wise loops are emitted. I haven't really done any benchmarks but these likely perform worse than more optimized loops that work on 32-bit or so at a time.
Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs
Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs
I would caution against using Redox's (current) memcpy
implementation as it relies on undefined behavior (doing an unaligned load on a usize
). See: https://gitlab.redox-os.org/redox-os/kernel/issues/99. This is UB on all platforms, even those that allow for unaligned loads, and will immediately fault on platforms that ban unaligned loads (like ARM).
If you need an optimized implementation, you should probably just link against a libc
that provides them. Example of how to do this for musl
on bare-metal x86_64
: https://github.com/rust-osdev/cargo-xbuild/issues/77#issuecomment-629945246
May I ask that is there a way to specify other memcpy
implementation? I'm on ARM platform, and the memcpy
is not very efficient for my use case as I have many [u8]
slices that have to be copied around.
I've tried to modify the target triple to not include the name -none
to circumvent issue #369 , and managed to link against the ARM GCC Embedded Toolchain for the required memcpy
and related symbols. However, the actual output is still linked to the compiler builtins implementation. It seems that the code to copy slices is still calling the __aeabi_*
functions which uses the implementation defined in mem.rs
.
https://github.com/rust-lang/compiler-builtins/blob/f4c7940d3b13ec879c9fdc218812f71a65149123/src/arm.rs#L143-L148
Related: https://github.com/rust-lang/compiler-builtins/issues/253
@pca006132 I think similar to #365, we could also add optimized variants of memcpy
and friends for aarch64
. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs
could ink to the extern
symbol memcpy
instead of ::mem::memcpy
, but improving the implementation in this crate is probably the best bet.
EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy
, so there's not much we can do about that.
@pca006132 I think similar to #365, we could also add optimized variants of
memcpy
and friends foraarch64
. I'm not sure what the best implementation would be for that architecture. Alternatively, the code inarm.rs
could ink to theextern
symbolmemcpy
instead of::mem::memcpy
, but improving the implementation in this crate is probably the best bet.EDIT: It would seem that LLVM on AArch64 emits calls to
__aeabi_memcpy
, so there's not much we can do about that.
It is good to have optimized variants for some more targets. However, I think embedded systems may have some different requirements than normal applications, and the best way would be to allow them to provide their own implementation if they have to. For example, for armv7a processors, we could use NEON optimized memcpy if we enabled the FPU, but there may be bare-metal programs running before enabling the FPU, so probably they want an unoptimized memcpy implementation which does not use NEON.
I think the easiest way is to use weak linkage, so users can replace it with their own implementation if they want, such as using the implementation from newlib. https://github.com/rust-lang/compiler-builtins/issues/378
I've experimented a bit with faster mem*
functions on a Ryzen 2700X and a i5-5287U and noted the following:
- For small copies/sets it is significantly faster to do two unaligned load stores like so:
#[inline]
unsafe fn small_copy(dest: *mut u8, src: *const u8, count: usize) {
if count < 2 {
*dest = *src;
return
}
if count <= 4 {
let a = src.cast::<u16>().read_unaligned();
let b = src.add(count - 2).cast::<u16>().read_unaligned();
dest.cast::<u16>().write_unaligned(a);
dest.add(count - 2).cast::<u16>().write_unaligned(b);
return
}
if count <= 8 {
let a = src.cast::<u32>().read_unaligned();
let b = src.add(count - 4).cast::<u32>().read_unaligned();
dest.cast::<u32>().write_unaligned(a);
dest.add(count - 4).cast::<u32>().write_unaligned(b);
return
}
if count <= 16 {
let a = src.cast::<u64>().read_unaligned();
let b = src.add(count - 8).cast::<u64>().read_unaligned();
dest.cast::<u64>().write_unaligned(a);
dest.add(count - 8).cast::<u64>().write_unaligned(b);
return
}
if count <= 32 {
let a = src.cast::<u128>().read_unaligned();
let b = src.add(count - 16).cast::<u128>().read_unaligned();
dest.cast::<u128>().write_unaligned(a);
dest.add(count - 16).cast::<u128>().write_unaligned(b);
return
}
}
- For large copies it is faster to use non-temporal stores. This is probably correlated with L2 cache size. This does not seem to apply to
rep stos
. - It's probably worth checking & caching CPUID for ERMS, FSRM et al. and use only
rep movsb
when applicable.
My main concern is that these optimizations can significantly bloat the mem*
functions, so while the performance may be better in benchmarks it may be worse in practice since it uses much more i-cache space.
iirc LLVM already emits small copy operation as a set of load/store instead of calling memcpy. Not sure what is the threshold though.
iirc LLVM already emits small copy operation as a set of load/store instead of calling memcpy. Not sure what is the threshold though.
AFAICT it only does if it knows the size beforehand: https://rust.godbolt.org/z/KK7xj6Goh
I think using the implementation in glibc is probably a better choice here, they are well optimized and are already using things like vectorization etc.
For large copies it is faster to use non-temporal stores.
Non-temporal stores bypass the cache, so it is expected that they are faster. Unfortunately this also means that unless you take care to ensure that the target of the write wasn't in the cache in the first place, you can get inconsistent results as future reads may read it from the cache again and I think if the cacheline is marked as modified that flushing it would overwrite the non-temporal store.
I think using the implementation in glibc is probably a better choice here, they are well optimized and are already using things like vectorization etc.
- Glibs is LGPL licensed.
- These memory operation implementations in compiler-builtins are only used on embedded systems where size is important and on wasm where you won't be able to vectorize much anyway. (The simd feature only gives 128bit vectors and in most cases rust wasm programs are compiled without simd support.)
We could also add an optimized version using RISC-V's Vector Extension (aka the V Extension). This idea was explored in this Reddit post and this writeup. It seems fairly fast and simple.
If we do:
asm!(
"vsetvli {n}, {count}, e8, m8, ta, ma",
"vle8.v v0, ({src})",
"vse8.v v0, ({dest})",
...
);
in a loop, we get assembly (Godbolt link) that is nearly identical to the handwritten example of a memcpy implementation in the spec.
I'm not super familiar with the V extension, but I imagine we could have similar implementations for the other memory functions.
It would be nice if the loop itself could be inside the asm block. Codegen backends which don't have native inline asm support (like my cg_clif) turn asm blocks into calls to a function defined in assembly compiled using an external assembler. Doing such calls in a loop has a decent amount of overhead compared to doing it once and having the loop inside the asm block.
I've recently written some ARM memcpy implementations. Hopefully they can be PR'd into this repo at some point. The only catch is that they're handwritten a32
code and some of the newer ARM targets don't even support using a32
code at all, so they can't be instantly dropped in for all ARM targets.
if anyone wants to get started the code is over here (and all Apache-2.0 OR MIT
just like normal for rust): https://github.com/Lokathor/arm7tdmi_aeabi/blob/b0861b2ec026eaad0142d75b138a2e02c6f38cb9/src/the_code.s
For very small memcpys, could we not just use a couple of registers like llvm does? This should not much affect binary size, no?