compiler-builtins icon indicating copy to clipboard operation
compiler-builtins copied to clipboard

memcpy and co. are really unoptimized

Open CryZe opened this issue 4 years ago • 14 comments

I've seen the assembly of the memcpy here for both powerpc and wasm (where the one here seems to be used unless you use the WASI target) and in both cases really simplistic byte wise loops are emitted. I haven't really done any benchmarks but these likely perform worse than more optimized loops that work on 32-bit or so at a time.

CryZe avatar Feb 07 '20 16:02 CryZe

Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs

zhaofengli avatar May 03 '20 21:05 zhaofengli

Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs

I would caution against using Redox's (current) memcpy implementation as it relies on undefined behavior (doing an unaligned load on a usize). See: https://gitlab.redox-os.org/redox-os/kernel/issues/99. This is UB on all platforms, even those that allow for unaligned loads, and will immediately fault on platforms that ban unaligned loads (like ARM).

If you need an optimized implementation, you should probably just link against a libc that provides them. Example of how to do this for musl on bare-metal x86_64: https://github.com/rust-osdev/cargo-xbuild/issues/77#issuecomment-629945246

josephlr avatar May 18 '20 05:05 josephlr

May I ask that is there a way to specify other memcpy implementation? I'm on ARM platform, and the memcpy is not very efficient for my use case as I have many [u8] slices that have to be copied around.

I've tried to modify the target triple to not include the name -none to circumvent issue #369 , and managed to link against the ARM GCC Embedded Toolchain for the required memcpy and related symbols. However, the actual output is still linked to the compiler builtins implementation. It seems that the code to copy slices is still calling the __aeabi_* functions which uses the implementation defined in mem.rs.

https://github.com/rust-lang/compiler-builtins/blob/f4c7940d3b13ec879c9fdc218812f71a65149123/src/arm.rs#L143-L148

Related: https://github.com/rust-lang/compiler-builtins/issues/253

pca006132 avatar Aug 19 '20 09:08 pca006132

@pca006132 I think similar to #365, we could also add optimized variants of memcpy and friends for aarch64. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs could ink to the extern symbol memcpy instead of ::mem::memcpy, but improving the implementation in this crate is probably the best bet.

EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy, so there's not much we can do about that.

josephlr avatar Oct 02 '20 06:10 josephlr

@pca006132 I think similar to #365, we could also add optimized variants of memcpy and friends for aarch64. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs could ink to the extern symbol memcpy instead of ::mem::memcpy, but improving the implementation in this crate is probably the best bet.

EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy, so there's not much we can do about that.

It is good to have optimized variants for some more targets. However, I think embedded systems may have some different requirements than normal applications, and the best way would be to allow them to provide their own implementation if they have to. For example, for armv7a processors, we could use NEON optimized memcpy if we enabled the FPU, but there may be bare-metal programs running before enabling the FPU, so probably they want an unoptimized memcpy implementation which does not use NEON.

I think the easiest way is to use weak linkage, so users can replace it with their own implementation if they want, such as using the implementation from newlib. https://github.com/rust-lang/compiler-builtins/issues/378

pca006132 avatar Oct 02 '20 10:10 pca006132

I've experimented a bit with faster mem* functions on a Ryzen 2700X and a i5-5287U and noted the following:

  • For small copies/sets it is significantly faster to do two unaligned load stores like so:
#[inline]
unsafe fn small_copy(dest: *mut u8, src: *const u8, count: usize) {
    if count < 2 {
        *dest = *src;
        return
    }
    if count <= 4 {
        let a = src.cast::<u16>().read_unaligned();
        let b = src.add(count - 2).cast::<u16>().read_unaligned();
        dest.cast::<u16>().write_unaligned(a);
        dest.add(count - 2).cast::<u16>().write_unaligned(b);
        return
    }
    if count <= 8 {
        let a = src.cast::<u32>().read_unaligned();
        let b = src.add(count - 4).cast::<u32>().read_unaligned();
        dest.cast::<u32>().write_unaligned(a);
        dest.add(count - 4).cast::<u32>().write_unaligned(b);
        return
    }
    if count <= 16 {
        let a = src.cast::<u64>().read_unaligned();
        let b = src.add(count - 8).cast::<u64>().read_unaligned();
        dest.cast::<u64>().write_unaligned(a);
        dest.add(count - 8).cast::<u64>().write_unaligned(b);
        return
    }
    if count <= 32 {
        let a = src.cast::<u128>().read_unaligned();
        let b = src.add(count - 16).cast::<u128>().read_unaligned();
        dest.cast::<u128>().write_unaligned(a);
        dest.add(count - 16).cast::<u128>().write_unaligned(b);
        return
    }
}
  • For large copies it is faster to use non-temporal stores. This is probably correlated with L2 cache size. This does not seem to apply to rep stos.
  • It's probably worth checking & caching CPUID for ERMS, FSRM et al. and use only rep movsb when applicable.

My main concern is that these optimizations can significantly bloat the mem* functions, so while the performance may be better in benchmarks it may be worse in practice since it uses much more i-cache space.

Demindiro avatar Jul 07 '22 12:07 Demindiro

iirc LLVM already emits small copy operation as a set of load/store instead of calling memcpy. Not sure what is the threshold though.

pca006132 avatar Jul 07 '22 12:07 pca006132

iirc LLVM already emits small copy operation as a set of load/store instead of calling memcpy. Not sure what is the threshold though.

AFAICT it only does if it knows the size beforehand: https://rust.godbolt.org/z/KK7xj6Goh

Demindiro avatar Jul 07 '22 12:07 Demindiro

I think using the implementation in glibc is probably a better choice here, they are well optimized and are already using things like vectorization etc.

pca006132 avatar Jul 07 '22 13:07 pca006132

For large copies it is faster to use non-temporal stores.

Non-temporal stores bypass the cache, so it is expected that they are faster. Unfortunately this also means that unless you take care to ensure that the target of the write wasn't in the cache in the first place, you can get inconsistent results as future reads may read it from the cache again and I think if the cacheline is marked as modified that flushing it would overwrite the non-temporal store.

I think using the implementation in glibc is probably a better choice here, they are well optimized and are already using things like vectorization etc.

  1. Glibs is LGPL licensed.
  2. These memory operation implementations in compiler-builtins are only used on embedded systems where size is important and on wasm where you won't be able to vectorize much anyway. (The simd feature only gives 128bit vectors and in most cases rust wasm programs are compiled without simd support.)

bjorn3 avatar Jul 07 '22 13:07 bjorn3

We could also add an optimized version using RISC-V's Vector Extension (aka the V Extension). This idea was explored in this Reddit post and this writeup. It seems fairly fast and simple.

If we do:

asm!(
  "vsetvli  {n}, {count}, e8, m8, ta, ma",
  "vle8.v   v0, ({src})",
  "vse8.v   v0, ({dest})",
  ...
);

in a loop, we get assembly (Godbolt link) that is nearly identical to the handwritten example of a memcpy implementation in the spec.

I'm not super familiar with the V extension, but I imagine we could have similar implementations for the other memory functions.

josephlr avatar Jul 21 '22 08:07 josephlr

It would be nice if the loop itself could be inside the asm block. Codegen backends which don't have native inline asm support (like my cg_clif) turn asm blocks into calls to a function defined in assembly compiled using an external assembler. Doing such calls in a loop has a decent amount of overhead compared to doing it once and having the loop inside the asm block.

bjorn3 avatar Jul 21 '22 09:07 bjorn3

I've recently written some ARM memcpy implementations. Hopefully they can be PR'd into this repo at some point. The only catch is that they're handwritten a32 code and some of the newer ARM targets don't even support using a32 code at all, so they can't be instantly dropped in for all ARM targets.

if anyone wants to get started the code is over here (and all Apache-2.0 OR MIT just like normal for rust): https://github.com/Lokathor/arm7tdmi_aeabi/blob/b0861b2ec026eaad0142d75b138a2e02c6f38cb9/src/the_code.s

Lokathor avatar Jul 31 '22 19:07 Lokathor

For very small memcpys, could we not just use a couple of registers like llvm does? This should not much affect binary size, no?

ChrisDenton avatar Nov 01 '23 09:11 ChrisDenton