level1wendell/memcpy

Trying to make memcpy() go faster when compiled as a 32 bit binary.

Let's use sse3/128 bit copies.
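As a rough illustration of what a 128-bit copy loop can look like, here is a minimal sketch. It is not the repository's memcpy_sse: the name copy128_sketch is hypothetical, and it assumes the SSE2 unaligned load/store intrinsics from emmintrin.h (available whenever SSE3 is). When the compiler does not define __SSE2__, it simply falls back to memcpy().

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_storeu_si128 */
#endif

/* Hypothetical helper (not the repo's memcpy_sse): copy n bytes in
 * 16-byte (128-bit) chunks with unaligned SSE2 loads and stores, then
 * finish any tail shorter than 16 bytes with plain memcpy. */
static void *copy128_sketch(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
    if (i < n)
        memcpy(d + i, s + i, n - i);  /* remaining 1..15 bytes */
    return dst;
#else
    /* Assumption: no SSE2 available at compile time, so defer to libc. */
    return memcpy(dst, src, n);
#endif
}
```

Like memcpy(), the sketch assumes the two buffers do not overlap. A tuned version would likely also consider alignment and non-temporal stores; those are left out here.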

32 bit:

gcc -march=sse3 -O3 -m32 testmem_modified.c -o tm32

64 bit:

gcc -march=sse3 -O3 -m64 testmem_modified.c -o tm64

usage:

./tm32 (or ./tm64) 32

for a 32 MB memory-to-memory copy. If you want, replace memcpy_sse with memcpy(), which is a built-in function. When compiled with -m32 it is glacially slow, even on CPUs that have far more advanced features than, e.g., the eax register.
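The measurement described above (copy N MB, time it, then check that the buffers match) can be sketched roughly as follows. This is not the repository's testmem_modified.c: time_copy_sketch is a hypothetical name, and it assumes clock() from time.h is precise enough for a millisecond-scale copy.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical harness (not the repo's test program).
 * Copies `mb` megabytes once, stores the elapsed time in *ms_out, and
 * returns 0 if the buffers match, 1 if they differ, -1 if malloc failed. */
static int time_copy_sketch(size_t mb, double *ms_out)
{
    size_t n = mb * 1024u * 1024u;
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, n);
    memset(dst, 0, n);

    clock_t t0 = clock();
    memcpy(dst, src, n);  /* swap in the routine under test here */
    clock_t t1 = clock();

    *ms_out = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;

    int differs = (memcmp(dst, src, n) != 0);
    free(src);
    free(dst);
    return differs;
}
```

Calling time_copy_sketch(32, &ms) corresponds to the "./tm32 32" run. A single pass is noisy, so repeating the copy and keeping the best time would probably give steadier numbers.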

with default memcpy() on 32 bit compile:

./tm32 32
32 MB = 3.388237 ms
-Compare match (should be zero): 0

with memcpy_sse() on 32 bit compile:

./tm32 32
32 MB = 0.759420 ms (LOWER IS INSANELY WAY BETTER...)
-Compare match (should be zero): 0

with standard memcpy() on 64 bit compile:

./tm64 32
32 MB = 0.759102 ms (AS INTENDED)
-Compare match (should be zero): 0

The test system is an AMD Threadripper 1950X clocked at 4.1 GHz with DDR4-3200 memory.

UPDATE: It has been suggested to try some additional gcc params to keep memcpy() from being inlined. This does help on SOME systems, but not all: not on Fedora 27 + Threadripper, for example.

#Non-inlined memcpy is sometimes, but not always, garbage. On this TR system it is garbage...

gcc -march=k8-sse3 -m32 -O3 -o tm32 -fno-builtin-inline -fno-inline testmem_modified.c -S

Forcing gcc not to inline memcpy by calling it through a function pointer (called as memcpy_ptr) doesn't speed it up either:

void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
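For reference, the function pointer trick can be sketched like this. The volatile qualifier and the wrapper name copy_via_ptr are assumptions added for illustration, not part of the original test. Without volatile, an optimizing compiler may still see that the pointer always holds memcpy and inline the call anyway.

```c
#include <stddef.h>
#include <string.h>

/* The pointer itself is volatile (an assumption, not in the original),
 * so the compiler has to reload it on every call and cannot assume the
 * target is memcpy. */
static void *(*volatile memcpy_ptr)(void *, const void *, size_t) = memcpy;

/* Hypothetical wrapper: always reaches libc's memcpy through a real call. */
static void *copy_via_ptr(void *dst, const void *src, size_t n)
{
    return memcpy_ptr(dst, src, n);
}
```

This only controls whether the call is inlined. It does not change which memcpy implementation the C library picks at run time, which may be why it made no difference on the Threadripper system.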
