memcpy_sse
Trying to make memcpy() go faster when compiled as a 32-bit binary.
Let's use SSE3 / 128-bit copies.
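As a rough illustration of what a 128-bit copy loop can look like, here is a minimal sketch. It is not the memcpy_sse from testmem_modified.c; the name copy128_sketch is hypothetical, and it assumes unaligned SSE2 loads and stores (from emmintrin.h) are good enough. The build must target a CPU with SSE2 enabled.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>

/* Hypothetical name, for illustration only; not the author's memcpy_sse. */
static void *copy128_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Move 16 bytes per iteration with unaligned 128-bit loads and stores. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }

    /* Copy the remaining 0..15 bytes one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Like memcpy(), this sketch assumes the source and destination do not overlap. A tuned version would likely handle alignment and unroll the loop, which is left out here.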
32 bit:
    gcc -march=sse3 -O3 -m32 testmem_modified.c -o tm32
64 bit:
    gcc -march=sse3 -O3 -m64 testmem_modified.c -o tm64
Usage:
    ./tm32 (or ./tm64) 32
for a 32 MB memory-to-memory copy. If you want, replace memcpy_sse with memcpy(), which is a built-in function. When compiled with -m32, memcpy() is glacially slow, even on CPUs that offer much wider registers than the 32-bit eax.
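The measurement itself can be pictured roughly as follows. This is only a sketch of a timing harness, not the contents of testmem_modified.c: the helper name time_copy_ms is hypothetical, and the use of clock_gettime() with CLOCK_MONOTONIC and a single untimed warm-up pass over both buffers are assumptions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std modes */
#endif
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `mb` megabytes once and returns the elapsed
 * time in milliseconds, or -1.0 if allocation fails. *cmp receives the
 * memcmp() result, which should be zero after a correct copy. */
static double time_copy_ms(size_t mb, int *cmp)
{
    size_t bytes = mb * 1024 * 1024;
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    struct timespec t0, t1;
    double ms;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, bytes);  /* swap in the copy routine under test here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    *cmp = memcmp(dst, src, bytes);

    free(src);
    free(dst);
    return ms;
}
```

A main() could parse the size from argv[1], call time_copy_ms(), and print the elapsed time together with the compare result. A single run is noisy, so repeating the copy and taking the best time is probably more trustworthy.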
With default memcpy() on 32 bit compile:
    ./tm32 32
    32 MB = 3.388237 ms
    -Compare match (should be zero): 0
With memcpy_sse() on 32 bit compile:
    ./tm32 32
    32 MB = 0.759420 ms (LOWER IS INSANELY WAY BETTER...)
    -Compare match (should be zero): 0
With standard memcpy() on 64 bit compile:
    ./tm64 32
    32 MB = 0.759102 ms (AS INTENDED)
    -Compare match (should be zero): 0
The test system is an AMD Threadripper 1950X clocked at 4.1 GHz with DDR4-3200 memory.
UPDATE: It has been suggested to try some additional gcc parameters that keep memcpy() from being inlined. This helps on SOME systems, but not all; it does not help on Fedora 27 + Threadripper, for example.
# Non-inlined memcpy is sometimes, but not always, garbage. On this TR system it is garbage...
gcc -march=k8-sse3 -m32 -O3 -o tm32 -fno-builtin-inline -fno-inline testmem_modified.c -S
Calling memcpy through a function pointer (call it as memcpy_ptr), which forces gcc not to inline it, also doesn't speed up memcpy:
    void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
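For reference, the function-pointer approach could be wired up roughly like this. The wrapper name copy_via_ptr is hypothetical, and the volatile qualifier is an assumption added here so the compiler has to reload the pointer at each call instead of resolving it back to memcpy at compile time.

```c
#include <stddef.h>
#include <string.h>

/* Pointer to the library memcpy, as in the text. `volatile` is an added
 * assumption: it keeps the compiler from folding the indirect call back
 * into a direct (and possibly inlined) memcpy. */
static void *(*volatile memcpy_ptr)(void *, const void *, size_t) = memcpy;

/* Hypothetical wrapper: every copy goes through the pointer. */
static void *copy_via_ptr(void *dst, const void *src, size_t n)
{
    return memcpy_ptr(dst, src, n);
}
```

This only changes how the call is made. It still lands in whatever memcpy the 32-bit C library provides, which is consistent with the observation that it brings no speedup on this system.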