FastMemcpy
Slower on later GCC
This actually appears to be slower on GCC 5.4:
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=42ms memcpy=48 ms
result(dst aligned, src unalign): memcpy_fast=46ms memcpy=54 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=53 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=55 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=44ms memcpy=57 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=60 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=65 ms
result(dst unalign, src unalign): memcpy_fast=43ms memcpy=62 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=77ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=82ms memcpy=61 ms
result(dst unalign, src aligned): memcpy_fast=81ms memcpy=61 ms
result(dst unalign, src unalign): memcpy_fast=79ms memcpy=61 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=79ms memcpy=45 ms
result(dst aligned, src unalign): memcpy_fast=77ms memcpy=47 ms
result(dst unalign, src aligned): memcpy_fast=77ms memcpy=50 ms
result(dst unalign, src unalign): memcpy_fast=76ms memcpy=55 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=39ms memcpy=33 ms
result(dst aligned, src unalign): memcpy_fast=47ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=40ms memcpy=46 ms
result(dst unalign, src unalign): memcpy_fast=45ms memcpy=49 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=30 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=31 ms
result(dst unalign, src aligned): memcpy_fast=48ms memcpy=43 ms
result(dst unalign, src unalign): memcpy_fast=48ms memcpy=43 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=82ms memcpy=68 ms
result(dst aligned, src unalign): memcpy_fast=84ms memcpy=68 ms
result(dst unalign, src aligned): memcpy_fast=82ms memcpy=67 ms
result(dst unalign, src unalign): memcpy_fast=81ms memcpy=72 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=114ms memcpy=110 ms
result(dst aligned, src unalign): memcpy_fast=101ms memcpy=107 ms
result(dst unalign, src aligned): memcpy_fast=103ms memcpy=103 ms
result(dst unalign, src unalign): memcpy_fast=101ms memcpy=105 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=113ms memcpy=108 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=107 ms
result(dst unalign, src aligned): memcpy_fast=104ms memcpy=107 ms
result(dst unalign, src unalign): memcpy_fast=101ms memcpy=107 ms
benchmark random access:
memcpy_fast=647ms memcpy=593ms
The parameters need to be tuned.
What parameters?
Using VS2017, I got similar results. I set the -arch option as indicated and the optimization level to O2.
Also look at this comment: https://github.com/skywind3000/FastMemcpy/issues/6#issuecomment-723501146 It is probably actually slower, but we cannot be sure from the built-in benchmark.
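One way to cross-check numbers like these outside the built-in benchmark is a small standalone timing loop. The sketch below is only an illustration under some assumptions: `time_copy_ms` and `copy_fn` are hypothetical names that are not part of FastMemcpy, and `memcpy_fast` is assumed to share `memcpy`'s signature. It calls the routine through a volatile function pointer and reads back a copied byte, so the compiler is less likely to inline or drop the copy.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed common signature for memcpy and memcpy_fast. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical helper (not part of FastMemcpy): runs `times` copies of
   `size` bytes through `fn` and returns the elapsed time in milliseconds,
   or a negative value if the arguments or allocations are invalid. */
static double time_copy_ms(copy_fn fn, size_t size, long times)
{
    if (fn == NULL || size == 0 || times <= 0)
        return -1.0;

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    /* The pointer is read from a volatile object on every iteration,
       which keeps the compiler from inlining or eliding the call. */
    copy_fn volatile call = fn;
    volatile unsigned char sink = 0;

    clock_t start = clock();
    for (long i = 0; i < times; i++) {
        call(dst, src, size);
        sink ^= dst[(size_t)i % size];  /* consume the copied data */
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_copy_ms(memcpy, 1024, 4194304)` and `time_copy_ms(memcpy_fast, 1024, 4194304)` from the same binary, built with the same flags, should give a comparable pair of timings. `clock()` is coarse, so short runs need a large `times` value to be meaningful.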
Same here with gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) and glibc 2.27:
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=230ms memcpy=63 ms
result(dst aligned, src unalign): memcpy_fast=218ms memcpy=63 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=62 ms
result(dst unalign, src unalign): memcpy_fast=216ms memcpy=64 ms
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=311ms memcpy=64 ms
result(dst aligned, src unalign): memcpy_fast=314ms memcpy=63 ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=62 ms
result(dst unalign, src unalign): memcpy_fast=318ms memcpy=64 ms
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=701ms memcpy=86 ms
result(dst aligned, src unalign): memcpy_fast=695ms memcpy=80 ms
result(dst unalign, src aligned): memcpy_fast=757ms memcpy=88 ms
result(dst unalign, src unalign): memcpy_fast=761ms memcpy=82 ms
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=651ms memcpy=75 ms
result(dst aligned, src unalign): memcpy_fast=658ms memcpy=74 ms
result(dst unalign, src aligned): memcpy_fast=696ms memcpy=75 ms
result(dst unalign, src unalign): memcpy_fast=709ms memcpy=76 ms
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=319ms memcpy=38 ms
result(dst aligned, src unalign): memcpy_fast=330ms memcpy=41 ms
result(dst unalign, src aligned): memcpy_fast=329ms memcpy=38 ms
result(dst unalign, src unalign): memcpy_fast=334ms memcpy=40 ms
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=319ms memcpy=36 ms
result(dst aligned, src unalign): memcpy_fast=322ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=325ms memcpy=35 ms
result(dst unalign, src unalign): memcpy_fast=327ms memcpy=43 ms
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=351ms memcpy=91 ms
result(dst aligned, src unalign): memcpy_fast=356ms memcpy=90 ms
result(dst unalign, src aligned): memcpy_fast=363ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=360ms memcpy=90 ms
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=331ms memcpy=93 ms
result(dst aligned, src unalign): memcpy_fast=335ms memcpy=94 ms
result(dst unalign, src aligned): memcpy_fast=339ms memcpy=97 ms
result(dst unalign, src unalign): memcpy_fast=339ms memcpy=99 ms
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=333ms memcpy=98 ms
result(dst aligned, src unalign): memcpy_fast=355ms memcpy=115 ms
result(dst unalign, src aligned): memcpy_fast=357ms memcpy=116 ms
result(dst unalign, src unalign): memcpy_fast=349ms memcpy=117 ms
benchmark random access:
memcpy_fast=1953ms memcpy=635ms