FastMemcpy
GCC 10.2.1 Results
gcc version 10.2.1 20201007 releases/gcc-10.2.0-350-g136256c32d (Clear Linux OS for Intel Architecture)
./FastMemcpy

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=48ms memcpy=35 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=34 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst aligned, src unalign): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=54ms memcpy=34 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=52 ms
result(dst unalign, src aligned): memcpy_fast=93ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=51 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=91ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=90ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=20 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=20 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=20 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=43ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=33 ms
result(dst unalign, src unalign): memcpy_fast=43ms memcpy=34 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=43 ms
result(dst aligned, src unalign): memcpy_fast=55ms memcpy=44 ms
result(dst unalign, src aligned): memcpy_fast=55ms memcpy=47 ms
result(dst unalign, src unalign): memcpy_fast=55ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=70 ms
result(dst aligned, src unalign): memcpy_fast=88ms memcpy=78 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=74 ms
result(dst unalign, src unalign): memcpy_fast=91ms memcpy=75 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=96ms memcpy=90 ms
result(dst aligned, src unalign): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=92 ms

benchmark random access:
memcpy_fast=802ms memcpy=662ms
./FastMemcpy_Avx

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=64ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=64ms memcpy=51 ms
result(dst unalign, src aligned): memcpy_fast=66ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=66ms memcpy=52 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=43ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=20ms memcpy=19 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=21 ms
result(dst unalign, src aligned): memcpy_fast=21ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=21ms memcpy=21 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=21ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=22ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=22ms memcpy=33 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=90ms memcpy=45 ms
result(dst aligned, src unalign): memcpy_fast=90ms memcpy=45 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=48 ms
result(dst unalign, src unalign): memcpy_fast=88ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=72 ms
result(dst aligned, src unalign): memcpy_fast=92ms memcpy=79 ms
result(dst unalign, src aligned): memcpy_fast=88ms memcpy=76 ms
result(dst unalign, src unalign): memcpy_fast=87ms memcpy=77 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst aligned, src unalign): memcpy_fast=98ms memcpy=92 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=95 ms

benchmark random access:
memcpy_fast=796ms memcpy=687ms
The benchmark is not quite correctly implemented for the following reasons:
- The compiler can easily do constant propagation of the size parameter and then replace memcpy with a builtin for small sizes. The benchmark function should be marked as noinline. Moreover, the "function cloning" optimization should be disabled.
- It's not enough to test with power-of-two sizes, because "tail" processing is not taken into account.
- When you use the original memcpy, the code from glibc is used. It is compiled separately by the OS maintainers and does not depend on your compiler, but it does depend on your machine (glibc dispatches dynamically on the supported instruction set). You did not provide any information about your machine. Ideally, it should be tested on a variety of different CPUs.
- Testing in a loop with the same size is misleading because the branches will be predictable.
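The points above can be sketched in code. What follows is only a rough illustration, not the FastMemcpy benchmark and not a finished harness: `bench_copy`, `copy_fn`, `next_random` and `BENCH_OPAQUE` are hypothetical names made up for this example. It assumes GCC or Clang, calls the copy routine through a function pointer from a noinline (and, on GCC, noclone) function so the size cannot be constant-propagated, and draws a random size and random source/destination offsets on every iteration so that tails are exercised and branches are harder to predict.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical function-pointer type shared by memcpy and any candidate. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Small xorshift PRNG so every candidate sees the same size/offset sequence. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* noinline keeps the benchmark body out of the caller; noclone (GCC only)
   asks the compiler not to create a specialised copy for constant arguments. */
#if defined(__clang__)
#define BENCH_OPAQUE __attribute__((noinline))
#elif defined(__GNUC__)
#define BENCH_OPAQUE __attribute__((noinline, noclone))
#else
#define BENCH_OPAQUE
#endif

/* Returns elapsed CPU time in milliseconds, or -1.0 on bad input or
   allocation failure. PRNG overhead is included for every candidate alike. */
BENCH_OPAQUE
double bench_copy(copy_fn fn, size_t max_size, size_t iterations, uint32_t seed)
{
    if (fn == NULL || max_size == 0)
        return -1.0;

    /* The extra 64 bytes leave room for the random misalignment offsets. */
    unsigned char *src = malloc(max_size + 64);
    unsigned char *dst = malloc(max_size + 64);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, max_size + 64);
    memset(dst, 0, max_size + 64);

    uint32_t state = seed ? seed : 1u;   /* xorshift must not start at 0 */
    volatile unsigned char sink = 0;     /* keeps the copies observable */

    clock_t start = clock();
    for (size_t i = 0; i < iterations; i++) {
        size_t n = 1 + next_random(&state) % max_size;  /* 1..max_size */
        size_t src_off = next_random(&state) % 64;
        size_t dst_off = next_random(&state) % 64;
        fn(dst + dst_off, src + src_off, n);
        sink ^= dst[dst_off + n - 1];
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

A caller would run something like `bench_copy(memcpy, 4096, 1000000, 42u)` and then the same call with the candidate routine and the same seed. Building with `-fno-builtin-memcpy` is an additional precaution one might take; whether it changes anything depends on the compiler and flags.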
Bottom line: the library is probably OK, but the benchmark is nonsense.
Could you just post the correct benchmark code?
Another variable here is the false assumption (at least, one I had myself) that standard libraries aren't using vector instructions.
I read some of the libc source code, and it uses handwritten AVX2 for memcpy, memcmp and a few others when the architecture supports it. I also tested this on a machine that maxes out at AVX2 instructions, so that could easily explain these results.
(And they had comments in there that they don't implement AVX512 because they've experimented and determined that the frequency downgrade is detrimental to overall application performance.)
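Since the libc path taken depends on the machine, it helps to report the CPU features and libc version next to any numbers. Here is a minimal sketch of how that might be done, assuming GCC or Clang on x86 and glibc; `libc_version`, `cpu_has_avx2` and `cpu_has_avx512f` are hypothetical helper names for this example, while `__builtin_cpu_supports` and `gnu_get_libc_version` are the compiler and glibc facilities they wrap. It only reports what the CPU supports, not which memcpy variant glibc actually selected.

```c
#include <stdio.h>
#if defined(__GLIBC__)
#include <gnu/libc-version.h>
#endif

/* Hypothetical helper: glibc version string, or a placeholder elsewhere. */
const char *libc_version(void)
{
#if defined(__GLIBC__)
    return gnu_get_libc_version();
#else
    return "unknown (not glibc)";
#endif
}

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#define HAVE_CPU_BUILTINS 1
#else
#define HAVE_CPU_BUILTINS 0
#endif

/* Hypothetical helpers: 1 if supported, 0 if not, -1 if it cannot be
   determined with the compiler builtins on this platform. */
int cpu_has_avx2(void)
{
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}

int cpu_has_avx512f(void)
{
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return -1;
#endif
}

/* Prints one line that can be pasted alongside benchmark results. */
void print_machine_info(void)
{
    printf("libc=%s avx2=%d avx512f=%d\n",
           libc_version(), cpu_has_avx2(), cpu_has_avx512f());
}
```

Calling `print_machine_info()` at the start of a benchmark run would make results from different machines easier to compare.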
Also, even if the benchmark might not be ideal, it's still legitimate and shows head-to-head performance in at least one subset of all possible implementations (whether it captures a realistic pattern or not, I don't know). But yeah, as @zhanglistar basically said, we'd all love to flip the tables on libc again!
I have run ClickHouse performance test and can confirm that glibc's memcpy is better than FastMemcpy (at least on one machine):
https://clickhouse-test-reports.s3.yandex.net/17111/213266b80cbc1489b411929568bd9cc8c8173c8d/performance_comparison/report.html#fail1
Although the mean difference is very small: 0.5%.
The maximum speedup (that I'm confident in) is about 16%, on the following query:
SELECT count() FROM zeros(1000000) WHERE NOT ignore(materialize('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') AS s, concat(s,s,s,s,s,s,s,s,s,s) AS t, concat(t,t,t,t,t,t,t,t,t,t) AS u) SETTINGS max_block_size = 1000
that is very memcpy-heavy (see these concats).
We have to continue using custom memcpy instead of glibc's to maintain compatibility with old glibc.
Ice Lake is bringing us these goodies (less frequency downscaling)!
