kvrocks
kvrocks copied to clipboard
Add LTO/IPO compile support
It seems GCC builds in linux work well but Clang builds fail in linking phase (https://github.com/apache/incubator-kvrocks/runs/7631095274?check_suite_focus=true).
I have not tried it manually but I guess it maybe due to ld, which cannot recognize *.o
generated by Clang with -flto
enabled (it is actually LLVM bytecode in LTO mode, not native object file).
Hence you can try to replace ld with lld while clang is used. I can help you to do that if you do not familiar with those stuff.
It seems GCC builds in linux work well but Clang builds fail in linking phase (https://github.com/apache/incubator-kvrocks/runs/7631095274?check_suite_focus=true).
I have not tried it manually but I guess it maybe due to ld, which cannot recognize
*.o
generated by clang with-flto
enabled (it is actually llvm bytecode in LTO mode).Hence you can try to replace ld with lld while clang is used. I can help you to do that if you do not familiar with those stuff.
I'll fix it. wait for
clang not working. i cannot fix it.
It's fine. I will help you.
I'm going to try your branch on my side and commit to it. So please do not force-push from now :rofl:
I think we should add some descriptions for these options, why do we use, their advantage or disadvantages
I think we should add some descriptions for these options, why do we use, their advantage or disadvantages
Link-time optimization (LTO) is an important optimization technology in modern compilers.
As we already known, C/C++ compilers generate native object files for every translation unit (TU), which indicates compiler can only get information within a certain TU, i.e. it know nothing about another TU (except declarations). But obviously most compiler optimization approach requires the information in function definitions, like constant propagation, reachable analysis, interprocedual dataflow analysis, inlining, loop invariant analysis, pointer analysis, etc.
So if compiler cannot retrieve such information, the optimization will be just discarded. Obviously, this is a huge loss. LTO postpone the optimization procedures to link-time so (almost) every definition in the program is available to the optimizer. I do not think LTO has any disadvantages other than possibly slowing down compilation.
I do not know why CMake call it interprocedure optimization, since interprocedure optimization can be done without LTO (but limited since it loss information in definitions).
Hi everyone, is there any other thought on this PR? I will merge it if no further discussion 🚀
oh, thanks, is there a performance comparison?
I mean how it affects kvrocks? When we add new features or improvements, we must know their earnings, for example, if this build option make kvrocks performance better, we should know that and how much it improves. Otherwise, other contributors may add new optimization options, we also don't know which option bring earnings or why, we will loss the control of this project, we also can't answer the users's question how to optimize the performance with build options.
I don't say we will merge it only if this option must bring much obvious performance improvement, but we must know it clearly. In addition, not only for this PR changes, even if they are valid on another system, we also don't need to merge it if they don't bring earnings on kvrocks till now, since, for a long time, it will be more and more complex, and make it hard to maintain.
BTW, if you merge this commit, also please describe it clearly in commit log.
I mean how it affects kvrocks? When we add new features or improvements, we must know their earnings, for example, if this build option make kvrocks performance better, we should know that and how much it improves. Otherwise, other contributors may add new optimization options, we also don't know which option bring earnings or why, we will loss the control of this project, we also can't answer the users's question how to optimize the performance with build options.
I don't say we will merge it only if this option must bring much obvious performance improvement, but we must know it clearly. In addition, not only for this PR changes, even if they are valid on another system, we also don't need to merge it if they don't bring earnings on kvrocks till now, since, for a long time, it will be more and more complex, and make it hard to maintain.
BTW, if you merge this commit, also please describe it clearly in commit log.
Do we have benchmark test? Or some scripts to test performence easy? We could use those stuff test LTO/before LTO difference.
LTO is free lunch, just like change compiler from c++98 to c++11, c++17, more higher, more gains.
Hi @ShooterIT, I agree with your argument that benchmark matters.
I think we can wait for a benchmark before merge it or set ENABLE_IPO
to OFF on default since LTO is practical in lots of database projects like mysql, clickhouse and mongodb.
from studying of these options, i think they would bring performance earnings, but till now, we still don't know how much on kvrocks
for test, currently we can use redis-benchmark https://github.com/apache/incubator-kvrocks#2--qps-on-different-payloads
Closed as no consensus. @wanghenshui I suggest you create an issue first and share the performance benchmark so that our maintainers can be sure whether this change brings good.
before IPO
redis-benchmark -q -p 6666
ERROR: failed to fetch CONFIG from 127.0.0.1:6666
WARNING: Could not fetch server CONFIG
PING_INLINE: 185528.77 requests per second, p50=0.143 msec
PING_MBULK: 191204.59 requests per second, p50=0.135 msec
SET: 132802.12 requests per second, p50=0.343 msec
GET: 205338.81 requests per second, p50=0.127 msec
INCR: 124688.28 requests per second, p50=0.359 msec
LPUSH: 95693.78 requests per second, p50=0.455 msec
RPUSH: 101936.80 requests per second, p50=0.399 msec
LPOP: 90579.71 requests per second, p50=0.463 msec
RPOP: 84033.61 requests per second, p50=0.495 msec
SADD: 132100.39 requests per second, p50=0.303 msec
HSET: 133511.34 requests per second, p50=0.311 msec
SPOP: 224719.11 requests per second, p50=0.119 msec
ZADD: 180180.17 requests per second, p50=0.231 msec
ZPOPMIN: 289855.06 requests per second, p50=0.095 msec
LPUSH (needed to benchmark LRANGE): 100200.40 requests per second, p50=0.439 msec
LRANGE_100 (first 100 elements): 118063.76 requests per second, p50=0.231 msec
LRANGE_300 (first 300 elements): 65445.03 requests per second, p50=0.415 msec
LRANGE_500 (first 500 elements): 41876.05 requests per second, p50=0.647 msec
LRANGE_600 (first 600 elements): 37735.85 requests per second, p50=0.719 msec
MSET (10 keys): 21537.80 requests per second, p50=2.223 msec
after IPO
redis-benchmark -q -p 6666
ERROR: failed to fetch CONFIG from 127.0.0.1:6666
WARNING: Could not fetch server CONFIG
PING_INLINE: 233100.23 requests per second, p50=0.111 msec
PING_MBULK: 241545.89 requests per second, p50=0.111 msec
SET: 137741.05 requests per second, p50=0.319 msec
GET: 204498.98 requests per second, p50=0.127 msec
INCR: 119760.48 requests per second, p50=0.375 msec
LPUSH: 99108.03 requests per second, p50=0.439 msec
RPUSH: 101626.02 requests per second, p50=0.415 msec
LPOP: 86655.11 requests per second, p50=0.463 msec
RPOP: 97656.24 requests per second, p50=0.415 msec
SADD: 149476.83 requests per second, p50=0.311 msec
HSET: 163398.70 requests per second, p50=0.255 msec
SPOP: 158478.61 requests per second, p50=0.167 msec
ZADD: 136425.66 requests per second, p50=0.327 msec
ZPOPMIN: 242718.45 requests per second, p50=0.111 msec
LPUSH (needed to benchmark LRANGE): 93720.71 requests per second, p50=0.447 msec
LRANGE_100 (first 100 elements): 141442.72 requests per second, p50=0.183 msec
LRANGE_300 (first 300 elements): 69832.40 requests per second, p50=0.391 msec
LRANGE_500 (first 500 elements): 44247.79 requests per second, p50=0.607 msec
LRANGE_600 (first 600 elements): 37721.61 requests per second, p50=0.719 msec
MSET (10 keys): 23557.13 requests per second, p50=1.911 msec
after IPO optimize, redis quick bench result was worse, so don't merge it.
Thanks for @wanghenshui input, I'd like to have a try as well.
Update: I tested the ping/ping_bulk which won't be affected by the data volume, and this PR improve the performance from 54929.96
=> 57378.93
. It's worth reopening since we have the data to prove this.
The merge conflict is resolved now.
I think we can merge it now since the benchmark looks good.
I think we can merge it now since the benchmark looks good.
Yes
Thanks all, merging...
Thanks to @wanghenshui contribution again.