kvrocks Add LTO/IPO compile support

Aug 02 '22 08:08 wanghenshui

It seems GCC builds in linux work well but Clang builds fail in linking phase (https://github.com/apache/incubator-kvrocks/runs/7631095274?check_suite_focus=true).

I have not tried it manually but I guess it maybe due to ld, which cannot recognize *.o generated by Clang with -flto enabled (it is actually LLVM bytecode in LTO mode, not native object file).

Hence you can try to replace ld with lld while clang is used. I can help you to do that if you do not familiar with those stuff.

Aug 02 '22 12:08 PragmaTwice

It seems GCC builds in linux work well but Clang builds fail in linking phase (https://github.com/apache/incubator-kvrocks/runs/7631095274?check_suite_focus=true).

I have not tried it manually but I guess it maybe due to ld, which cannot recognize *.o generated by clang with -flto enabled (it is actually llvm bytecode in LTO mode).

Hence you can try to replace ld with lld while clang is used. I can help you to do that if you do not familiar with those stuff.

I'll fix it. wait for

Aug 02 '22 12:08 wanghenshui

clang not working. i cannot fix it.

It's fine. I will help you.

I'm going to try your branch on my side and commit to it. So please do not force-push from now :rofl:

Aug 03 '22 08:08 PragmaTwice

I think we should add some descriptions for these options, why do we use, their advantage or disadvantages

Aug 03 '22 11:08 ShooterIT

I think we should add some descriptions for these options, why do we use, their advantage or disadvantages

Link-time optimization (LTO) is an important optimization technology in modern compilers.

As we already known, C/C++ compilers generate native object files for every translation unit (TU), which indicates compiler can only get information within a certain TU, i.e. it know nothing about another TU (except declarations). But obviously most compiler optimization approach requires the information in function definitions, like constant propagation, reachable analysis, interprocedual dataflow analysis, inlining, loop invariant analysis, pointer analysis, etc.

So if compiler cannot retrieve such information, the optimization will be just discarded. Obviously, this is a huge loss. LTO postpone the optimization procedures to link-time so (almost) every definition in the program is available to the optimizer. I do not think LTO has any disadvantages other than possibly slowing down compilation.

I do not know why CMake call it interprocedure optimization, since interprocedure optimization can be done without LTO (but limited since it loss information in definitions).

Aug 03 '22 12:08 PragmaTwice

Hi everyone, is there any other thought on this PR? I will merge it if no further discussion 🚀

Aug 04 '22 10:08 PragmaTwice

oh, thanks, is there a performance comparison?

Aug 05 '22 01:08 ShooterIT

oh, thanks, is there a performance comparison?

You can check this review and this paper.

Aug 05 '22 04:08 PragmaTwice

I mean how it affects kvrocks? When we add new features or improvements, we must know their earnings, for example, if this build option make kvrocks performance better, we should know that and how much it improves. Otherwise, other contributors may add new optimization options, we also don't know which option bring earnings or why, we will loss the control of this project, we also can't answer the users's question how to optimize the performance with build options.

I don't say we will merge it only if this option must bring much obvious performance improvement, but we must know it clearly. In addition, not only for this PR changes, even if they are valid on another system, we also don't need to merge it if they don't bring earnings on kvrocks till now, since, for a long time, it will be more and more complex, and make it hard to maintain.

BTW, if you merge this commit, also please describe it clearly in commit log.

Aug 05 '22 05:08 ShooterIT

I mean how it affects kvrocks? When we add new features or improvements, we must know their earnings, for example, if this build option make kvrocks performance better, we should know that and how much it improves. Otherwise, other contributors may add new optimization options, we also don't know which option bring earnings or why, we will loss the control of this project, we also can't answer the users's question how to optimize the performance with build options.

I don't say we will merge it only if this option must bring much obvious performance improvement, but we must know it clearly. In addition, not only for this PR changes, even if they are valid on another system, we also don't need to merge it if they don't bring earnings on kvrocks till now, since, for a long time, it will be more and more complex, and make it hard to maintain.

BTW, if you merge this commit, also please describe it clearly in commit log.

Do we have benchmark test? Or some scripts to test performence easy? We could use those stuff test LTO/before LTO difference.

LTO is free lunch, just like change compiler from c++98 to c++11, c++17, more higher, more gains.

Aug 05 '22 06:08 wanghenshui

Hi @ShooterIT, I agree with your argument that benchmark matters.

I think we can wait for a benchmark before merge it or set ENABLE_IPO to OFF on default since LTO is practical in lots of database projects like mysql, clickhouse and mongodb.

Aug 05 '22 07:08 PragmaTwice

from studying of these options, i think they would bring performance earnings, but till now, we still don't know how much on kvrocks

Aug 05 '22 07:08 ShooterIT

for test, currently we can use redis-benchmark https://github.com/apache/incubator-kvrocks#2--qps-on-different-payloads

Aug 05 '22 07:08 ShooterIT

Closed as no consensus. @wanghenshui I suggest you create an issue first and share the performance benchmark so that our maintainers can be sure whether this change brings good.

Nov 08 '22 08:11 tisonkun

before IPO

redis-benchmark -q -p 6666
ERROR: failed to fetch CONFIG from 127.0.0.1:6666
WARNING: Could not fetch server CONFIG
PING_INLINE: 185528.77 requests per second, p50=0.143 msec                    
PING_MBULK: 191204.59 requests per second, p50=0.135 msec                    
SET: 132802.12 requests per second, p50=0.343 msec                    
GET: 205338.81 requests per second, p50=0.127 msec                    
INCR: 124688.28 requests per second, p50=0.359 msec                    
LPUSH: 95693.78 requests per second, p50=0.455 msec                   
RPUSH: 101936.80 requests per second, p50=0.399 msec                   
LPOP: 90579.71 requests per second, p50=0.463 msec                   
RPOP: 84033.61 requests per second, p50=0.495 msec                   
SADD: 132100.39 requests per second, p50=0.303 msec                    
HSET: 133511.34 requests per second, p50=0.311 msec                    
SPOP: 224719.11 requests per second, p50=0.119 msec                    
ZADD: 180180.17 requests per second, p50=0.231 msec                    
ZPOPMIN: 289855.06 requests per second, p50=0.095 msec                    
LPUSH (needed to benchmark LRANGE): 100200.40 requests per second, p50=0.439 msec                    
LRANGE_100 (first 100 elements): 118063.76 requests per second, p50=0.231 msec                    
LRANGE_300 (first 300 elements): 65445.03 requests per second, p50=0.415 msec                   
LRANGE_500 (first 500 elements): 41876.05 requests per second, p50=0.647 msec                   
LRANGE_600 (first 600 elements): 37735.85 requests per second, p50=0.719 msec                   
MSET (10 keys): 21537.80 requests per second, p50=2.223 msec

after IPO

 redis-benchmark -q -p 6666
ERROR: failed to fetch CONFIG from 127.0.0.1:6666
WARNING: Could not fetch server CONFIG
PING_INLINE: 233100.23 requests per second, p50=0.111 msec                    
PING_MBULK: 241545.89 requests per second, p50=0.111 msec                    
SET: 137741.05 requests per second, p50=0.319 msec                    
GET: 204498.98 requests per second, p50=0.127 msec                    
INCR: 119760.48 requests per second, p50=0.375 msec                    
LPUSH: 99108.03 requests per second, p50=0.439 msec                   
RPUSH: 101626.02 requests per second, p50=0.415 msec                    
LPOP: 86655.11 requests per second, p50=0.463 msec                   
RPOP: 97656.24 requests per second, p50=0.415 msec                    
SADD: 149476.83 requests per second, p50=0.311 msec                    
HSET: 163398.70 requests per second, p50=0.255 msec                    
SPOP: 158478.61 requests per second, p50=0.167 msec                    
ZADD: 136425.66 requests per second, p50=0.327 msec                    
ZPOPMIN: 242718.45 requests per second, p50=0.111 msec                    
LPUSH (needed to benchmark LRANGE): 93720.71 requests per second, p50=0.447 msec                   
LRANGE_100 (first 100 elements): 141442.72 requests per second, p50=0.183 msec                    
LRANGE_300 (first 300 elements): 69832.40 requests per second, p50=0.391 msec                   
LRANGE_500 (first 500 elements): 44247.79 requests per second, p50=0.607 msec                   
LRANGE_600 (first 600 elements): 37721.61 requests per second, p50=0.719 msec                   
MSET (10 keys): 23557.13 requests per second, p50=1.911 msec

after IPO optimize, redis quick bench result was worse, so don't merge it.

Nov 19 '22 14:11 wanghenshui

Thanks for @wanghenshui input, I'd like to have a try as well.

Update: I tested the ping/ping_bulk which won't be affected by the data volume, and this PR improve the performance from 54929.96 => 57378.93. It's worth reopening since we have the data to prove this.

Nov 19 '22 14:11 git-hulk

The merge conflict is resolved now.

Nov 19 '22 15:11 PragmaTwice

I think we can merge it now since the benchmark looks good.

Nov 20 '22 10:11 PragmaTwice

I think we can merge it now since the benchmark looks good.

Yes

Nov 20 '22 11:11 git-hulk

Thanks all, merging...

Nov 20 '22 12:11 git-hulk

Thanks to @wanghenshui contribution again.

Nov 20 '22 12:11 git-hulk