
Performance degradation 7.0.3 vs 6.2.7

Open academe-01 opened this issue 1 year ago • 12 comments

Hi, I tested on different systems, but every time the results are the same: 7.x performs worse than 6.x (config attached).

redis-benchmark -q -n 1000000 --threads 64
(benchmark results screenshot attached)

redis_conf.txt

academe-01 avatar Jul 15 '22 02:07 academe-01

Hi there @academe-01, trying to reproduce as follows:

# spin server
taskset -c 0 `pwd`/src/redis-server --logfile redis-opt.log --save "" --daemonize yes

# benchmark
taskset -c 1-3 redis-benchmark --csv -n 1000000 --threads 3

I was able to measure ~3.7% overhead, which was described/justified before in https://github.com/redis/redis/issues/10460

due to the following features:

Function | %CPU time | Note
updateClientMemUsage | 1.50% | (after the improvement of #10401)
ACLCheckAllUserCommandPerm | 1.20% | #9974
updateCommandLatencyHistogram (can be disabled) | 0.80% | #9462
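
For reference, the last item can be switched off via the latency-tracking config introduced in 7.0 (a minimal sketch; this only removes the histogram cost, not the other two):

# redis.conf: latency-tracking no
# or at runtime, without a restart:
redis-cli CONFIG SET latency-tracking no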

Here's the benchmark table from a run on my local lab (ideally we should run this for longer and multiple times):

test | v6.2.7 | v7.0.3 | % change
PING_INLINE | 148126 | 142837 | -3.6%
PING_MBULK | 153822 | 148126 | -3.7%
SET | 148126 | 142837 | -3.6%
GET | 153822 | 148148 | -3.7%
INCR | 153822 | 148148 | -3.7%
LPUSH | 142837 | 142837 | 0.0%
RPUSH | 148126 | 142837 | -3.6%
LPOP | 148148 | 142837 | -3.6%
RPOP | 148148 | 142857 | -3.6%
SADD | 148148 | 142857 | -3.6%
HSET | 148104 | 142837 | -3.6%
SPOP | 153822 | 148126 | -3.7%
ZADD | 142837 | 137931 | -3.4%
ZPOPMIN | 153799 | 148148 | -3.7%
LPUSH (needed to benchmark LRANGE) | 142837 | 142837 | 0.0%
LRANGE_100 (first 100 elements) | 83306 | 76894 | -7.7%
LRANGE_300 (first 300 elements) | 41639 | 37709 | -9.4%
LRANGE_500 (first 500 elements) | 27742 | 24807 | -10.6%
LRANGE_600 (first 600 elements) | 23776 | 21235 | -10.7%
MSET (10 keys) | 108073 | 105241 | -2.6%

One interesting thing is that the LRANGE tests show a larger drop. I'll look into this in more detail over the following days and report back.

filipecosta90 avatar Jul 15 '22 03:07 filipecosta90

Hi @filipecosta90, my tests were made with RDB disabled (save "") and I always pin redis to specific cores/NUMA nodes. Sorry I didn't mention it before. I also have io-threads 8. Please re-run your tests using more threads/cores.
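
A sketch of that kind of setup (the core numbers, NUMA node and log file below are illustrative, not the exact ones from this report):

# pin the server and its I/O threads to one NUMA node, persistence disabled
numactl --cpunodebind=0 --membind=0 \
  ./src/redis-server --save "" --appendonly no --io-threads 8 --daemonize yes --logfile redis.log

# pin the benchmark client to a different set of cores
taskset -c 8-15 redis-benchmark -q -n 1000000 --threads 64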

academe-01 avatar Jul 15 '22 03:07 academe-01

hi @filipecosta90 @academe-01 It's weird that when I test with the commands provided by filipecosta90, the LRANGE* results on unstable are sometimes faster than v6.2.7.

unstable

"LPUSH (needed to benchmark LRANGE)","10395.88","4.490","1.384","4.191","6.031","6.783","13.863"
"LRANGE_100 (first 100 elements)","9979.64","4.637","1.272","4.383","6.223","6.927","12.503"
"LRANGE_300 (first 300 elements)","9132.25","4.962","1.336","4.815","6.639","7.343","11.495"
"LRANGE_500 (first 500 elements)","8176.08","5.485","1.040","5.343","7.487","8.791","39.519"
"LRANGE_600 (first 600 elements)","7821.91","5.706","1.320","5.583","7.799","8.991","22.191"

6.2.7

"LRANGE_100 (first 100 elements)","9807.77","4.715","1.088","4.487","6.463","7.591","25.087"
"LRANGE_300 (first 300 elements)","9070.29","4.986","1.216","4.823","6.807","7.871","23.663"
"LRANGE_500 (first 500 elements)","8452.01","5.292","1.016","5.143","7.271","8.479","34.591"
"LRANGE_600 (first 600 elements)","7947.23","5.612","1.272","5.415","7.943","9.735","39.999"

sundb avatar Jul 15 '22 07:07 sundb

It's weird that when I test with the commands provided by filipecosta90, the LRANGE* results on unstable are sometimes faster than v6.2.7.

@sundb notice that on unstable we have https://github.com/redis/redis/pull/10697, which is not present on v7.0.3 if I'm not mistaken.

filipecosta90 avatar Jul 15 '22 08:07 filipecosta90

@filipecosta90 #10697 does exist in v7.0.3. at the time 7.0.3 was released, everything in unstable was included in it.
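
a quick way to verify that kind of containment from a redis.git checkout (a minimal sketch; the merge commit sha of the PR has to be looked up first):

# does tag 7.0.3 contain the commit?
git tag --contains <merge-commit-sha> | grep -x 7.0.3

# or, equivalently
git merge-base --is-ancestor <merge-commit-sha> 7.0.3 && echo "included in 7.0.3"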

oranagra avatar Jul 17 '22 06:07 oranagra

@filipecosta90 one thing i'm still puzzled about is how it is related to #10987. IIUC it's just that while you investigated this issue you came across that opportunity. do we know the reason for the regression reported here?

oranagra avatar Jul 19 '22 07:07 oranagra

@filipecosta90 one thing i'm still puzzled about is how it is related to #10987. IIUC it's just that while you investigated this issue you came across that opportunity.

yes. given I focused on the LRANGE benchmarks due to the higher diff, I bumped into #10987

do we know the reason for the regression reported here?

The following 7.0 features are known to take some CPU:

Function | %CPU time | Note
updateClientMemUsage | 1.50% | (after the improvement of #10401)
ACLCheckAllUserCommandPerm | 1.20% | #9974
updateCommandLatencyHistogram (can be disabled) | 0.80% | #9462

but there is still a gap left to explain. I'm working on it

filipecosta90 avatar Jul 19 '22 08:07 filipecosta90

I ran redis-benchmark with -t lrange for 6.2.7 and 7.0.3 under a customized gprof that samples variables while profiling. I noticed the main difference is in the function lookupKey in db.c: the samples of the variable key are dramatically different, and all samples from 7.0.3 are zeroed. I attached the files here; maybe they can help explain the performance degradation. lookupKey_val_6.2.7.txt lookupKey_val_7.0.2.txt
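
For anyone who wants to reproduce a comparable profile with stock tooling, a perf-based sketch (assuming Linux, a locally built redis-server with symbols, and a single running instance):

# sample the running server while the LRANGE tests execute
perf record -F 999 -g -p $(pgrep -x redis-server) -- sleep 60 &
redis-benchmark -q -n 1000000 -t lrange
wait

# then compare the hot symbols (lookupKey*, expireIfNeeded, ...) across versions
perf report --stdio --sort symbol | head -40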

wenglingmei avatar Oct 04 '22 23:10 wenglingmei

@wenglingmei i don't understand what you mean by "customized gprof", and what are "samples on the variable key"

oranagra avatar Oct 06 '22 06:10 oranagra

@wenglingmei i don't understand what you mean by "customized gprof", and what are "samples on the variable key"

Hi @oranagra, thank you for the reply. I should have mentioned that the customized gprof is my research project, implemented on top of gprof. Put simply, if you check the file, I recorded the value of the variable "key", which is the parameter of the function "lookupKey" defined at src/db.c line 66 in redis-6.2.7.

In this test, many more variables were sampled, including the parameter "key". My tool flags its values as a major difference between the two versions. If you want more detail, I would be happy to post it here.

wenglingmei avatar Oct 06 '22 16:10 wenglingmei

@wenglingmei i'm still not sure i understand what it does. in the 7.0.3 report we see it reports lines 1691 and 1683 (not about lookupKey and no access to the "key" variable). could it simply be that there was some refactoring, and we added more wrapper functions around lookupKey, and / or an argument named key is mentioned more?

oranagra avatar Oct 06 '22 17:10 oranagra

~~@oranagra, we apply some techniques to collect local variables even after the function has called into its callees. We noticed the function "expireIfNeeded" was moved into the function "lookupKey" from the lookupKey*() family. Because we also sample the functions of the lookupKey*() family, e.g. "lookupKeyReadWithFlags", the values of the variable "key" from them and their callees are also recorded (I did not paste them into the text file in the previous comment).~~

~~Considering all the values of "key" from all these functions, in 6.2.7 far fewer sampled values of "key" are zero compared to 7.0.3. I wondered if the code refactoring increases the frequency of deleting a key during the benchmark.~~

I just checked the benchmark source code and realized the keys are not given an expiration time. Sorry for the 'stupid' guess. I will explore components other than db.c.

wenglingmei avatar Oct 06 '22 18:10 wenglingmei

Hello, I also observed a performance drop when comparing Redis 6.2.7 with Redis 7.0.5 on MacOS (M1, arm64). I am running my Go application benchmarks:

> benchstat redis_6.txt redis_7.txt
name                      old time/op    new time/op    delta
RedisPublish_History-8    10.3µs ± 0%    13.3µs ± 1%  +28.74%  (p=0.008 n=5+5)
RedisAddPresence-8        3.67µs ± 3%    5.34µs ± 0%  +45.54%  (p=0.008 n=5+5)

So the overall operation latency regression is significant.

  • RedisPublish_History - executes a Lua script that contains several different commands (XADD, PUBLISH, EXPIRE, HSET, EXISTS)
  • RedisAddPresence - a Lua script with ZADD, HSET, EXPIRE (a rough sketch of this kind of script follows below)
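
Roughly, the presence script boils down to something like this (key names, arguments, and the TTL are made up for illustration; the real benchmark scripts differ):

redis-cli EVAL "redis.call('zadd', KEYS[1], ARGV[1], ARGV[2]);redis.call('hset', KEYS[2], ARGV[2], ARGV[3]);redis.call('expire', KEYS[1], ARGV[4]);redis.call('expire', KEYS[2], ARGV[4]);return redis.status_reply('OK')" 2 presence:zset presence:data 1668250000 client42 '{}' 60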

The benchmarking code is the same; the only difference is the Redis server version the benchmarks are launched against.

In the case above, Redis CPU usage was up to 100% during the load. I also decreased the load from the app to keep Redis under 100% CPU, and the results are like this:

❯ benchstat redis_6.txt redis_7.txt
name                                old time/op    new time/op    delta
RedisPublish_History-8    11.3µs ± 0%    15.4µs ± 0%  +35.94%  (p=0.008 n=5+5)
RedisAddPresence-8        4.80µs ± 0%    7.06µs ± 0%  +47.02%  (p=0.008 n=5+5)

Again, worse in the Redis 7 case.

Redis launched like this (6.2.7):

/opt/homebrew/opt/redis@6.2/bin/redis-server --save ""

And like this (7.0.5):

redis-server --save ""

FZambia avatar Nov 12 '22 12:11 FZambia

@FZambia that's a pretty big difference which we didn't see in our own tests, can you maybe help us narrow it down? maybe by running redis-benchmark against your servers to see if it exposes it, and if not, try to reduce your workload to a minimal one that exposes it. how did you obtain / build these two servers? maybe this is a result of a different build setup / configuration?

oranagra avatar Nov 12 '22 16:11 oranagra

Since I am using Lua scripts, possibly the latency degradation is somehow related to them?

Running this:

redis-benchmark -n 1000000 eval 'redis.call("hset", "h1", "k", "v");redis.call("hset", "h2", "k", "v");redis.call("expire", "h1", 3600);redis.call("expire", "h2", 3600);return redis.call("ping")' 0

On Redis 6:

Summary:
  throughput summary: 175254.12 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.238     0.080     0.239     0.327     0.367     2.199

On Redis 7:

Summary:
  throughput summary: 126566.26 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.347     0.080     0.343     0.431     0.487     2.847

how did you obtain / build these two servers?

Both Redis versions were installed from brew in my case (M1, arm64). I don't have access to Linux these days; possibly on Linux the benchmarks would produce better results? If yes, then the degradation on Darwin is not really a concern for me.

FZambia avatar Nov 13 '22 12:11 FZambia

@FZambia I used git bisect and found that this performance degradation is caused by https://github.com/redis/redis/pull/10220
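
For reference, that kind of hunt can be automated; a rough sketch (the refs and the throughput threshold are placeholders, and it assumes the two endpoints share history so bisect can walk between them):

git bisect start <bad-ref> <good-ref>
cat > /tmp/bisect-bench.sh <<'EOF'
#!/bin/sh
set -e
make distclean > /dev/null && make -j8 > /dev/null
./src/redis-server --port 7777 --save "" --daemonize yes
sleep 1
rps=$(redis-benchmark -p 7777 -q -n 100000 -t set | awk '{print int($2)}')
redis-cli -p 7777 shutdown nosave || true
echo "SET rps: $rps"
[ "$rps" -gt 140000 ]    # exit 0 = fast enough (good), non-zero = regressed (bad)
EOF
chmod +x /tmp/bisect-bench.sh
git bisect run /tmp/bisect-bench.sh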

sundb avatar Nov 13 '22 14:11 sundb

@sundb i guess you mean that you already used bisect and found the offending commit? i'm guessing this is related to the old optimization that this commit removed, but it's odd, since @yoav-steinberg did post a benchmark showing it didn't have any impact.

oranagra avatar Nov 13 '22 14:11 oranagra

I tried to run my app benchmarks with Redis before #10220 and after, building Redis from source. Results are:

❯ benchstat before.txt after.txt
name                  old time/op    new time/op    delta
RedisAddPresence-8    4.65µs ± 0%    5.12µs ± 1%  +10.31%  (p=0.000 n=9+9)

So it definitely affected latency, but I suppose it's not the only source of the degradation, as with Redis 6.2.7 I observed 3.6µs of latency in my benchmarks. Unfortunately, I cannot build Redis from source from the 6.2.7 tag on M1 ARM64 to confirm the latency observed with the brew-installed 6.2.7; I'm getting:

    CC adlist.o
    CC quicklist.o
    CC ae.o
    CC anet.o
    CC dict.o
make[1]: *** No rule to make target `functions.h', needed by `server.o'.  Stop.

FZambia avatar Nov 14 '22 19:11 FZambia

For the redis-benchmark command I previously sent:

redis-benchmark -n 1000000 eval 'redis.call("hset", "h1", "k", "v");redis.call("hset", "h2", "k", "v");redis.call("expire", "h1", 3600);redis.call("expire", "h2", 3600);return redis.call("ping")' 0

Before #10220:

Summary:
  throughput summary: 149499.17 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.289     0.080     0.287     0.375     0.415     3.039

After:

Summary:
  throughput summary: 138293.45 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.316     0.080     0.319     0.391     0.439     2.463

So it's slower, but again, it seems the overall difference between 6.2.7 and 7.0.5 accumulated through some other commits as well.

FZambia avatar Nov 14 '22 19:11 FZambia

@FZambia the reason you failed to build 6.2.7 is that when switching between branches, you often have to do make distclean. i think it is important to compare things that were built the same way, so that we rule out unexpected differences due to different build environments / flags.
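
i.e. something along these lines when switching between tags (a minimal sketch):

git checkout 6.2.7
make distclean      # drop objects/headers left over from the other branch
make -j8
./src/redis-server --save ""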

I was also able to reproduce a big performance difference between 7.0 and 6.2 using your benchmark line (on x86_64 Linux), and didn't see a similar big difference when testing the before and after of #10220. p.s. i think it's better to add -P 10 so that we are sure to measure the speed of redis and not the network / client round-trip latency.

oranagra avatar Nov 15 '22 06:11 oranagra

the reason you failed to build 6.2.7 was that when switching between branches, you often have to do make distclean

This helped, thanks!

For my Go app benchmark

6.2.7: 3.6µs
before #10220: 4.7µs
after #10220: 5.2µs
7.0.5: 5.1µs

For redis-benchmark

For redis-benchmark -n 1000000 -P 10 eval 'redis.call("hset", "h1", "k", "v");redis.call("hset", "h2", "k", "v");redis.call("expire", "h1", 3600);redis.call("expire", "h2", 3600);return redis.call("ping")' 0:

6.2.7:

Summary:
  throughput summary: 315059.84 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.529     0.568     1.519     1.631     2.015     5.919

before #10220:

Summary:
  throughput summary: 229515.73 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.120     0.896     2.111     2.207     2.551     8.015

after #10220:

Summary:
  throughput summary: 207641.20 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.348     1.120     2.335     2.447     2.751     8.535

7.0.5:

  throughput summary: 191681.05 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.538     0.912     2.535     2.631     2.895     8.775

As we can see, the latency just keeps getting worse over time, and #10220 is not the only reason for it.

FZambia avatar Nov 15 '22 07:11 FZambia

@FZambia / @oranagra I've added this benchmark to the redis benchmarks SPEC, and in the next couple of hours we will have profile info and numbers for 6.2.7 and the 7.0.* versions to detect where the issue is coming from and try to address it.

https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-2keys-lua-eval-hset-expire.yml

I will follow up on this with more data tomorrow.

filipecosta90 avatar Nov 16 '22 22:11 filipecosta90

@oranagra @filipecosta90 It seems to be related to the following two commits, which removed the argv caches: https://github.com/redis/redis/commit/48c49c485155ba9e4a7851fd1644c171674c6f0f https://github.com/redis/redis/commit/4f686555ce962e6632235d824512ea8fdeda003c

before #10220: (profiling screenshot attached)

#10220: (profiling screenshot attached)

sundb avatar Nov 17 '22 12:11 sundb

How does that sit with the previous statement that the majority of the degradation from 6.2 was before #10220? And the benchmark in the PR that shows the optimization doesn't contribute much?

oranagra avatar Nov 17 '22 14:11 oranagra

@sundb it seems that some of the degradation (half of it) is caused by us calling evalCalcFunctionName 2x when not required. It bumped the numbers in unstable by 10%. WRT the argv caches being removed I also see:

  • + 3.21% luaArgsToRedisArgv
  • + 1.73% freeClientArgv

Indeed it might be worth reintroducing that optimization if possible @oranagra / @yoav-steinberg

filipecosta90 avatar Nov 17 '22 19:11 filipecosta90

@filipecosta90 It's better to change the benchmark command to:

redis-benchmark -n 1000000 -P 10 eval 'redis.call("hset", "h1", "k", "v");redis.call("hset", "h2", "k", "v");return redis.call("ping")' 0

#8474 affects the EXPIRE command, resulting in a ~10% reduction in benchmark performance. This is mainly due to the cost of rewriteClientCommandVector in #8474.

with #8474: P50=2.239

before #8474: P50=2.479

sundb avatar Nov 18 '22 08:11 sundb

~~#8015 results in a 15% performance degradation.~~ ~~Since server.repl_backlog will be created even if there is no slave, all command propagation will be wasted.~~ ~~ping @soloestoy~~ Already fixed in #9166

sundb avatar Nov 18 '22 10:11 sundb

Already fixed in #9166

This means we just need to re-enable the argv caches, correct? @sundb / @oranagra / @yoav-steinberg do you want me to try that on #11521?

filipecosta90 avatar Nov 18 '22 13:11 filipecosta90

@filipecosta90 There are other modifications that can cause performance degradation, but they may be unavoidable: #9819 ~-1.5%, #9780 ~-3. I need to do more work to check.

sundb avatar Nov 18 '22 14:11 sundb

This means we just need to re-enable the argv caches, correct? @sundb / @oranagra / @yoav-steinberg do you want me to try that on #11521?

yes, if we see that this optimization had value, let's bring it back. i guess we better do that in a separate PR (since they're unrelated, no need to delay the first one)

oranagra avatar Nov 20 '22 08:11 oranagra