valkey [NEW] SET performance with io-threads is significantly worse when clients use pipelining

The Problem

I was benchmarking GET/SET performance between different release versions and realized that io-threads performance is significantly worse when I test with pipelining=4. For some reason, prefetching seems to be less effective when clients pipeline commands.
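For context on what pipelining=4 means on the wire, here is a minimal sketch, assuming the standard RESP request encoding (an array of bulk strings). The helper names `encode_command` and `build_pipeline`, and the key and value strings, are hypothetical and only for illustration. The point is that the server receives four commands in a single buffer instead of one command per round trip.

```python
def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def build_pipeline(commands):
    """Concatenate several encoded commands into one buffer (one socket write)."""
    return b"".join(encode_command(*cmd) for cmd in commands)


# Four SET commands sent back to back, mirroring a pipeline depth of 4.
# Key and value names are made up for illustration.
batch = [("SET", f"key:{i}", "value") for i in range(4)]
payload = build_pipeline(batch)
```

A real client would write `payload` to the socket once and then read four replies, which is why the server sees several commands per read when clients pipeline.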

Desired Improvement

I think we should investigate why this is the case and improve prefetching with pipelining, since we recommend some amount of pipelining to get the best performance from Valkey.

I've tested three different value sizes now. Each test was run for 1 hour:

testing unstable with io-threads=9:

testing unstable with 512B data size and io-threads=9:

Feb 19 '25 19:02 rainsupreme

The one exception seems to be the SADD case, which doesn't make a lot of sense to me, but OK.

@uriyage @ranshid Would you mind reviewing this as well? It looks like Valkey 8.1 RC1 is better in all cases except SET with 9 threads. I'm not aware of any place where we would expect a large regression, but maybe we should take a look to see if something else was introduced that is consuming CPU.

Feb 26 '25 00:02 madolson

I've rerun the tests, this time throwing out the first 5 minutes of warmup data then running valkey-benchmark over a 1 hour period. Hardware and key size are the same as before. Currently, it looks like GET performance increases with pipelining, but SET performance still suffers from pipelining, though unstable has a smaller penalty than 8.0.
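The warmup handling described above can be sketched roughly as follows. This assumes throughput is recorded as periodic (elapsed seconds, requests per second) samples, which may not match how the numbers were actually collected. The function name `steady_state_rps` and the sample values are hypothetical, and the numbers are synthetic, not measured results.

```python
from statistics import mean

WARMUP_SECONDS = 5 * 60  # first 5 minutes are discarded as warmup


def steady_state_rps(samples, warmup_seconds=WARMUP_SECONDS):
    """Average requests/sec over samples taken after the warmup window.

    `samples` is a list of (elapsed_seconds, requests_per_second) pairs.
    """
    kept = [rps for elapsed, rps in samples if elapsed >= warmup_seconds]
    if not kept:
        raise ValueError("no samples left after discarding warmup")
    return mean(kept)


# Synthetic numbers for illustration only, not measured results.
samples = [(0, 100.0), (150, 200.0), (300, 400.0), (600, 600.0)]
average = steady_state_rps(samples)
```

Only the samples at 300 s and 600 s survive the cutoff here, so the early ramp-up does not pull the average down.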

Mar 10 '25 20:03 rainsupreme

I also fixed these graphs after realizing my previous GET tests had a 0% hit rate.
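One way to catch a 0% hit rate before a long run is to check the `keyspace_hits` and `keyspace_misses` counters reported by `INFO stats`. Here is a minimal sketch of that check, assuming the INFO text has already been fetched from the server. The function name `hit_rate` and the sample text are hypothetical.

```python
def hit_rate(info_text: str) -> float:
    """Compute the hit rate from keyspace_hits / keyspace_misses in INFO output."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers such as "# Stats", and malformed lines.
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        stats[name] = value
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


# Made-up INFO excerpt where every lookup missed, as in a GET test
# run against keys that were never populated.
sample = "# Stats\r\nkeyspace_hits:0\r\nkeyspace_misses:1000\r\n"
rate = hit_rate(sample)
```

A rate near zero on a GET benchmark suggests the keys being read were never written, so the test is mostly measuring lookups that miss.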

Mar 13 '25 20:03 rainsupreme

Any update on this issue? I am using Valkey 8.2 on AWS ElastiCache with 2 replica nodes, each of size r7g.large.

I am seeing quite a lot of timeouts when using SADD (especially when the set doesn't already exist) with pipelined batches of 50 items each. We were using much larger batch sizes earlier and reduced them because of the timeouts, but it's still not better.

Total load on the cluster is around 15K to 20K RPM.

Nov 17 '25 09:11 Dhruv-Garg79