
Adjust the io-threads count at runtime to fully utilize multiple threads and use the CPU more efficiently.

zhulipeng opened this issue 1 year ago • 9 comments

Description

This patch addresses an issue where, even with io-threads configured, the IO threads are stopped by the logic below, leaving them unused and increasing latency. Under the existing logic, IO threads are only activated past a hardcoded threshold, which causes QPS regressions that vary with the configured io-threads number even under the same client request load.

int stopThreadedIOIfNeeded(void) {
    int pending = listLength(server.clients_pending_write);

    /* Return ASAP if IO threads are disabled (single threaded mode). */
    if (server.io_threads_num == 1) return 1;

    if (pending < (server.io_threads_num*2)) {
        if (server.io_threads_active) stopThreadedIO();
        return 1;
    } else {
        return 0;
    }
}
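To make the hardcoded threshold concrete, here is a minimal standalone sketch of the cutoff check above (wouldStopThreadedIO is a hypothetical name for illustration, not the server's code): with io-threads 4, threaded IO is stopped whenever fewer than 4 * 2 = 8 clients have pending writes.

```c
/* Standalone sketch of the cutoff in stopThreadedIOIfNeeded(): returns 1
 * when threaded IO would be stopped for this tick. The function name is
 * illustrative only, not an identifier from the Valkey source. */
static int wouldStopThreadedIO(int io_threads_num, int pending) {
    /* Single-threaded mode: threaded IO is always off. */
    if (io_threads_num == 1) return 1;

    /* Hardcoded threshold: stop unless pending >= io_threads_num * 2. */
    return pending < io_threads_num * 2;
}
```

With io-threads 4 and only 10 benchmark clients, pending writes rarely reach 8, so the IO threads are stopped almost every cycle.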

With this patch, the number of active IO threads is adjusted dynamically at runtime according to the real workload, trading off between latency and CPU efficiency. I use a decay-rate (exponential moving average) formula based on clients_pending_write to keep throughput stable.
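As a rough illustration of the decay-rate idea (not the patch's actual code: DECAY_ALPHA, ema_pending, and adjustIoThreadsSketch are hypothetical names, and the constants are assumptions), an EMA of clients_pending_write smooths out short spikes before deriving the active IO thread count from it:

```c
/* Hypothetical sketch of an EMA-driven IO thread count. All names and
 * constants here are illustrative, not taken from the patch. */

#define DECAY_ALPHA 0.8   /* weight given to the previous EMA value */
#define MAX_IO_THREADS 4  /* corresponds to the configured io-threads */

static double ema_pending = 0.0;

/* Fold the current clients_pending_write into the EMA, then derive how
 * many IO threads should stay active: roughly one thread per two
 * smoothed pending clients, clamped to [1, MAX_IO_THREADS]. */
int adjustIoThreadsSketch(int clients_pending_write) {
    ema_pending = DECAY_ALPHA * ema_pending +
                  (1.0 - DECAY_ALPHA) * clients_pending_write;

    int active = (int)(ema_pending / 2.0) + 1;
    if (active > MAX_IO_THREADS) active = MAX_IO_THREADS;
    return active;
}
```

Because the EMA decays gradually, a brief drop in pending writes does not immediately park all IO threads, which is what keeps latency consistent through workload troughs.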

Benchmark Result

Test Environment

  • OS: CentOS Stream 8
  • Kernel: 6.2.0
  • Platform: Intel Xeon Platinum 8380
  • Base commit: b2a397366b1704b03f5042395ff620cd2096f598

Test Steps

Start Valkey-Server

Start valkey-server via taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf with the config below.

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
io-threads 4
io-threads-do-reads yes

Test from throughput perspective

Start valkey-benchmark on localhost with the command below to measure SET and GET performance:

taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 10

Results are listed below. We can see that the IO threads are not used because clients_pending_write < (server.io_threads_num*2), so throughput is limited to a single thread even though more CPU resources were allocated.

NOTE: QPS is reported as a relative value rather than an absolute one. CPU utilization is measured with: perf stat -p `pidof valkey-server` sleep 5

|     | Base CPU utilized | Opt CPU utilized | QPS Gain (Opt/Base-1) |
|-----|-------------------|------------------|-----------------------|
| SET | 1                 | 2                | 23%                   |
| GET | 1                 | 2                | 25%                   |

Test from CPU efficiency perspective

If we increase the number of requests with the command below so that the original io-threads logic kicks in, throughput increases as expected, but all CPUs are fully utilized. With this patch, we can make the CPU more efficient by reducing the number of active IO threads.

taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 15

|     | Base CPU utilized | Opt CPU utilized | QPS Gain (Opt/Base-1) |
|-----|-------------------|------------------|-----------------------|
| SET | 3.999             | 2.908            | 8%                    |
| GET | 3.999             | 2.918            | 3%                    |

zhulipeng avatar Apr 01 '24 03:04 zhulipeng

@valkey-io/core-team Could you help to review this patch?

zhulipeng avatar Apr 04 '24 13:04 zhulipeng

@lipzhu I noticed there wasn't a test, we should at least add one for that.

madolson avatar Apr 04 '24 18:04 madolson

@madolson I am thinking about how to add unit tests for io-threads. Since the change is for performance purposes, I am uncertain how to simulate it in unit tests. Can you provide some guidance?

zhulipeng avatar Apr 07 '24 07:04 zhulipeng

Ping @madolson .

zhulipeng avatar Apr 11 '24 06:04 zhulipeng

My understanding of this change is that it helps with volatile workloads where request rates fluctuate quite a bit. It doesn't allocate/activate more IO worker threads at runtime, but it lets the IO worker threads stay active a bit longer by using an Exponential Moving Average (EMA). This trades a bit of CPU efficiency for more consistent latency through workload spikes and troughs, which can be significant.

Overall I am onboard with this change. I think the only thing I am looking forward to is a test that proves we can still activate all the threads configured via io-threads and the IO threads can indeed quiesce.

PingXie avatar Apr 24 '24 05:04 PingXie

> I am thinking how to add the unit tests for io-threads, it's for performance purpose, I am uncertain how to simulate it in unit tests, can you provide some guidance?

I don't think we need a (unit) test for performance. We will use release candidates to get feedback. What we really need is a correctness test as I mentioned above.

PingXie avatar Apr 24 '24 05:04 PingXie

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.89%. Comparing base (6cff0d6) to head (1eb7c54). Report is 1 commit behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #111      +/-   ##
============================================
+ Coverage     68.91%   69.89%   +0.98%     
============================================
  Files           109      109              
  Lines         61792    61789       -3     
============================================
+ Hits          42581    43186     +605     
+ Misses        19211    18603     -608     
Files            | Coverage Δ
src/networking.c | 91.61% <100.00%> (+6.71%) ⬆️
src/server.c     | 88.61% <100.00%> (+<0.01%) ⬆️

... and 24 files with indirect coverage changes

codecov[bot] avatar Apr 26 '24 02:04 codecov[bot]

@PingXie I added an integration test to validate that the io_threads_active_num EMA logic works properly. Could you help review again?

zhulipeng avatar Apr 28 '24 10:04 zhulipeng

What is the next step in the process? Is this ready to merge? @PingXie

zhulipeng avatar May 07 '24 08:05 zhulipeng