
Cache CLUSTER SLOTS response to improve throughput and reduce latency.

Open roshkhatri opened this issue 10 months ago • 12 comments

This PR adds logic to cache the CLUSTER SLOTS response to reduce latency, and updates the cache when a change in the cluster is detected.

Historically, the CLUSTER SLOTS command was deprecated; however, server clients have kept using CLUSTER SLOTS and have not migrated to CLUSTER SHARDS. In the future, this logic can be extended to other commands to improve the performance of the engine.
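At a high level the approach can be sketched as below. This is only an illustrative outline with hypothetical names (buildClusterSlotsReply, clusterInvalidateSlotsCache, cached_cluster_slots_reply), not the exact code in this PR:

#include "server.h"

/* Serialized RESP bytes of the last CLUSTER SLOTS reply, or NULL if the
 * cache is empty/stale. */
static sds cached_cluster_slots_reply = NULL;

/* Called from every code path that changes slot ownership or node
 * addresses (failover, slot migration, node added/removed, ...). */
void clusterInvalidateSlotsCache(void) {
    if (cached_cluster_slots_reply) {
        sdsfree(cached_cluster_slots_reply);
        cached_cluster_slots_reply = NULL;
    }
}

/* CLUSTER SLOTS handler (hypothetical name): build the reply once, then
 * serve the cached bytes on every subsequent call. */
void clusterSlotsCommand(client *c) {
    if (cached_cluster_slots_reply == NULL)
        cached_cluster_slots_reply = buildClusterSlotsReply(); /* hypothetical */
    addReplyProto(c, cached_cluster_slots_reply, sdslen(cached_cluster_slots_reply));
}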

To measure the performance gain this PR has to offer, I ran benchmarks for 2 scenarios with 2 primaries in the cluster:

  1. Best-case scenario - Contiguous slot ownership, where the two primaries own slots 0-8191 and 8192-16383 respectively.
  2. Worst-case scenario - Fully fragmented slot ownership, where one primary owns the odd-numbered slots and the other owns the even-numbered slots (see the sketch right after this list for why this maximizes the reply size).
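To make the worst case concrete: CLUSTER SLOTS reports one entry per contiguous range of slots owned by a node, so odd/even interleaving yields 8192 single-slot ranges per primary instead of one range, and the reply (and the work to build it) grows accordingly. The tiny standalone program below (illustration only, not engine code) shows the range counts:

#include <stdio.h>

#define CLUSTER_SLOTS 16384

/* Count the contiguous slot ranges owned by `owner` in a slot->owner map. */
static int count_ranges(const int *slot_owner, int owner) {
    int ranges = 0;
    for (int s = 0; s < CLUSTER_SLOTS; s++) {
        if (slot_owner[s] == owner && (s == 0 || slot_owner[s - 1] != owner)) ranges++;
    }
    return ranges;
}

int main(void) {
    static int best[CLUSTER_SLOTS], worst[CLUSTER_SLOTS];
    for (int s = 0; s < CLUSTER_SLOTS; s++) {
        best[s] = (s < 8192) ? 0 : 1; /* contiguous halves */
        worst[s] = s % 2;             /* odd/even interleaving */
    }
    printf("best case:  primary 0 owns %d range(s)\n", count_ranges(best, 0));  /* prints 1 */
    printf("worst case: primary 0 owns %d range(s)\n", count_ranges(worst, 0)); /* prints 8192 */
    return 0;
}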

Complete Benchmark results:

  1. For the BEST CASE we see a 76% gain in throughput (from ~21k to ~37k RPS) and a 2x drop in average latency (from 0.044 ms to 0.021 ms) over 100,000 requests.

BEST CASE unstable branch:

% src/placeholderkv-benchmark -n 100000 -c 1 CLUSTER SLOTS
====== CLUSTER SLOTS ======                                                   
  100000 requests completed in 4.60 seconds
  1 parallel clients
  28 bytes payload
  keep alive: 1
  host configuration "save": 
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.039 milliseconds (cumulative count 1)
50.000% <= 0.047 milliseconds (cumulative count 99148)
99.219% <= 0.055 milliseconds (cumulative count 99740)
99.805% <= 0.063 milliseconds (cumulative count 99873)
99.902% <= 0.127 milliseconds (cumulative count 99916)
99.951% <= 0.167 milliseconds (cumulative count 99955)
99.976% <= 0.223 milliseconds (cumulative count 99980)
99.988% <= 0.239 milliseconds (cumulative count 99988)
99.994% <= 0.287 milliseconds (cumulative count 99994)
99.997% <= 0.303 milliseconds (cumulative count 99998)
99.998% <= 0.311 milliseconds (cumulative count 99999)
99.999% <= 0.631 milliseconds (cumulative count 100000)
100.000% <= 0.631 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
99.892% <= 0.103 milliseconds (cumulative count 99892)
99.966% <= 0.207 milliseconds (cumulative count 99966)
99.998% <= 0.303 milliseconds (cumulative count 99998)
99.999% <= 0.407 milliseconds (cumulative count 99999)
100.000% <= 0.703 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 21724.96 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.044     0.032     0.047     0.047     0.047     0.631

BEST CASE This PR:

% src/placeholderkv-benchmark -n 100000 -c 1 CLUSTER SLOTS
====== CLUSTER SLOTS ======                                                   
  100000 requests completed in 2.67 seconds
  1 parallel clients
  28 bytes payload
  keep alive: 1
  host configuration "save": 
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 94098)
96.875% <= 0.031 milliseconds (cumulative count 99712)
99.805% <= 0.039 milliseconds (cumulative count 99899)
99.902% <= 0.047 milliseconds (cumulative count 99925)
99.951% <= 0.111 milliseconds (cumulative count 99962)
99.976% <= 0.191 milliseconds (cumulative count 99977)
99.988% <= 0.311 milliseconds (cumulative count 99988)
99.994% <= 0.399 milliseconds (cumulative count 99994)
99.997% <= 0.431 milliseconds (cumulative count 99997)
99.998% <= 0.471 milliseconds (cumulative count 99999)
99.999% <= 0.551 milliseconds (cumulative count 100000)
100.000% <= 0.551 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
99.947% <= 0.103 milliseconds (cumulative count 99947)
99.980% <= 0.207 milliseconds (cumulative count 99980)
99.986% <= 0.303 milliseconds (cumulative count 99986)
99.995% <= 0.407 milliseconds (cumulative count 99995)
99.999% <= 0.503 milliseconds (cumulative count 99999)
100.000% <= 0.607 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 37439.16 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.021     0.016     0.023     0.031     0.031     0.551

  2. For the WORST CASE we see a 24% gain in throughput (from 49.07 to 61.15 RPS) and a 22x drop in average latency (from 4.026 ms to 0.181 ms) over 1,000 requests.

WORST CASE unstable - Single Benchmark:

% src/placeholderkv-benchmark -n 1000 -c 1 CLUSTER SLOTS
====== CLUSTER SLOTS ======                                             
  1000 requests completed in 20.38 seconds
  1 parallel clients
  28 bytes payload
  keep alive: 1
  host configuration "save": 
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 3.983 milliseconds (cumulative count 10)
50.000% <= 4.007 milliseconds (cumulative count 679)
75.000% <= 4.015 milliseconds (cumulative count 768)
87.500% <= 4.087 milliseconds (cumulative count 889)
93.750% <= 4.103 milliseconds (cumulative count 955)
96.875% <= 4.119 milliseconds (cumulative count 970)
98.438% <= 4.167 milliseconds (cumulative count 986)
99.219% <= 4.727 milliseconds (cumulative count 993)
99.609% <= 5.127 milliseconds (cumulative count 997)
99.805% <= 5.367 milliseconds (cumulative count 999)
99.902% <= 5.463 milliseconds (cumulative count 1000)
100.000% <= 5.463 milliseconds (cumulative count 1000)

Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
95.500% <= 4.103 milliseconds (cumulative count 955)
99.600% <= 5.103 milliseconds (cumulative count 996)
100.000% <= 6.103 milliseconds (cumulative count 1000)

Summary:
  throughput summary: 49.07 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        4.026     3.976     4.007     4.103     4.591     5.463

WORST CASE This PR - Single Benchmark:

% src/placeholderkv-benchmark -n 1000 -c 1 CLUSTER SLOTS
====== CLUSTER SLOTS ======                                             
  1000 requests completed in 16.35 seconds
  1 parallel clients
  28 bytes payload
  keep alive: 1
  host configuration "save": 
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.159 milliseconds (cumulative count 3)
50.000% <= 0.175 milliseconds (cumulative count 804)
87.500% <= 0.183 milliseconds (cumulative count 924)
93.750% <= 0.191 milliseconds (cumulative count 955)
96.875% <= 0.199 milliseconds (cumulative count 969)
98.438% <= 0.263 milliseconds (cumulative count 988)
99.219% <= 0.271 milliseconds (cumulative count 998)
99.805% <= 0.831 milliseconds (cumulative count 999)
99.902% <= 6.935 milliseconds (cumulative count 1000)
100.000% <= 6.935 milliseconds (cumulative count 1000)

Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
97.200% <= 0.207 milliseconds (cumulative count 972)
99.800% <= 0.303 milliseconds (cumulative count 998)
99.900% <= 0.903 milliseconds (cumulative count 999)
100.000% <= 7.103 milliseconds (cumulative count 1000)

Summary:
  throughput summary: 61.15 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.181     0.152     0.175     0.191     0.271     6.935

It seemed like the benchmark tool itself was the bottleneck in the worst case.

So I also tried running 5 benchmarks in parallel for the worst-case scenario, on unstable and on this PR. On unstable, each benchmark ran at around 25 RPS with ~17.8 ms average latency, and server CPU utilization went up to around 56%. With this PR, each benchmark ran at around 48 RPS with ~0.28 ms average latency, and server CPU utilization dropped to just 12%.

That is roughly a 100% gain in RPS, about 60x lower latency, and around a 5x drop in CPU usage.

In the images below, placeholderkv-s is the server and the placeholderkv-b processes are the benchmarks.

WORST CASE scenario unstable - 5 Benchmarks:

CPU utilization: [screenshot] · Benchmarks: [screenshot]

WORST CASE scenario PR - 5 Benchmarks:

CPU utilization: [screenshot] · Benchmarks: [screenshot]

CONCLUSION:

  1. BEST CASE: 76% gain in throughput and a 2x drop in average latency.
  2. WORST CASE: ~100% gain in throughput, around a 60x drop in average latency, and a 5x drop in CPU usage.

roshkhatri avatar Mar 27 '24 20:03 roshkhatri

Added Sign-off to the PR

roshkhatri avatar Mar 27 '24 20:03 roshkhatri

@roshkhatri There is a memory sanitization error. Looks like we're leaking some memory now.

madolson avatar May 10 '24 21:05 madolson

Codecov Report

Attention: Patch coverage is 90.21739%, with 9 lines in your changes missing coverage. Please review.

Project coverage is 69.83%. Comparing base (72f2a87) to head (6c97434). Report is 4 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable      #53      +/-   ##
============================================
+ Coverage     69.67%   69.83%   +0.16%     
============================================
  Files           109      109              
  Lines         61801    61876      +75     
============================================
+ Hits          43062    43214     +152     
+ Misses        18739    18662      -77     
Files                 Coverage Δ
src/config.c          77.87% <100.00%> (+0.05%) ↑
src/connection.h      93.58% <ø> (ø)
src/networking.c      85.22% <100.00%> (+0.27%) ↑
src/cluster_legacy.c  86.17% <90.90%> (-0.07%) ↓
src/cluster.c         85.85% <82.35%> (-0.43%) ↓

... and 10 files with indirect coverage changes

codecov[bot] avatar May 10 '24 22:05 codecov[bot]

I think the one open question left is @PingXie's concern about using the original client vs. a new caching client, so I would like to close that before merging.

@roshkhatri had tried the original-client approach first. The change becomes a bit complex because we have to track both the starting clientReplyBlock and the start offset within it. With the new caching client, all of the offset bookkeeping becomes unnecessary, since its buffer is guaranteed to be empty.

hpatro avatar May 13 '24 14:05 hpatro

had tried the original-client approach first. The change becomes a bit complex because we have to track both the starting clientReplyBlock and the start offset within it. With the new caching client, all of the offset bookkeeping becomes unnecessary, since its buffer is guaranteed to be empty.

I would like to see more information about this. I'm still not really convinced this is true.

madolson avatar May 13 '24 17:05 madolson

We add a dummy node using addReplyDeferredLen where the command reply starts. However, after we finish adding the command reply to the reply list and want to fill in its length, we optimize using setDeferredReply (https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L762), which makes it complex to maintain the start node and the address where the command output starts. The implementation using a caching client is cleaner and reusable, without having to touch a lot of core networking code.
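(To make the point concrete, here is a rough sketch of the caching-client idea, with hypothetical helper names; it is not the PR's actual code. Because the dedicated client starts with an empty output buffer, the cached reply is simply everything that client accumulated, with no start-node/offset tracking.)

#include "server.h"

/* Hypothetical sketch: capture the CLUSTER SLOTS reply via a dedicated,
 * connection-less client and return it as one serialized blob. */
sds generateClusterSlotsCache(void) {
    client *recording = createClient(NULL);        /* fake client, no connection */
    addClusterSlotsReply(recording);               /* hypothetical: writes the reply into it */
    /* The fake client started empty, so its static buffer plus reply list
     * is exactly the serialized CLUSTER SLOTS response. */
    sds cached = copyClientOutputToSds(recording); /* hypothetical helper */
    freeClient(recording);
    return cached;
}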

roshkhatri avatar May 13 '24 18:05 roshkhatri

However, after we finish adding the command reply to the reply list and want to fill in its length, we optimize using setDeferredReply (https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L762), which makes it complex to maintain the start node and the address where the command output starts.

I don't think this is where the complexity is stemming from. My original concern was that there are a lot of other edge cases around the client that could corrupt the output, such as disabling the client reply or hitting the CoB (client output buffer) limits. Having a dedicated client side-steps a lot of this complexity, since we control that secondary client.

madolson avatar May 13 '24 18:05 madolson

However, after we finish adding the command reply to the reply list and want to fill in its length, we optimize using setDeferredReply (https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L762), which makes it complex to maintain the start node and the address where the command output starts.

I don't think this is where the complexity is stemming from. My original concern was that there are a lot of other edge cases around the client that could corrupt the output, such as disabling the client reply or hitting the CoB (client output buffer) limits. Having a dedicated client side-steps a lot of this complexity, since we control that secondary client.

Yes, that's also one of the points we had discussed internally 👍. I'm aligned with the current approach; if there is no strong concern, I would like to merge this in.

@PingXie @madolson

hpatro avatar May 13 '24 19:05 hpatro

had tried the original-client approach first. The change becomes a bit complex because we have to track both the starting clientReplyBlock and the start offset within it. With the new caching client, all of the offset bookkeeping becomes unnecessary, since its buffer is guaranteed to be empty.

I would like to see more information about this. I'm still not really convinced this is true.

@roshkhatri should be able to share the diff.

hpatro avatar May 13 '24 20:05 hpatro

I think the one open question left is @PingXie's concern about using the original client vs. a new caching client, so I would like to close that before merging.

Have we considered using a dedicated client like the fake AOF client? That would still give you the "clean" slate but reduce the work associated with creating/deleting the client. Not that I am concerned about the performance, but I just wanted to see if there is any option to do less.
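(A rough sketch of this reuse idea, again with hypothetical names, just to make the trade-off concrete: one global fake client created lazily and reset before each use, instead of a client created and freed per cache refresh.)

#include "server.h"

static client *caching_client = NULL;

/* Return the shared fake client used for building cached replies. */
client *getCachingClient(void) {
    if (caching_client == NULL)
        caching_client = createClient(NULL);    /* connection-less fake client */
    /* The catch: every piece of state (reply buffers, flags, etc.) must be
     * reset here, otherwise it leaks into the next use. */
    resetCachingClientState(caching_client);    /* hypothetical reset helper */
    return caching_client;
}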

PingXie avatar May 14 '24 05:05 PingXie

However, after we finish adding the command reply to the reply list and want to fill in its length, we optimize using setDeferredReply (https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L762), which makes it complex to maintain the start node and the address where the command output starts.

I don't think this is where the complexity is stemming from. My original concern was that there are a lot of other edge cases around the client that could corrupt the output, such as disabling the client reply or hitting the CoB (client output buffer) limits. Having a dedicated client side-steps a lot of this complexity, since we control that secondary client.

Yes, that's also one of the points we had discussed internally 👍. I'm aligned with the current approach; if there is no strong concern, I would like to merge this in.

@PingXie @madolson

This one-off client is not a blocker for me, though I still prefer a global fake client. I think I still have a few comments that have not been addressed. @roshkhatri, can you please take a look?

PingXie avatar May 14 '24 05:05 PingXie

Have we considered using a dedicated client like the fake AOF client? That would still give you the "clean" slate but reduce the work associated with creating/deleting the client. Not that I am concerned about the performance, but I just wanted to see if there is any option to do less.

I liked the clean slate because we've had historical issues with the AOF client not properly resetting its state, which caused weird bugs. I don't feel strongly here, though.

madolson avatar May 14 '24 17:05 madolson