rueidis icon indicating copy to clipboard operation
rueidis copied to clipboard

perf: improve user specified flush delay automatically by statistics

Open rueian opened this issue 3 years ago • 11 comments

Choose flush delay automatically by statistics from previous flush histories.

It generally achieves better throughput than just using MaxFlushDelay provided by users directly while still keeping reasonable CPU usage.


Benchmark comparison to the old v0.0.89 by using the code https://github.com/FZambia/pipelines:

PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100

▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p1.txt
▶ benchstat old-100us-p1.txt new-100us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    12.0µs ± 0%     3.8µs ± 2%  -68.47%  (p=0.000 n=20+20)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%     ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100

▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p128.txt
▶ benchstat old-100us-p128.txt new-100us-p128.txt
name        old time/op    new time/op    delta
Rueidis-10     601ns ± 2%     589ns ± 2%  -1.89%  (p=0.000 n=20+19)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%    ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%    ~     (all equal)

PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20

▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p1.txt
▶ benchstat old-20us-p1.txt new-20us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    4.43µs ± 1%    3.80µs ± 2%  -14.38%  (p=0.000 n=20+20)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%     ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20

▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p128.txt
▶ benchstat old-20us-p128.txt new-20us-p128.txt
name        old time/op    new time/op    delta
Rueidis-10     586ns ± 3%     587ns ± 3%   ~     (p=0.622 n=20+19)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%   ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%   ~     (all equal)

rueian avatar Dec 14 '22 15:12 rueian

Codecov Report

Base: 96.19% // Head: 96.17% // Decreases project coverage by -0.01% :warning:

Coverage data is based on head (ba3adb3) compared to base (28cad24). Patch coverage: 73.07% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #159      +/-   ##
==========================================
- Coverage   96.19%   96.17%   -0.02%     
==========================================
  Files          36       36              
  Lines       25382    25401      +19     
==========================================
+ Hits        24415    24429      +14     
- Misses        846      851       +5     
  Partials      121      121              
Impacted Files Coverage Δ
pipe.go 99.13% <73.07%> (-0.62%) :arrow_down:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Dec 14 '22 15:12 codecov[bot]

related to https://github.com/rueian/rueidis/issues/156

rueian avatar Dec 14 '22 16:12 rueian

Hello @rueian - this looks interesting, some initial thoughts after looking at PR (have not even run it yet):

  • Is it possible to describe the logic behind this formula?
  • I was thinking on a topic a bit – to be fair I am a bit nervous at the moment to enable non-deterministic connection behaviour with auto adjustment – reasonable CPU is a bit subjective matter, so I think automatic adjustment should be optional

FZambia avatar Dec 14 '22 16:12 FZambia

Also, I suppose this should be tested with read-intensive commands – in https://github.com/FZambia/pipelines I only tried PUBLISH command (which has a very small response), but probably if client reads a lot from Redis then more smooth write flushes of read-intensive commands may be preferred for CPU - but just a shot in the dark.

FZambia avatar Dec 14 '22 17:12 FZambia

Another thought: let's say we have a bigger latency, and set some delay we are comfortable with in the app. As far as I understand this auto adjustment does not take latency into account - it may be that there is no much sense to reduce delay automatically given large RTT time (since it may result in larger CPU on Redis at the end).

BTW, I experimented locally with https://github.com/Shopify/toxiproxy yesterday to add some latency to Redis. But I am on MacOS, for Linux tc may be more obvious choice - https://daniel.haxx.se/blog/2010/12/14/add-latency-to-localhost/.

FZambia avatar Dec 14 '22 17:12 FZambia

I tried to experiment a bit. Still - the more I think about this the more I tend to think that automatic adjustment does a bit separate thing: it does not reduce CPU, but goes towards throughput. For example, for parallelism 1 and delay 100 microseconds I got similar results as you, but CPU usage Increased 3x on both app and Redis in case of using automatic adjustment. But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.

This may be a good thing in general, but it feels a completely different mode than we had before with MaxFlushDelay. The point of adding delay in my case was to reduce CPU, but still have acceptable throughput. As application nodes/instances usually scale faster than Redis.

And I still doubt whether Rueidis really needs complex behaviour like this - probably better to drop the idea at all otherwise we risk to get pretty unmanageable behavior. Really worry about the fact that adjustments in runtime may be too tricky to maintain and there are too many corner cases, types/size of data, latency involved.

Maybe I am too conservative in that – but just prefer keeping things as simple as possible.

FZambia avatar Dec 15 '22 15:12 FZambia

reasonable CPU is a bit subjective matte

Indeed. I use the word reasonable because the CPU usage of this approach is similar to redigo's.

Is it possible to describe the logic behind this formula?

Let's start with the question:

How long should we wait for more commands?

Sure, we could just sleep for MaxFlushDelay specified by the user and have the lowest CPU usage. But it could easily wait too long and result in very poor throughput.

For example, sleeping for 100 microseconds reduced the throughput by ~290%.

▶ benchstat ori-0us-p1.txt old-100us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    3.08µs ± 0%   11.99µs ± 0%  +289.49%  (p=0.000 n=19+20)

To avoid sleeping too long, I would like to make this sleep including the successive flush similar to previous flushes. In other words, I want time(sleep) + time(flush) == avg time(previous flushes).

Assuming the time(flush) is proportional to how many bytes it flushes, then I can have an average flush time of a byte donated as byteWork.

I can also have an approximated average receiving time of a byte, by tracking durations between each explicit flush, donated as byteWait.

Therefore, the original question can be approximated to How many bytes I should wait for to make them take a similar time to average flushes donated as nByte = avg time(previous flushes) / (byteWait + byteWork)

Then the sleeping duration can be nByte * byteWait.

rueian avatar Dec 15 '22 15:12 rueian

But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.

Sure, but there are many cases that their traffic pattern can't be fit to a fixed MaxFlushDelay. Adaptive approach may still be good to have.

The point of adding delay in my case was to reduce CPU, but still have acceptable throughput.

We probably can also have a mode allowing user to set acceptable throughput then trying to reduce CPU usage.

Maybe I am too conservative in that – but just prefer keeping things as simple as possible.

Don't worry. This won't be merged until we find a better solution.

rueian avatar Dec 15 '22 16:12 rueian

Current MaxFlushDelay, while a bit manual, provides a working understandable way to tradeoff throughput/CPU. And it works deterministically. Maybe selected value is not a very good fit for some setup – but possible to tune. I'd say – it's already like acceptable latency (well, not exactly of course, more like acceptable delay but it seems good enough to solve a problem).

I think you implemented some bright ideas here, but I am not sure at this point that I'd use this mode somehow. Probably it's better to wait for more feedback here, maybe from other users.

FZambia avatar Dec 16 '22 19:12 FZambia

Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible

FZambia avatar Dec 17 '22 08:12 FZambia

Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible

It is a good idea. Probably something like this func(ctx CurrentFlushContext) time.Duration allows users to choose how long they want to wait based on some information.

rueian avatar Dec 17 '22 15:12 rueian