perf: improve user specified flush delay automatically by statistics
Choose flush delay automatically by statistics from previous flush histories.
It generally achieves better throughput than just using MaxFlushDelay provided by users directly while still keeping reasonable CPU usage.
Benchmark comparison to the old v0.0.89 by using the code https://github.com/FZambia/pipelines:
PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p1.txt
▶ benchstat old-100us-p1.txt new-100us-p1.txt
name old time/op new time/op delta
Rueidis-10 12.0µs ± 0% 3.8µs ± 2% -68.47% (p=0.000 n=20+20)
name old alloc/op new alloc/op delta
Rueidis-10 80.0B ± 0% 80.0B ± 0% ~ (all equal)
name old allocs/op new allocs/op delta
Rueidis-10 1.00 ± 0% 1.00 ± 0% ~ (all equal)
PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p128.txt
▶ benchstat old-100us-p128.txt new-100us-p128.txt
name old time/op new time/op delta
Rueidis-10 601ns ± 2% 589ns ± 2% -1.89% (p=0.000 n=20+19)
name old alloc/op new alloc/op delta
Rueidis-10 80.0B ± 0% 80.0B ± 0% ~ (all equal)
name old allocs/op new allocs/op delta
Rueidis-10 1.00 ± 0% 1.00 ± 0% ~ (all equal)
PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p1.txt
▶ benchstat old-20us-p1.txt new-20us-p1.txt
name old time/op new time/op delta
Rueidis-10 4.43µs ± 1% 3.80µs ± 2% -14.38% (p=0.000 n=20+20)
name old alloc/op new alloc/op delta
Rueidis-10 80.0B ± 0% 80.0B ± 0% ~ (all equal)
name old allocs/op new allocs/op delta
Rueidis-10 1.00 ± 0% 1.00 ± 0% ~ (all equal)
PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p128.txt
▶ benchstat old-20us-p128.txt new-20us-p128.txt
name old time/op new time/op delta
Rueidis-10 586ns ± 3% 587ns ± 3% ~ (p=0.622 n=20+19)
name old alloc/op new alloc/op delta
Rueidis-10 80.0B ± 0% 80.0B ± 0% ~ (all equal)
name old allocs/op new allocs/op delta
Rueidis-10 1.00 ± 0% 1.00 ± 0% ~ (all equal)
Codecov Report
Base: 96.19% // Head: 96.17% // Decreases project coverage by -0.01% :warning:
Coverage data is based on head (
ba3adb3) compared to base (28cad24). Patch coverage: 73.07% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## main #159 +/- ##
==========================================
- Coverage 96.19% 96.17% -0.02%
==========================================
Files 36 36
Lines 25382 25401 +19
==========================================
+ Hits 24415 24429 +14
- Misses 846 851 +5
Partials 121 121
| Impacted Files | Coverage Δ | |
|---|---|---|
| pipe.go | 99.13% <73.07%> (-0.62%) |
:arrow_down: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
related to https://github.com/rueian/rueidis/issues/156
Hello @rueian - this looks interesting, some initial thoughts after looking at PR (have not even run it yet):
- Is it possible to describe the logic behind this formula?
- I was thinking on a topic a bit – to be fair I am a bit nervous at the moment to enable non-deterministic connection behaviour with auto adjustment –
reasonableCPU is a bit subjective matter, so I think automatic adjustment should be optional
Also, I suppose this should be tested with read-intensive commands – in https://github.com/FZambia/pipelines I only tried PUBLISH command (which has a very small response), but probably if client reads a lot from Redis then more smooth write flushes of read-intensive commands may be preferred for CPU - but just a shot in the dark.
Another thought: let's say we have a bigger latency, and set some delay we are comfortable with in the app. As far as I understand this auto adjustment does not take latency into account - it may be that there is no much sense to reduce delay automatically given large RTT time (since it may result in larger CPU on Redis at the end).
BTW, I experimented locally with https://github.com/Shopify/toxiproxy yesterday to add some latency to Redis. But I am on MacOS, for Linux tc may be more obvious choice - https://daniel.haxx.se/blog/2010/12/14/add-latency-to-localhost/.
I tried to experiment a bit. Still - the more I think about this the more I tend to think that automatic adjustment does a bit separate thing: it does not reduce CPU, but goes towards throughput. For example, for parallelism 1 and delay 100 microseconds I got similar results as you, but CPU usage Increased 3x on both app and Redis in case of using automatic adjustment. But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.
This may be a good thing in general, but it feels a completely different mode than we had before with MaxFlushDelay. The point of adding delay in my case was to reduce CPU, but still have acceptable throughput. As application nodes/instances usually scale faster than Redis.
And I still doubt whether Rueidis really needs complex behaviour like this - probably better to drop the idea at all otherwise we risk to get pretty unmanageable behavior. Really worry about the fact that adjustments in runtime may be too tricky to maintain and there are too many corner cases, types/size of data, latency involved.
Maybe I am too conservative in that – but just prefer keeping things as simple as possible.
reasonableCPU is a bit subjective matte
Indeed. I use the word reasonable because the CPU usage of this approach is similar to redigo's.
Is it possible to describe the logic behind this formula?
Let's start with the question:
How long should we wait for more commands?
Sure, we could just sleep for MaxFlushDelay specified by the user and have the lowest CPU usage. But it could easily wait too long and result in very poor throughput.
For example, sleeping for 100 microseconds reduced the throughput by ~290%.
▶ benchstat ori-0us-p1.txt old-100us-p1.txt
name old time/op new time/op delta
Rueidis-10 3.08µs ± 0% 11.99µs ± 0% +289.49% (p=0.000 n=19+20)
To avoid sleeping too long, I would like to make this sleep including the successive flush similar to previous flushes. In other words, I want time(sleep) + time(flush) == avg time(previous flushes).
Assuming the time(flush) is proportional to how many bytes it flushes, then I can have an average flush time of a byte donated as byteWork.
I can also have an approximated average receiving time of a byte, by tracking durations between each explicit flush, donated as byteWait.
Therefore, the original question can be approximated to How many bytes I should wait for to make them take a similar time to average flushes donated as nByte = avg time(previous flushes) / (byteWait + byteWork)
Then the sleeping duration can be nByte * byteWait.
But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.
Sure, but there are many cases that their traffic pattern can't be fit to a fixed MaxFlushDelay. Adaptive approach may still be good to have.
The point of adding delay in my case was to reduce CPU, but still have acceptable throughput.
We probably can also have a mode allowing user to set acceptable throughput then trying to reduce CPU usage.
Maybe I am too conservative in that – but just prefer keeping things as simple as possible.
Don't worry. This won't be merged until we find a better solution.
Current MaxFlushDelay, while a bit manual, provides a working understandable way to tradeoff throughput/CPU. And it works deterministically. Maybe selected value is not a very good fit for some setup – but possible to tune. I'd say – it's already like acceptable latency (well, not exactly of course, more like acceptable delay but it seems good enough to solve a problem).
I think you implemented some bright ideas here, but I am not sure at this point that I'd use this mode somehow. Probably it's better to wait for more feedback here, maybe from other users.
Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible
Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible
It is a good idea. Probably something like this func(ctx CurrentFlushContext) time.Duration allows users to choose how long they want to wait based on some information.