ratelimit
ratelimit copied to clipboard
Switch default protocol to v3
Rational In order to support newer features of the ratelimit protocol (like rate limit overrides), we’d need the service to use v3 protocol instead of v2 protocol. Thus, we propose to upgrade the ratelimiter core to v3 protocol. Since there doesn’t seem to be any real changes between v2 and v3, this seems to be as easy as replacing the imports and referencing newer versions in the data plane.
However, to support complete compatibility with v2, we’d need to have a legacy layer, just like it’s done for v1 now. Supporting 3 versions of the protocol with 2 legacies seems like an overkill, so we propose to simultaneously drop support for v1 ratelimit.proto. Seems like it should be ok given that the release schedule proposed doing so in Q32018 [2]
Proposal Thus, the concrete proposal: Protocol v1 support is dropped Protocol v2 becomes legacy and converted to v3. Given that the protocols are only different in message names, this should have almost no performance penalty The service core is updated to use protocol v3
We intend to contribute all the code. We are also planning to use v3 protocol in production in our own fork in the beginning, thus we can provide some battle testing. What do you think?
+1, awesome, thanks.
Resolved with #11595
I have realized a few followups are needed: #155
@Pchelolo after testing the latest master with these changes I noticed a big performance degradation with V2 API. Under light load the p999 latency went from 15ms => 22ms which is not ideal as we try to keep p999 under 20ms to avoid as many fail-open rate limit requests.
I did a CPU profiling with pprof and suspect it's due to more time being spent in the converting of v2 request to v3: https://github.com/envoyproxy/ratelimit/blob/master/src/service/ratelimit_legacy.go#L67-L93
Is there a better way we can convert the request? As v2 is to be officially deprecated at end of this year, moving to v3 is good (and we plan to move our Envoy configs to v3 as well) but I feel there are many orgs still on v2.
@ysawa0 thank you so much for testing. I have created #155 as a followup to this, which switches converting v2->v3 to be much more explicit. Originally that was done because one of the fields was renamed, thus serializing/unserializing doesn't work any more, it wasn't covered by tests, thus forgotten. But I feel it could be beneficial for performance as a side effect.
If you don't mind testing that version, given that you have perf tests setup, it would be great. If you don't have time for that, please let me know, I'll set myself up for performance test and see if #155 mitigates this problem.
A performance degradation is expected, since we now need to do the conversion and no matter how the conversion is done, it will cost something, but 7ms increase indeed seems too high of a price to pay. Perhaps if we don't come up with a solution, #153 should be reverted and only reapplied after v2 was deprecated?
Thanks @Pchelolo I tried out the PR and that looks better in pprof. Let me do more long-term tests and get back to you with a definitive answer!
@ysawa0 thank you!
#155 did indeed fix a lot of the issues but after testing ratelimit before the v2/v3 api code was even merged at all, I still see performance degradation. I compared two builds of ratelimit, one before #137 was merged and one after. I suspect that this Redis code update, while it improves latency for calls to Redis noticeably, it is causing the DoLimit function to spend more cpu time while processing a request. Sorry to pull you in but @caitong93 do you have any insight into this?
image of p999/p99 latency bumping up when I switch versions
@ysawa0
https://github.com/envoyproxy/ratelimit/pull/137 make redis client do pipelining more efficiently to improve throughput (by buffering as much as possible commands from concurrent requests in a time window), so it is possible to introduce more latency.
Please try to set REDIS_PIPELINE_WINDOW
and REDIS_PIPELINE_LIMIT
to values smaller than defaults https://github.com/envoyproxy/ratelimit/pull/137/files#diff-04c6e90faac2675aa89e2176d2eec7d8R419
it is causing the DoLimit function to spend more cpu time while processing a request
I will try to investigate this.
@caitong93 I tested with
- name: REDIS_PIPELINE_WINDOW
value: "0"
- name: REDIS_PIPELINE_LIMIT
value: "0"
this would turn the feature completely off, correct?
this would turn the feature completely off, correct?
Yes, in this case, redis commands will be sent immediately
Apologizes for delay in this update. Been trying to find the time to do this test the right way. Here's what I saw after trying 3 builds. For Redis, I let an Envoy sidecar do the routing via the Redis filter and turned off pipelining.
All reported latencies are the p999 under ~2k rps.
With latest (after #158 merge, commit d7d563a) 20ms latency. After #137 merge (commit afda91c) 22ms latency With commit 1657473, right before 137, 16.5ms latency.
Not sure if anyone else is having the same experience as us but that's what I recorded. Of course I could be configuring something incorrectly or it's due to some uniqueness with our system. In that case, a second set of eyes would be great! I'll do some profiling to see if there's any low hanging fruit to optimize. I think one thing could be replacing logrus with something more performant like zap.
Comparing benchmark stats for no pipeline case (by run https://github.com/envoyproxy/ratelimit/blob/master/test/redis/bench_test.go#L68), I do observe cpu performance degradation.
benchstat /tmp/ratelimit-before-1657473a.stat /tmp/ratelimit-master.stat
name old time/op new time/op delta
ParallelDoLimit/no_pipeline-4 12.4µs ± 0% 16.1µs ± 0% ~ (p=1.000 n=1+1)
name old alloc/op new alloc/op delta
ParallelDoLimit/no_pipeline-4 624B ± 0% 302B ± 0% ~ (p=1.000 n=1+1)
name old allocs/op new allocs/op delta
ParallelDoLimit/no_pipeline-4 29.0 ± 0% 16.0 ± 0% ~ (p=1.000 n=1+1)
Actually before #137 , the client use explicitly pipelining, this may lead to different main reasons, will try to send a PR to optimize in this week
@ysawa0 can you try this to see if it can fix the problem https://github.com/envoyproxy/ratelimit/pull/163
open a new issue to track this https://github.com/envoyproxy/ratelimit/issues/164
@caitong93 Beautiful. I've run #163 vs pre #137 and they are reporting basically same p999 latency and timeouts! Thank you!