ratelimit icon indicating copy to clipboard operation
ratelimit copied to clipboard

Switch default protocol to v3

Open Pchelolo opened this issue 4 years ago • 15 comments

Rational In order to support newer features of the ratelimit protocol (like rate limit overrides), we’d need the service to use v3 protocol instead of v2 protocol. Thus, we propose to upgrade the ratelimiter core to v3 protocol. Since there doesn’t seem to be any real changes between v2 and v3, this seems to be as easy as replacing the imports and referencing newer versions in the data plane.

However, to support complete compatibility with v2, we’d need to have a legacy layer, just like it’s done for v1 now. Supporting 3 versions of the protocol with 2 legacies seems like an overkill, so we propose to simultaneously drop support for v1 ratelimit.proto. Seems like it should be ok given that the release schedule proposed doing so in Q32018 [2]

Proposal Thus, the concrete proposal: Protocol v1 support is dropped Protocol v2 becomes legacy and converted to v3. Given that the protocols are only different in message names, this should have almost no performance penalty The service core is updated to use protocol v3

We intend to contribute all the code. We are also planning to use v3 protocol in production in our own fork in the beginning, thus we can provide some battle testing. What do you think?

Pchelolo avatar Jun 30 '20 14:06 Pchelolo

+1, awesome, thanks.

mattklein123 avatar Jun 30 '20 15:06 mattklein123

Resolved with #11595

Pchelolo avatar Jul 01 '20 18:07 Pchelolo

I have realized a few followups are needed: #155

Pchelolo avatar Jul 02 '20 14:07 Pchelolo

@Pchelolo after testing the latest master with these changes I noticed a big performance degradation with V2 API. Under light load the p999 latency went from 15ms => 22ms which is not ideal as we try to keep p999 under 20ms to avoid as many fail-open rate limit requests.

I did a CPU profiling with pprof and suspect it's due to more time being spent in the converting of v2 request to v3: https://github.com/envoyproxy/ratelimit/blob/master/src/service/ratelimit_legacy.go#L67-L93

Is there a better way we can convert the request? As v2 is to be officially deprecated at end of this year, moving to v3 is good (and we plan to move our Envoy configs to v3 as well) but I feel there are many orgs still on v2.

ysawa0 avatar Jul 10 '20 15:07 ysawa0

@ysawa0 thank you so much for testing. I have created #155 as a followup to this, which switches converting v2->v3 to be much more explicit. Originally that was done because one of the fields was renamed, thus serializing/unserializing doesn't work any more, it wasn't covered by tests, thus forgotten. But I feel it could be beneficial for performance as a side effect.

If you don't mind testing that version, given that you have perf tests setup, it would be great. If you don't have time for that, please let me know, I'll set myself up for performance test and see if #155 mitigates this problem.

A performance degradation is expected, since we now need to do the conversion and no matter how the conversion is done, it will cost something, but 7ms increase indeed seems too high of a price to pay. Perhaps if we don't come up with a solution, #153 should be reverted and only reapplied after v2 was deprecated?

Pchelolo avatar Jul 10 '20 16:07 Pchelolo

Thanks @Pchelolo I tried out the PR and that looks better in pprof. Let me do more long-term tests and get back to you with a definitive answer!

ysawa0 avatar Jul 13 '20 16:07 ysawa0

@ysawa0 thank you!

Pchelolo avatar Jul 13 '20 16:07 Pchelolo

#155 did indeed fix a lot of the issues but after testing ratelimit before the v2/v3 api code was even merged at all, I still see performance degradation. I compared two builds of ratelimit, one before #137 was merged and one after. I suspect that this Redis code update, while it improves latency for calls to Redis noticeably, it is causing the DoLimit function to spend more cpu time while processing a request. Sorry to pull you in but @caitong93 do you have any insight into this?

image of p999/p99 latency bumping up when I switch versions image

ysawa0 avatar Jul 15 '20 23:07 ysawa0

@ysawa0

https://github.com/envoyproxy/ratelimit/pull/137 make redis client do pipelining more efficiently to improve throughput (by buffering as much as possible commands from concurrent requests in a time window), so it is possible to introduce more latency.

Please try to set REDIS_PIPELINE_WINDOW and REDIS_PIPELINE_LIMIT to values smaller than defaults https://github.com/envoyproxy/ratelimit/pull/137/files#diff-04c6e90faac2675aa89e2176d2eec7d8R419

it is causing the DoLimit function to spend more cpu time while processing a request

I will try to investigate this.

caitong93 avatar Jul 18 '20 16:07 caitong93

@caitong93 I tested with

        - name: REDIS_PIPELINE_WINDOW
          value: "0"
        - name: REDIS_PIPELINE_LIMIT
          value: "0"

this would turn the feature completely off, correct?

ysawa0 avatar Jul 23 '20 00:07 ysawa0

this would turn the feature completely off, correct?

Yes, in this case, redis commands will be sent immediately

caitong93 avatar Jul 23 '20 02:07 caitong93

Apologizes for delay in this update. Been trying to find the time to do this test the right way. Here's what I saw after trying 3 builds. For Redis, I let an Envoy sidecar do the routing via the Redis filter and turned off pipelining.

All reported latencies are the p999 under ~2k rps.

With latest (after #158 merge, commit d7d563a) 20ms latency. After #137 merge (commit afda91c) 22ms latency With commit 1657473, right before 137, 16.5ms latency.

Not sure if anyone else is having the same experience as us but that's what I recorded. Of course I could be configuring something incorrectly or it's due to some uniqueness with our system. In that case, a second set of eyes would be great! I'll do some profiling to see if there's any low hanging fruit to optimize. I think one thing could be replacing logrus with something more performant like zap.

ysawa0 avatar Aug 03 '20 19:08 ysawa0

Comparing benchmark stats for no pipeline case (by run https://github.com/envoyproxy/ratelimit/blob/master/test/redis/bench_test.go#L68), I do observe cpu performance degradation.

benchstat /tmp/ratelimit-before-1657473a.stat  /tmp/ratelimit-master.stat 
name                           old time/op    new time/op    delta
ParallelDoLimit/no_pipeline-4    12.4µs ± 0%    16.1µs ± 0%   ~     (p=1.000 n=1+1)

name                           old alloc/op   new alloc/op   delta
ParallelDoLimit/no_pipeline-4      624B ± 0%      302B ± 0%   ~     (p=1.000 n=1+1)

name                           old allocs/op  new allocs/op  delta
ParallelDoLimit/no_pipeline-4      29.0 ± 0%      16.0 ± 0%   ~     (p=1.000 n=1+1)

Actually before #137 , the client use explicitly pipelining, this may lead to different main reasons, will try to send a PR to optimize in this week

caitong93 avatar Aug 05 '20 14:08 caitong93

@ysawa0 can you try this to see if it can fix the problem https://github.com/envoyproxy/ratelimit/pull/163

open a new issue to track this https://github.com/envoyproxy/ratelimit/issues/164

caitong93 avatar Aug 07 '20 06:08 caitong93

@caitong93 Beautiful. I've run #163 vs pre #137 and they are reporting basically same p999 latency and timeouts! Thank you!

ysawa0 avatar Aug 07 '20 18:08 ysawa0