envoy
Shared, global RLQS client & buckets cache
Commit Message: Currently the RLQS client & bucket cache in use by the rate_limit_quota filter are per-thread. This gives each client visibility into only a small slice of the total traffic seen by the Envoy instance and multiplicatively increases the number of concurrent, managed streams to the RLQS backend.
This PR will merge the bucket caches into a single, shared map that is thread-safe to access and shared via TLS (thread-local storage). Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).
Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.
Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.
Risk Level: Medium
Testing:
- New unit testing of both global & local client objects
- New unit testing of filter logic
- Updates to existing config unit testing
- New integration testing for all of the moving parts.
Hi @bsurber, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
@bsurber could you resolve the merge conflict please - i think that is what is preventing ci from working
/assign @tyxia
@bsurber please fix the code format. You can run `bazel run //tools/code_format:check_format -- fix`,
or apply this diff: https://dev.azure.com/cncf/envoy/_build/results?buildId=169874&view=artifacts&pathAsName=false&type=publishedArtifacts
/wait
Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared token bucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local storage.
Instead, my main concern to test is the added load on the main thread, which has to do write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments then writing them into the source-of-truth + cache.
Looks like this needs more test coverage, and also a merge. /wait
Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off, I need to remove some defensive coding to follow Envoy style standards).
/wait (for CI)
Just a drive-by comment: this is a huge PR. Will it be possible to break it down into smaller PRs that can be better reviewed? One high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.
@tyxia PTAL?
@bsurber What is our current strategy/status for load testing (which I think is the determining factor for this PR)?
Let's sync internally on this.
/wait-any
> Just a drive-by comment: this is a huge PR. Will it be possible to break it down into smaller PRs that can be better reviewed? One high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.
I did aim to start with a smaller refactor, but every intermediate state left the code progressively dirtier. This was mostly because the fundamental quota bucket type had to change, and the existing client class structures do not fit cleanly into a shared-data + worker-data design. So rather than produce a series of intentionally confusing intermediate changes by trying to reuse the existing structures, I scrapped the majority of what was there and started fresh.
/wait-any
i think this is waiting for our internal load testing.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This is falling out of sync as other work is prioritized, but will be caught up SoonTM
/wait Needs main merge
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
The branch has been synced and all missing features implemented, namely action expiration & fallback, and abandon-action processing.
/retest
/retest
PR review reminder @yanavlasov
/wait
Just FYI, I am reviewing this PR, but it is a fairly large change that will take some time.
We have discussed and agreed on the high-level direction of this PR internally as a potential option. The integration test (like UG verification) and load test will be good signals to have for merging.
Updated to conform to the RLQS spec by having the global client send an immediate usage report when each bucket is hit for the first time, notifying the backend to send any assignments for that bucket that may be relevant before the next usage reporting cycle (e.g. if the reporting interval is on the scale of minutes).
/wait
Waiting for internal tests