envoy
Shared, global RLQS client & buckets cache
Commit Message: Currently the RLQS client & bucket cache in use by the rate_limit_quota filter are per-thread. This gives each client visibility into only a small slice of the total traffic seen by the Envoy instance and multiplicatively increases the number of concurrent, managed streams to the RLQS backend.
This PR will merge the bucket caches into a single, shared map that is thread-safe to access and shared via TLS (thread-local storage). Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).
Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.
Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.
Risk Level: Medium
Testing:
- New unit testing of both global & local client objects
- New unit testing of filter logic
- Updates to existing config unit testing
- New integration testing for all of the moving parts.
Hi @bsurber, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
@bsurber could you resolve the merge conflict please - i think that is what is preventing ci from working
/assign @tyxia
@bsurber please fix the code format. You can run `bazel run //tools/code_format:check_format -- fix`,
or apply this diff: https://dev.azure.com/cncf/envoy/_build/results?buildId=169874&view=artifacts&pathAsName=false&type=publishedArtifacts
/wait
Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared token bucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local storage.
Instead, my main concern to test is the added load on the main thread, which has to do write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments then writing them into the source-of-truth + cache.
Looks like this needs more test coverage, and also a merge. /wait
Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off, I need to remove some defensive coding to follow Envoy style standards).
/wait (for CI)
Just a drive-by comment: this is a huge PR. Will it be possible to break it down into smaller PRs that can be better reviewed? One high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.
@tyxia PTAL?
@bsurber What is our current strategy/status for load testing (which I think is the determining factor for this PR)?
Let's sync internally on this.
/wait-any
> Just a drive-by comment: this is a huge PR. Will it be possible to break it down into smaller PRs that can be better reviewed? One high-level thing is that there seems to be a large refactor happening in this PR. Maybe it's possible to start with a PR that just does the refactoring (no change to the current behavior), and gradually add PR(s) that modify/extend the functionality.
I did aim to start with a smaller refactor, but every intermediate state left the code progressively dirtier. This was mostly because the fundamental quota bucket type had to change, and the existing client class structures do not fit cleanly into a shared-data + worker-data design. So rather than produce a series of intentionally confusing intermediate changes by trying to reuse the existing structures, I scrapped the majority of what was there and started fresh.
/wait-any
i think this is waiting for our internal load testing.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This is falling out of sync as other work is prioritized, but will be caught up SoonTM
/wait Needs main merge
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
The branch has been synced and all missing features implemented, namely action expiration & fallback, and abandon-action processing.
/retest
/retest
PR review reminder @yanavlasov
/wait
Just FYI, I am reviewing this PR, but it is a fairly large change that will take some time.
We have discussed and agreed on the high-level direction of this PR internally as a potential option. The integration test (like UG verification) and load test will be good signals to have for merging.
Updated to conform to the RLQS spec by having the global client send an immediate usage report when each bucket is hit for the first time, notifying the backend to send any assignments for that bucket that may be relevant before the next usage reporting cycle (e.g. if the reporting interval is on the scale of minutes).
/wait
Waiting for internal tests