LocalRateLimit(HTTP): Add dynamic token bucket support
Commit Message: LocalRateLimit(HTTP): Add dynamic token bucket support
Additional Description: Fixes https://github.com/envoyproxy/envoy/issues/23351 and https://github.com/envoyproxy/envoy/issues/19895
The user configures descriptors in the HTTP local rate limit filter. These descriptors are the "target" used to match traffic; only matched traffic is rate limited. When a request arrives, descriptors are generated at runtime based on the rate_limit configuration, with values picked from the request as directed by that configuration. The generated descriptors are then matched against the "target" (user-configured) descriptors. Generated descriptors are already very flexible, since values can be extracted from the request in a number of ways (dynamic metadata, the matcher API, computed regular expressions, and so on), but the "target" (user-configured) descriptors are rigid: the user is expected to statically configure the values in each descriptor.
This PR adds flexibility by allowing blank values in the user-configured descriptors. A blank value is treated as a wildcard. Suppose a descriptor entry key is client-id and its value is left blank by the user; the local rate limit filter will then create a descriptor dynamically for each unique value of the client-id header. That means client1, client2, and so on each get a dedicated descriptor and token bucket, as sketched below.
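To make the wildcard behavior concrete, here is a minimal, self-contained sketch of the matching idea (illustrative only; the `DescriptorEntry`, `Descriptor`, and `matchAndMaterialize` names are assumptions for this example, not the PR's actual types):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified descriptor types for illustration only.
struct DescriptorEntry {
  std::string key;
  std::string value; // A blank value in the user config means "wildcard".
};
struct Descriptor {
  std::vector<DescriptorEntry> entries;
};

// Returns true and fills `materialized` with a concrete descriptor for this
// request if the user-configured descriptor matches it, treating blank user
// values as wildcards. The materialized descriptor carries the actual request
// values, so each unique client-id (etc.) gets its own token bucket.
bool matchAndMaterialize(const Descriptor& user, const Descriptor& request,
                         Descriptor& materialized) {
  if (user.entries.size() != request.entries.size()) {
    return false;
  }
  materialized.entries.clear();
  for (std::size_t i = 0; i < user.entries.size(); ++i) {
    const DescriptorEntry& u = user.entries[i];
    const DescriptorEntry& r = request.entries[i];
    if (u.key != r.key) {
      return false;
    }
    if (!u.value.empty() && u.value != r.value) {
      return false; // A statically configured value must match exactly.
    }
    // Blank user value: adopt the request's value (wildcard expansion).
    materialized.entries.push_back({r.key, r.value});
  }
  return true;
}
```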
To keep resource consumption bounded, an LRU cache is maintained for dynamic descriptors, with a default size of 20 that is configurable. Whenever a worker thread observes the need for a new dynamic descriptor, it posts a request to the main thread to add it. The main thread maintains the descriptor cache and shares copies of the cache with the worker threads through the thread-local storage mechanism; a rough sketch of the LRU bookkeeping follows.
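For illustration, the cache bookkeeping could look roughly like this single-threaded sketch (the `DynamicDescriptorCache` class and its details are assumptions for this example, not the PR's actual code; in the described design the cache lives on the main thread and copies are published to workers via thread-local storage):

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative placeholder for a per-descriptor token bucket.
struct TokenBucket {};
using TokenBucketPtr = std::shared_ptr<TokenBucket>;

// Simplified LRU cache keyed by the materialized descriptor.
class DynamicDescriptorCache {
public:
  explicit DynamicDescriptorCache(std::size_t max_size) : max_size_(max_size) {}

  TokenBucketPtr getOrCreate(const std::string& descriptor_key) {
    auto it = index_.find(descriptor_key);
    if (it != index_.end()) {
      // Hit: move the entry to the front so active descriptors stay alive.
      lru_.splice(lru_.begin(), lru_, it->second);
      return it->second->second;
    }
    // Miss: evict the least recently used entry if we are at the limit,
    // then create a fresh bucket for the new dynamic descriptor.
    if (!lru_.empty() && lru_.size() >= max_size_) {
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
    lru_.emplace_front(descriptor_key, std::make_shared<TokenBucket>());
    index_[descriptor_key] = lru_.begin();
    return lru_.front().second;
  }

private:
  using Entry = std::pair<std::string, TokenBucketPtr>;
  const std::size_t max_size_;
  std::list<Entry> lru_;
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```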
Docs Changes: TODO
Release Notes: TODO
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
I think we still need a way to limit the overhead and memory of the token buckets. It's unacceptable to let them grow without bound.
Thanks a lot for taking a look. Will add logic to keep it under limits
Thanks for this contribution. Dynamic descriptor support is a very complex problem in the local rate limit filter, considering the various limitations.
I have taken a pass over the current implementation, but before flushing more comments on the implementation details, I will throw out some questions first:
- Should the LRU list be updated when one of the entries is hit in a worker? I think it should, to keep active descriptors alive. But this also introduces more complexity.
- Should different descriptors have independent lists? Or will the descriptor with many more distinct dynamic values occupy the whole list?
- Lifetime management and search logic are much more complex now.
So, I would suggest you take a step back and think about whether there is another way to meet your requirement, such as a new limit filter in your fork or using rate_limit/rate_limit_quota directly.
If you insist on enhancing the local_rate_limit, then I think we should first provide an abstraction (like DynamicDescriptor) to wrap all this complexity (list/memory management, the lifetime problem, cross-worker updates, etc.; a lock may be an option if we can limit it to the new feature), and then integrate that high-level abstraction into the local_rate_limit. Otherwise it is hard for reviewers to ensure the quality of the existing feature.
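For example, a rough sketch of what such a wrapper could look like (an illustrative interface only, under assumed names like `DynamicDescriptorMap` and `getBucket`; not the PR's actual code):

```cpp
#include <cstdint>
#include <memory>
#include <string>

#include "absl/base/thread_annotations.h"
#include "absl/container/flat_hash_map.h"
#include "absl/synchronization/mutex.h"

// Illustrative placeholder for the per-descriptor token bucket state.
struct TokenBucketContext {};
using TokenBucketContextSharedPtr = std::shared_ptr<TokenBucketContext>;

// Wraps the dynamic-descriptor complexity (bounded storage, lifetime,
// cross-worker access) behind one small interface, so the existing
// local_rate_limit code only needs a single call to obtain a bucket.
class DynamicDescriptorMap {
public:
  explicit DynamicDescriptorMap(uint64_t max_descriptors)
      : max_descriptors_(max_descriptors) {}

  // Returns the bucket for the materialized descriptor, creating it if
  // needed. A lock is tolerable here because it is only taken when the new
  // feature (blank descriptor values) is actually configured.
  TokenBucketContextSharedPtr getBucket(const std::string& descriptor_key) {
    absl::MutexLock lock(&mutex_);
    auto it = buckets_.find(descriptor_key);
    if (it != buckets_.end()) {
      return it->second;
    }
    if (buckets_.size() >= max_descriptors_) {
      // Eviction policy (e.g. LRU, as discussed above) elided for brevity.
    }
    auto bucket = std::make_shared<TokenBucketContext>();
    buckets_[descriptor_key] = bucket;
    return bucket;
  }

private:
  const uint64_t max_descriptors_;
  absl::Mutex mutex_;
  absl::flat_hash_map<std::string, TokenBucketContextSharedPtr>
      buckets_ ABSL_GUARDED_BY(mutex_);
};
```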
Thanks again for all your help and contribution. This is not a simple feature. 🌷
Thanks a lot @wbpcode for taking a look. Really appreciate it!!
So, I would suggest you take a step back and think about whether there is another way to meet your requirement, such as a new limit filter in your fork or using rate_limit/rate_limit_quota directly.
From the conversations on the linked issues, it seems there have been repeated requests for this functionality for a long time, so figuring out a solution in the community might benefit a number of users. Yes, rate_limit_quota can address the use case when the user is fine with an external dependency; in some cases, users are keen to avoid an external dependency and are looking for a solution through the local rate limiter path.
If you insist on enhancing the local_rate_limit, then I think we should first provide an abstraction (like DynamicDescriptor) to wrap all this complexity (list/memory management, the lifetime problem, cross-worker updates, etc.; a lock may be an option if we can limit it to the new feature), and then integrate that high-level abstraction into the local_rate_limit.
Sounds good. I will work on adding the abstraction as per your suggestion.
Thanks so much for this update. I think this makes sense. Here are some high-level suggestions (I think we are on the right track, thanks):
- Please add an explicit bool flag to enable this feature, to avoid users mis-using it when they happen to configure an empty descriptor.
- This is a new feature, so you can ignore the timer-based token bucket completely. It would make the implementation simpler.
- You may need to refactor the return type of requestAllowed, because the buckets in a dynamic descriptor may now be destroyed in another thread. One possible shape is sketched below.
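For illustration, one possible shape for that refactor (hedged: the `Result` struct and its fields are assumptions for this sketch, not the PR's actual API) is to return shared ownership of the matched bucket so the caller keeps it alive even if the cache evicts it on another thread:

```cpp
#include <memory>

// Illustrative placeholder for per-descriptor token bucket state.
struct TokenBucketContext {};

// Instead of handing back a bare bool (or a reference into the cache),
// requestAllowed() could return a small struct that shares ownership of the
// matched bucket, so concurrent eviction of the dynamic descriptor on another
// thread cannot leave the caller with a dangling pointer.
struct Result {
  bool allowed{false};
  // Keeps the bucket alive for the duration of the request, even if the
  // dynamic-descriptor cache drops the entry concurrently.
  std::shared_ptr<const TokenBucketContext> bucket_context;
};
```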
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
Hey @wbpcode, apologies for the very long delay in updating the PR. I got pulled into another downstream activity and then the holidays came.
I have addressed your comments and will be working on CI and test coverage. Meanwhile, when you get a chance, can you please take a look?
Hello! Sorry to interrupt you. I am hitting a CI problem in PreCheck as below, and I found that your CI also hits this problem.
Do you have any thoughts on it? I think we can compare notes. I think it can be solved by re-building the Bazel remote cache, but I don't know how to trigger a rebuild of the cache.
Here is my PR: https://github.com/envoyproxy/envoy/pull/36667
I don't know how to do that. Maybe @phlax can help/guide us to handle it better.
I just pushed an empty commit to re-trigger CI and it didn't happen the next time.
git commit -s --allow-empty
+1! @phlax, could you please help us? It would be very much appreciated!
The only reason I can think of right now is that my PR was opened in October '24. Before I merged the latest main code a few days ago, CI ran normally, but after merging the latest main (my own code has basically not changed), the PreCheck part of CI started to report errors. The following is the commit from the latest main that I merged in: https://github.com/envoyproxy/envoy/commit/914297e6c69f4c4a3fe313d5496a628bf795efc5#diff-3cba4d66ffa4e29abbcf8cb94c7489e8451af43fdad6859d4bbcfd17058419e2
Same whisperer error again on CI.
it's possible/likely the cause of the issue is us trying to update llvm - which is currently inherently non-hermetic
rebuilding the cache is not really possible - we could force everything into a ~new cache - but that is highly undesirable, especially given it's one test that is failing (arm/type_whisperer), and it's not currently failing on main afaict
the problem is likely to resolve itself (once the relevant cache gets blown) - will keep monitoring
thanks @phlax for helping us! okay, i will wait for some time until the cache refreshes~
@wbpcode PTAL.
Thanks for this great contribution. I am so happy to see this happen. I think we are on the right track after a quick check. But I prefer to review and land this after https://github.com/envoyproxy/envoy/pull/38197 because it will also simplify this PR.
Sure. As per your previous suggestion, I am not assuming the timer-based token bucket in this PR anyway, so that PR will not break anything here; rather, it will help simplify this one by letting a few checks/validations be removed. So I totally agree on landing #38197 first. Thanks a lot!!
/wait on the other PR. Please merge main when it's ready for another pass, and it'll show back up in the oncall dashboard.
