ads: support envoy filter local ratelimit.
What type of PR is this?
Feature: support local rate limit.
What this PR does / why we need it:
Add support for the Envoy local rate limit filter (TCP connection rate limiting) in ads mode.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Codecov Report
Attention: Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.
Project coverage is 52.17%. Comparing base (1f2940b) to head (6876096). Report is 130 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/controller/ads/cache.go | 0.00% | 4 Missing :warning: |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| pkg/controller/ads/extensions/local_ratelimit.go | 100.00% <100.00%> (ø) | |
| pkg/utils/test/bpf_map.go | 34.04% <100.00%> (-3.11%) | :arrow_down: |
| pkg/controller/ads/cache.go | 47.94% <0.00%> (+0.39%) | :arrow_up: |
... and 69 files with indirect coverage changes
First we need a proposal to better understand this implementation.
Hi, I have created a proposal PR in #873.
@hzxuzhonghu Hi, rate limit uses the same way to cut TCP connections as the circuit breaker, so it can be merged after #570.
lgtm, and your test report (Google doc) can be pasted here to show your test results @yuanqijing
Hi, I have added a Google doc here. This PR might only take effect after #570 is merged; currently, the corresponding log output can be observed in the code. I manually merged the disconnection handling capability from #570 and tested it locally.
@yuanqijing Does it support HTTP rate limit?
No, it only supports TCP connection rate limiting, as described in proposal #873.
@tacslon PTAL
lgtm, please resolve the conflict @yuanqijing
Go test:
bpf_map.go:40: Failed to mount /mnt/kmesh_cgroup2/: device or resource busy
This caused TestServer_getAndSetBpfLevel to fail, which was weird.
@tacslon Conflict addressed. One note: I added bpf.CleanupBpfMap() in InitBpfMap and the tests passed. Should I create another PR to fix it?
func InitBpfMap(t *testing.T, config options.BpfConfig) (CleanupFn, *bpf.BpfLoader) {
+	// Clean up any leftover BPF map state from a previous run before re-initializing.
+	bpf.CleanupBpfMap()
	err := os.MkdirAll("/mnt/kmesh_cgroup2", 0755)
	...
}
No problem for me, ptal @hzxuzhonghu
We also need to delete kmesh_ratelimit whenever a pod is deleted, otherwise it causes a memory leak.
Thanks, you're right.
Should I create a global var (option 1) or modify the NewCache function signature (option 2) like this commit does?
var GlobalBpfAds *BpfAds
var once sync.Once

// NewBpfAds creates or returns the existing BpfAds instance, ensuring that only one instance is created (singleton).
func NewBpfAds(cfg *options.BpfConfig) (*BpfAds, error) {
	var err error
	once.Do(func() {
		GlobalBpfAds = &BpfAds{}
		err = GlobalBpfAds.SockConn.NewBpf(cfg)
		if err != nil {
			GlobalBpfAds = nil
		}
	})
	return GlobalBpfAds, err
}
Option 1: easy to implement with no impact on existing code, but exposing a global variable might introduce risks.
Option 2: affects the interface; alternatively, add a new function, like NewCacheWithConfig(cfg *CacheConfig), to avoid breaking changes.
Introducing a global variable may not be a good choice; option 2 (adding a new function) should be better.
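To make option 2 concrete, here is a minimal sketch of what such a constructor could look like. Only NewCacheWithConfig and CacheConfig come from the comment above; the stand-in types, field names, and the interface for the BPF handles are assumptions for illustration, not code from this PR.

package ads

// bpfHandles is a stand-in for the BPF loader/handles the cache needs
// (e.g. access to kmesh_ratelimit); in the real code this would be the
// BpfAds type discussed above.
type bpfHandles interface{}

// CacheConfig carries explicit dependencies so that no package-level
// global variable is needed.
type CacheConfig struct {
	Bpf bpfHandles
}

// AdsCache is a stand-in for the existing cache type in pkg/controller/ads.
type AdsCache struct {
	bpf bpfHandles
}

// NewCacheWithConfig is the option-2 constructor: the existing NewCache
// signature stays untouched and the BPF handle is threaded in explicitly.
func NewCacheWithConfig(cfg *CacheConfig) *AdsCache {
	return &AdsCache{bpf: cfg.Bpf}
}

A caller would build the CacheConfig once at startup and pass it down, which keeps the dependency visible at the call site instead of hiding it behind a global.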
We also need to delete kmesh_ratelimit whenever a pod is deleted, otherwise it causes a memory leak.
@hzxuzhonghu Maybe this PR can be merged, then @yuanqijing creates another PR to fix this issue?
@yuanqijing I think both @nlgwcy and I are concerned that this is a per-node rate limit, while with Istio-generated xDS it is per pod. https://github.com/kmesh-net/kmesh/pull/859#discussion_r1810778940
Yes, this is a per-node rate limit; it does not rate limit per-pod connection rates. It rate limits connections from downstream pods to an upstream service. The document says:
Envoy will call the rate limit service for every new connection on the listener where the filter is installed. The configuration specifies a specific domain and descriptor set to rate limit on.
It seems the rate limit does not specify which downstream pods should be processed; I don't get it. I understand that traditional rate limiting can restrict each pod, which makes sense. However, this might not align with Envoy's documentation.
Regarding the point "while with Istio-generated xDS, it is per pod": I don't quite understand which rate limit configuration case this refers to. Could you provide a specific example for discussion?
In kmesh we are a little different: we can only rate limit on the client side for now. So take the example here, though it is HTTP.
https://istio.io/latest/docs/tasks/policy-enforcement/rate-limit/#local-rate-limit
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: filter-local-ratelimit-svc
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      app: productpage
  configPatches:
    - applyTo: NETWORK_FILTER
      match:
        context: SIDECAR_OUTBOUND
It should be applied on each client instance, not on all the instances of a node.
Understood, so the key point here is whether instances on a node share a token bucket.
Option 1: client_A->Service_C: 5 tokens, client_B->Service_C: 6 tokens.
Option 2: Service_C: 11 tokens.
Envoy uses option 2, but in kmesh we use option 1? Is my understanding correct?
Envoy can use both; in kmesh we can only use option 1.
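To make the two options concrete, here is a tiny in-memory illustration in plain Go (not the BPF implementation; the bucket counts and the client_A/client_B/Service_C names come from the comment above, everything else is hypothetical):

package main

import "fmt"

// bucket is a simplified token bucket; the real state lives in BPF maps and is
// also refilled over time, which is omitted here.
type bucket struct{ tokens int }

// Option 1: one bucket per (client, service) pair, so each client instance on
// the node consumes its own tokens.
var perClient = map[string]*bucket{
	"client_A->Service_C": {tokens: 5},
	"client_B->Service_C": {tokens: 6},
}

// Option 2: one bucket per service, shared by every client instance on the node.
var perService = map[string]*bucket{
	"Service_C": {tokens: 11},
}

// take consumes one token if any is left.
func take(b *bucket) bool {
	if b == nil || b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	// Option 1: client_A and client_B are limited independently of each other.
	fmt.Println(take(perClient["client_A->Service_C"])) // true, tokens left for client_A only go down
	// Option 2: every connection to Service_C draws from the same shared bucket.
	fmt.Println(take(perService["Service_C"])) // true, the whole node shares the remaining tokens
}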
I have added a new field to distinguish different pods:
struct ratelimit_key {
	union {
		struct {
			__u64 netns;  /* Network namespace. */
			__u32 ipv4;   /* Destination IPv4 address. */
			__u32 port;   /* Destination port. */
			__u32 family; /* Address family (e.g., AF_INET) */
		} sk_skb;
	} key;
};
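With netns in the key, the user-space side can also address the earlier memory-leak concern by deleting every entry that belongs to a pod's network namespace once the pod is gone. Below is a minimal sketch using cilium/ebpf; the Go key/value mirror, the value layout, and the cleanupPodEntries helper are assumptions for illustration, not code from this PR.

package ratelimit

import "github.com/cilium/ebpf"

// ratelimitKey mirrors struct ratelimit_key above; field order, sizes and the
// trailing padding must match the C layout exactly.
type ratelimitKey struct {
	Netns  uint64
	IPv4   uint32
	Port   uint32
	Family uint32
	_      uint32 // padding: the C struct is 8-byte aligned, 24 bytes total
}

// ratelimitValue is a placeholder for the per-key token-bucket state; the real
// field layout is defined on the BPF side.
type ratelimitValue struct {
	Tokens    uint64
	LastTopup uint64
}

// cleanupPodEntries deletes every kmesh_ratelimit entry whose key belongs to
// the given network namespace, e.g. when the corresponding pod is deleted,
// so stale buckets do not accumulate in the map.
func cleanupPodEntries(m *ebpf.Map, netns uint64) error {
	var (
		key   ratelimitKey
		value ratelimitValue
		stale []ratelimitKey
	)
	it := m.Iterate()
	for it.Next(&key, &value) {
		if key.Netns == netns {
			// Collect keys first; deleting while iterating a hash map can skip entries.
			stale = append(stale, key)
		}
	}
	if err := it.Err(); err != nil {
		return err
	}
	for _, k := range stale {
		if err := m.Delete(&k); err != nil {
			return err
		}
	}
	return nil
}

The point of keying on netns is that cleanup can find all buckets owned by a pod without tracking them separately in user space.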
@hzxuzhonghu Hi, I think there are 2 pending issues:
- When cleaning up the listener, also clean up the ratelimit configuration. Can I create another PR to fix this?
- The concurrency issue here may not have been fully discussed yet.
- When cleaning up the listener, also clean up the ratelimit configuration. Can I create another PR to fix this?
Yes, I think it's OK.
- The concurrency issue here may not have been fully discussed yet.
@nlgwcy suggested using atomic updates.
The suggested atomic updates are all addressed: https://github.com/kmesh-net/kmesh/pull/859#discussion_r1810771130
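For readers following along, one common way to get an atomic update of a BPF map value from user space is to build the complete value and write it in a single map update call instead of a read-modify-write sequence. Continuing the hypothetical sketch above (same assumed key/value types, not code from this PR):

// writeRatelimitAtomically builds the whole value in user space and writes it
// with one BPF_MAP_UPDATE_ELEM call, so the BPF program never observes a
// half-updated bucket.
func writeRatelimitAtomically(m *ebpf.Map, key ratelimitKey, tokens, lastTopup uint64) error {
	value := ratelimitValue{Tokens: tokens, LastTopup: lastTopup}
	// ebpf.UpdateAny creates the entry if it is missing or replaces it otherwise.
	return m.Update(&key, &value, ebpf.UpdateAny)
}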
/lgtm
New changes are detected. LGTM label has been removed.
/retest
/approve