ipam: Change default rate limiting access to external APIs
This PR addresses three issues around cloud API rate limiting in the operator (one commit per problem):
- A bug is fixed where the actual rate limit under load was half of the requested rate limit (see the illustrative sketch after this list)
- A bug is fixed where the Azure IPAM did not rate limit all API calls
- The rate limit defaults are adjusted to better reflect the rate limits of the cloud provider. See below for details.
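To make the first bullet concrete, here is a hypothetical sketch (not the actual operator code) of how a factor-of-two rate limiting bug can manifest: if every API call accidentally passes through the token-bucket limiter twice, each request consumes two tokens and the sustained rate under load drops to half of the configured QPS.

```go
// Hypothetical illustration only; this is not the Cilium implementation.
// Each request goes through the limiter twice, so the effective rate under
// load ends up at half of the configured QPS.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const qps = 100
	limiter := rate.NewLimiter(rate.Limit(qps), 1)

	start := time.Now()
	const calls = 50
	for i := 0; i < calls; i++ {
		_ = limiter.Wait(context.Background()) // intended: one token per request
		_ = limiter.Wait(context.Background()) // bug: a second token is consumed
	}
	elapsed := time.Since(start)
	fmt.Printf("effective rate: %.0f req/s (configured %d)\n",
		float64(calls)/elapsed.Seconds(), qps)
	// Prints roughly 50 req/s instead of 100.
}
```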
Note: A subsequent PR will be opened to make the operator more robust in case of failures, such as the operator leaking resources when rate limits are hit. But since those changes are largely unrelated to the rate limit logic itself (e.g. the operator can also leak resources when crashing), I decided to keep the two PRs separate.
New IPAM cloud provider API rate limits
When provisioning network interfaces and IP addresses for Cilium ENI, Azure or AlibabaCloud IPAM, the operator has to issue requests to the cloud provider API. Those requests might be subject to throttling if the operator issues too many requests in a short time window (which can happen, for example, if many nodes are added to the cluster at once). Users have reported that the current limits are not sufficient to avoid throttling by the cloud provider. If throttling occurs, some operations (like interface deletion) might fail, causing potential resource leaks.
Therefore, this commit adjusts the rate limits used by the Cilium operator. The new limit consists of a bucket size of 20 (also called the burst size) and a bucket refill rate (called QPS in the config flag) of 4. The old refill rate of 20 (effectively 10 under load, due to a bug fixed in the previous commit) was reportedly high enough to be throttled by the Amazon EC2 API under load.
The new refill rate of 4 queries per second (under load) is chosen rather conservatively, to ensure we are not throttled by the backend API. To compensate, the burst size is raised to 20, meaning the operator can issue up to 20 operations in quick succession before throttling to 4 queries per second kicks in.
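To illustrate how the two values interact, below is a minimal, self-contained sketch using golang.org/x/time/rate as a stand-in token bucket (illustrative only, not the operator's actual wiring; the constant names are mine): the first 20 requests are served immediately from the burst, after which requests are paced at the 4 QPS refill rate.

```go
// Minimal sketch of the new defaults (burst 20, refill 4 QPS) using a plain
// token-bucket limiter. Illustrative only; constant names are mine.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const (
		qps   = 4  // bucket refill rate ("QPS" in the config flag)
		burst = 20 // bucket size
	)
	limiter := rate.NewLimiter(rate.Limit(qps), burst)

	start := time.Now()
	for i := 1; i <= 30; i++ {
		_ = limiter.Wait(context.Background())
		if i == burst || i == 30 {
			fmt.Printf("request %2d served after %v\n",
				i, time.Since(start).Round(10*time.Millisecond))
		}
	}
	// Expected: request 20 is served almost immediately (burst), while
	// requests 21-30 take roughly 10/4 = 2.5s at the refill rate.
}
```

In other words, the burst of 20 absorbs short spikes (e.g. many nodes joining at once), while the 4 QPS refill bounds the sustained load on the cloud provider API.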
It should be noted that Cilium's rate limiter is shared across all operations (read and write), whereas cloud providers have more sophisticated per-operation quotas. But since the interaction in a real system is likely even more complicated than that (e.g. the quota is likely also shared with other API users), we stick to a rather simple and conservative limit by default.
If the new default limit is too low for sophisticated or large-scale users, it remains possible to override the default values if needed.
For reference, these are the documented API rate limits on supported cloud providers:
Cilium Request Rate
Type | Burst | RPS |
---|---|---|
Old | 4 | 20 |
New | 20 | 4 |
Amazon EC2 API ^1
Type | Burst | RPS |
---|---|---|
Reads | 100 | 20 |
Writes | 200 | 10 |
Azure API
Network API ^2
Type | Burst* | RPS* |
---|---|---|
Reads | 1000 | 3.33 |
Writes | 10000 | 33.3 |
*Estimated from Azure's documented 5-minute window: Burst is the per-window quota, RPS the derived per-second refill rate.
Compute API ^3
The Compute API does not document its default limits, and the actual setup seems to be more sophisticated. But judging from the sample response on that page, the rates that apply overall seem to be in the same ballpark as the Network API.
Alibaba ECS API ^4
According to the public documentation, the limits are only visible to logged-in users. They may vary based on region and subscription. We assume they are also in the same ballpark as the other two providers.
Pushed an additional commit which adds the upgrade note and Helm values to configure the setting.
Fixed missing cmdref update in https://github.com/cilium/cilium/pull/21387/commits/f0ea5b6dd43f7dc2ac65ed360f5359deef13182c
/test
/test-1.16-4.9
/test-1.23-5.4
/test-1.24-4.19
/test-1.25-net-next
/test-runtime
Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed:
Test Name: K8sDatapathConfig Check BPF masquerading with ip-masq-agent VXLAN
Failure Output: FAIL: Failed to add ip route
If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-net-next so I can create one.
- Runtime seems to have hit #19093. Restarting to make sure this is unrelated.
- Net-next hit known flake #21120 - not restarting this one, it's clear that this is unrelated.
/test-runtime
@pippolo84 Thanks for the feedback, addressed your comments.
Rebased on master; the main diff compared to the last push is the following:
diff --git a/pkg/api/helpers/rate_limit_test.go b/pkg/api/helpers/rate_limit_test.go
index 47393229483b..62bac916dedb 100644
--- a/pkg/api/helpers/rate_limit_test.go
+++ b/pkg/api/helpers/rate_limit_test.go
@@ -33,13 +33,13 @@ func (e *HelpersSuite) TestRateLimitBurst(c *check.C) {
for i := 0; i < 10; i++ {
limiter.Limit(context.TODO(), "test")
}
- c.Assert(metricsAPI.RateLimit("test"), checker.DeepEquals, time.Duration(0))
+ c.Assert(metricsAPI.RateLimit("test"), check.Equals, time.Duration(0))
// Rate limit should now kick in (use an expired context to avoid waiting 1sec)
ctx, cancel := context.WithTimeout(context.TODO(), time.Microsecond)
defer cancel()
limiter.Limit(ctx, "test")
- c.Assert(metricsAPI.RateLimit("test"), check.Not(checker.DeepEquals), time.Duration(0))
+ c.Assert(metricsAPI.RateLimit("test"), check.Not(checker.Equals), time.Duration(0))
}
func (e *HelpersSuite) TestRateLimitWait(c *check.C) {
@@ -49,9 +49,9 @@ func (e *HelpersSuite) TestRateLimitWait(c *check.C) {
// Exhaust bucket
limiter.Limit(context.TODO(), "test")
- c.Assert(metricsAPI.RateLimit("test"), checker.DeepEquals, time.Duration(0))
+ c.Assert(metricsAPI.RateLimit("test"), checker.Equals, time.Duration(0))
- // Hit rate limit 15 times. The bucket refill rete is 100 per second,
+ // Hit rate limit 15 times. The bucket refill rate is 100 per second,
// meaning we expect this to take around 15 * 10 = 150 milliseconds
start := time.Now()
for i := 0; i < 15; i++ {
I left another question inline regarding the default. Thanks for the thorough description of the problem in the commit message :100: :+1:
/test
k8s-1.23-kernel-5.4 hit an unrelated error, created flake issue https://github.com/cilium/cilium/issues/21460
This is ready to merge now.