
ipam: Change default rate limiting for access to external APIs


This PR addresses three issues around cloud API rate limiting in the operator (one commit per problem):

  1. A bug is fixed where the actual rate limit under load was half of the requested rate limit (see the sketch right after this list).
  2. A bug is fixed where the Azure IPAM did not rate limit all API calls.
  3. The rate limit defaults are adjusted to better reflect the rate limits of the cloud providers. See below for details.
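
For illustration, here is a rough sketch of how such a halving can come about with Go's `golang.org/x/time/rate` token bucket: using `Reserve()` to learn the delay (e.g. for a metric) and then blocking via `Wait()` takes two reservations per call. This is a simplified stand-in, not the actual operator code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// limitBuggy sketches the problematic pattern: Reserve() is used to learn
// the delay, but the code then also blocks via Wait(), which takes a
// *second* reservation. Under load every call burns two tokens, so the
// effective rate is half of the configured one.
func limitBuggy(ctx context.Context, limiter *rate.Limiter) {
	r := limiter.Reserve()
	if delay := r.Delay(); delay > 0 {
		// delay would be reported as a rate-limit metric here
		_ = limiter.Wait(ctx) // BUG: second reservation doubles the wait
	}
}

func main() {
	// 10 tokens/s with a burst of 1: sustained load should run at ~10 ops/s.
	limiter := rate.NewLimiter(rate.Limit(10), 1)

	start := time.Now()
	for i := 0; i < 20; i++ {
		limitBuggy(context.Background(), limiter)
	}
	fmt.Printf("20 ops took %v (~2s expected, ~4s with the bug)\n", time.Since(start))
}
```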

Note: A subsequent PR will be opened to make the operator more robust in case of failures, such as the operator leaking resources when rate limits are being hit. But since those changes are largely unrelated to the rate limit logic itself (e.g. the operator can also leak resources when crashing), I decided to keep the two PRs separate.


New IPAM cloud provider API rate limits

When provisioning network interfaces and IP addresses for Cilium ENI, Azure, or AlibabaCloud IPAM, the operator has to issue requests to the cloud provider API. Those requests may be throttled if the operator issues too many of them in a short time window (which can, for example, happen when many nodes are added to the cluster at once). Users have reported that the current limits are not strict enough to avoid this throttling. If throttling occurs, some operations (like interface deletion) might fail, causing potential resource leaks.

Therefore, this commit adjusts the rate limit used by the Cilium operator. The new limit consists of a bucket size of 20 (also called the burst size) and a bucket refill rate (called QPS in the config flag) of 4. The old refill rate of 20 (effectively 10 under load, due to a bug fixed in the previous commit) was reportedly high enough to be throttled on the Amazon EC2 API under load.

The new refill rate of 4 queries per second (under load) is therefore chosen rather conservatively, to ensure we are not throttled by the backend API. To compensate, the burst size is raised to 20, meaning the operator can issue up to 20 operations before the throttling to 4 queries per second kicks in.
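
A minimal sketch of the new default behavior, assuming the token-bucket semantics of `golang.org/x/time/rate` (the operator's actual wiring may differ):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// New defaults: bucket size (burst) 20, refill rate (QPS) 4.
	limiter := rate.NewLimiter(rate.Limit(4), 20)

	start := time.Now()
	for i := 1; i <= 30; i++ {
		// The first 20 calls drain the full bucket and return immediately;
		// every call after that waits ~250ms for a token to be refilled.
		_ = limiter.Wait(context.Background())
		fmt.Printf("op %2d at %v\n", i, time.Since(start).Round(10*time.Millisecond))
	}
}
```

A burst of simultaneous node additions is thus served immediately out of the bucket, while sustained churn settles at the 4 requests per second refill rate.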

It should be noted that Cilium's rate limiter is shared across all operations (read and write), whereas cloud providers have more sophisticated per-operation quotas. But since the interaction in a real system is likely even more complicated than that (e.g. the quota is likely also shared with other API users), we stick to a rather simple and conservative limit by default.

If the new default limit is too low for sophisticated or large-scale users, it remains possible to override the default values.
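
Concretely, this means raising the cilium-operator flags `--limit-ipam-api-qps` and `--limit-ipam-api-burst` (or the matching Helm values added later in this PR), for example `--limit-ipam-api-qps=10 --limit-ipam-api-burst=100`. The example numbers here are arbitrary and should be chosen against the provider quotas listed below.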

For reference, these are the documented API rate limits on supported cloud providers:

Cilium Request Rate

| Type | Burst | RPS |
| ---- | ----- | --- |
| Old  | 4     | 20  |
| New  | 20    | 4   |

Amazon EC2 API ^1

| Type   | Burst | RPS |
| ------ | ----- | --- |
| Reads  | 100   | 20  |
| Writes | 200   | 10  |

Azure API

Network API ^2

| Type   | Burst* | RPS* |
| ------ | ------ | ---- |
| Reads  | 1000   | 3.33 |
| Writes | 10000  | 33.3 |

*Estimated burst size and per-second refill rate based on their 5-minute window (e.g. 1,000 reads per 5 minutes ≈ 1000 / 300 s ≈ 3.33 requests per second).

Compute API ^3

The Compute API does not document its default limits, and the actual setup appears to be more sophisticated. But judging from the sample response on the page, the rates that apply overall seem to be in the same ballpark as the Network API.

Alibaba ECS API ^4

According to the public documentation, the limits are only visible to logged-in users. They may vary based on region and subscription. We assume they are also in the same ballpark as the other two providers.

gandro · Sep 21 '22 13:09

Pushed an additional commit which adds the upgrade note and Helm values to configure the setting.

gandro · Sep 21 '22 14:09

Fixed missing cmdref update in https://github.com/cilium/cilium/pull/21387/commits/f0ea5b6dd43f7dc2ac65ed360f5359deef13182c

gandro · Sep 22 '22 08:09

/test

gandro · Sep 22 '22 10:09

/test-1.16-4.9

gandro · Sep 22 '22 17:09

/test-1.23-5.4

gandro · Sep 22 '22 17:09

/test-1.24-4.19

gandro · Sep 22 '22 17:09

/test-1.25-net-next

gandro · Sep 22 '22 17:09

/test-runtime

Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed:


Test Name: K8sDatapathConfig Check BPF masquerading with ip-masq-agent VXLAN

Failure Output: FAIL: Failed to add ip route

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-net-next so I can create one.

gandro · Sep 22 '22 17:09

  • Runtime seems to have hit #19093. Restarting to make sure this is unrelated.
  • Net-next hit known flake #21120 - not restarting this one, it's clear that this is unrelated.

gandro · Sep 26 '22 09:09

/test-runtime

gandro · Sep 26 '22 09:09

@pippolo84 Thanks for the feedback, addressed your comments.

Rebased on master; the main diff since the last push is the following:

diff --git a/pkg/api/helpers/rate_limit_test.go b/pkg/api/helpers/rate_limit_test.go
index 47393229483b..62bac916dedb 100644
--- a/pkg/api/helpers/rate_limit_test.go
+++ b/pkg/api/helpers/rate_limit_test.go
@@ -33,13 +33,13 @@ func (e *HelpersSuite) TestRateLimitBurst(c *check.C) {
        for i := 0; i < 10; i++ {
                limiter.Limit(context.TODO(), "test")
        }
-       c.Assert(metricsAPI.RateLimit("test"), checker.DeepEquals, time.Duration(0))
+       c.Assert(metricsAPI.RateLimit("test"), check.Equals, time.Duration(0))
 
        // Rate limit should now kick in (use an expired context to avoid waiting 1sec)
        ctx, cancel := context.WithTimeout(context.TODO(), time.Microsecond)
        defer cancel()
        limiter.Limit(ctx, "test")
-       c.Assert(metricsAPI.RateLimit("test"), check.Not(checker.DeepEquals), time.Duration(0))
+       c.Assert(metricsAPI.RateLimit("test"), check.Not(checker.Equals), time.Duration(0))
 }
 
 func (e *HelpersSuite) TestRateLimitWait(c *check.C) {
@@ -49,9 +49,9 @@ func (e *HelpersSuite) TestRateLimitWait(c *check.C) {
 
        // Exhaust bucket
        limiter.Limit(context.TODO(), "test")
-       c.Assert(metricsAPI.RateLimit("test"), checker.DeepEquals, time.Duration(0))
+       c.Assert(metricsAPI.RateLimit("test"), checker.Equals, time.Duration(0))
 
-       // Hit rate limit 15 times. The bucket refill rete is 100 per second,
+       // Hit rate limit 15 times. The bucket refill rate is 100 per second,
        // meaning we expect this to take around 15 * 10 = 150 milliseconds
        start := time.Now()
        for i := 0; i < 15; i++ {

gandro · Sep 26 '22 13:09

I left another question inline regarding the default. Thanks for the thorough description of the problem in the commit message 💯 👍

pippolo84 · Sep 26 '22 13:09

/test

gandro · Sep 26 '22 16:09

k8s-1.23-kernel-5.4 hit an unrelated error, created flake issue https://github.com/cilium/cilium/issues/21460

This is ready to merge now.

gandro · Sep 27 '22 16:09