apisix-ingress-controller

bug: The performance of ingress-controller's event handling

Open tangzhenhuang opened this issue 3 years ago • 21 comments

Issue description

After I created 3000+ ApisixRoute objects in the cluster, whenever apisix-ingress-controller starts or restarts (possibly after an OOM kill), resource synchronization takes a very long time because of client-go's rate limit. Endpoint changes that happen during this window are not synchronized in time, which causes 502 errors.

Relatedly, endpoint changes do not seem to be synchronized to APISIX promptly. My guess is that the control loop is too long: watchEndpoints -> translate -> apisix-admin-api -> etcd.

To sum up, performance like this definitely cannot be put into production. Would it be better for APISIX to do service discovery by itself? https://github.com/apache/apisix/pull/4880

Environment

  • your apisix-ingress-controller version (output of apisix-ingress-controller version --long): 2.11.0
  • your Kubernetes cluster version (output of kubectl version): 1.18.8
  • if you run apisix-ingress-controller in Bare-metal environment, also show your OS version (uname -a):

Minimal test code / Steps to reproduce

Actual result

Error log

Expected result

Endpoints can be watched in time

tangzhenhuang avatar Dec 31 '21 03:12 tangzhenhuang

Maybe I've said too much. I just think it would be better if service discovery were done by APISIX itself.

tangzhenhuang avatar Dec 31 '21 04:12 tangzhenhuang

Thanks for your report.

> your apisix-ingress-controller version (output of apisix-ingress-controller version --long): 2.11.0

What's your apisix-ingress-controller version? The latest version is v1.4 (not yet released).

The problem you encountered is somewhat similar to #806 and https://github.com/apache/apisix-ingress-controller/pull/760

tao12345666333 avatar Dec 31 '21 04:12 tao12345666333

Before #706 we used workqueue.AddRateLimited, which caused some problems.

The bug comes from a single workqueue being shared by all objects of the same resource, with a rate limit applied to every add to that workqueue. The rate limit is only needed when retrying after a failure. A normal resource change should be added to the workqueue immediately so it can be processed.
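A minimal sketch of the distinction described above, written as a toy Python model rather than the controller's Go code. In client-go's workqueue the corresponding methods are `Add`, `AddRateLimited`, and `Forget`; the class `RetryAwareQueue`, the helpers `handle_event` and `finish`, and the delay values are hypothetical names and numbers chosen only for illustration.

```python
import heapq
from collections import defaultdict


class RetryAwareQueue:
    """Toy model of a controller work queue (not the client-go implementation).

    add() makes an item available immediately, add_rate_limited() delays it
    with per-item exponential backoff, and forget() resets that backoff.
    """

    def __init__(self, base_delay=0.005, max_delay=1000.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._failures = defaultdict(int)  # per-item retry count
        self._heap = []                    # entries: (ready_at, sequence, item)
        self._seq = 0                      # tie-breaker keeps FIFO order

    def _push(self, item, ready_at):
        heapq.heappush(self._heap, (ready_at, self._seq, item))
        self._seq += 1

    def add(self, item, now):
        """Normal resource change: ready to process right away."""
        self._push(item, now)

    def add_rate_limited(self, item, now):
        """Retry after a failure: ready only after a growing delay."""
        exponent = min(self._failures[item], 30)  # cap to avoid huge numbers
        delay = min(self.base_delay * 2 ** exponent, self.max_delay)
        self._failures[item] += 1
        self._push(item, now + delay)
        return delay

    def forget(self, item):
        """Successful sync: clear the item's backoff history."""
        self._failures.pop(item, None)

    def pop_ready(self, now):
        """Return the next item whose ready time has passed, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None


def handle_event(queue, key, now):
    """Event handler: enqueue a changed object without any delay."""
    queue.add(key, now)


def finish(queue, key, err, now):
    """After a sync attempt: back off only when the attempt failed."""
    if err is None:
        queue.forget(key)
    else:
        queue.add_rate_limited(key, now)
```

With this split, a burst of endpoint changes is never delayed by the limiter. Only keys that keep failing wait longer between attempts.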

tao12345666333 avatar Dec 31 '21 04:12 tao12345666333

Yes, before #706 we did have this issue.

gxthrj avatar Dec 31 '21 04:12 gxthrj

> What's your apisix-ingress-controller version?

It's v1.4. I mistakenly gave the APISIX version.

tangzhenhuang avatar Dec 31 '21 04:12 tangzhenhuang

> It's v1.4

How did you install ingress-controller? Using Helm?

gxthrj avatar Dec 31 '21 04:12 gxthrj

I want to know whether you would consider combining apisix-ingress-controller with this approach: https://github.com/apache/apisix/pull/4880

tangzhenhuang avatar Dec 31 '21 04:12 tangzhenhuang

> How did you install ingress-controller? Using Helm?

Yes, I made my own Helm chart, because at that time the official chart only supported versions up to 1.3.0.

tangzhenhuang avatar Dec 31 '21 04:12 tangzhenhuang

> I want to know if you would consider combining apisix-ingress-controller with this way: apache/apisix#4880

It is not on the current roadmap.

Can we set up an online meeting? I would like to understand the specific problems you are running into and hear your thoughts.

tao12345666333 avatar Dec 31 '21 04:12 tao12345666333

> Yes, I made my own Helm chart

Please help confirm whether your own ingress-controller image contains bugfix #760.

gxthrj avatar Dec 31 '21 04:12 gxthrj

> Can we set up an online meeting?

Okay. How about scheduling it for next week, so that I have time to prepare a brief summary?

tangzhenhuang avatar Dec 31 '21 04:12 tangzhenhuang

> How about scheduling it for next week?

Sure. Because of the holiday, how about next Tuesday at 14:00? Or any other time that suits you.

tao12345666333 avatar Dec 31 '21 04:12 tao12345666333

> How about next Tuesday at 14:00?

I emailed you. How about 5 pm? I have other arrangements earlier that day.

tangzhenhuang avatar Dec 31 '21 06:12 tangzhenhuang

> How about 5 pm?

OK.

tao12345666333 avatar Dec 31 '21 08:12 tao12345666333

After discussing with @crazyMonkey1995, these are the problems he is currently encountering:

  • He saw some 502 errors during rolling updates of a large number of instances (no health check is configured). The main concern is that endpoint updates are not fast enough.

    • Two points need attention here:

      1. Health checks help Apache APISIX remove dead nodes promptly.
      2. #760 fixed the usage of the workqueue so that normal events are no longer rate limited, which lets endpoints be updated more quickly.
    • Action item:

      1. Perform stress testing to cover this scenario. @tao12345666333
  • Rate limiting of resources in APISIX Ingress controller.

    • #760 solves this problem and has been released in v1.4.
  • In the single-instance APISIX scenario, APISIX Ingress controller cannot re-establish a connection with an APISIX instance that has died.

    • https://github.com/apache/apisix-ingress-controller/pull/774 solves this problem and has been released in v1.4.
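For the health check point above, here is a rough sketch of what an APISIX upstream with an active HTTP health check could look like, built as a Python dict and printed as JSON. The field names follow the `checks.active` layout of APISIX upstream objects as best recalled and should be verified against the APISIX health check documentation. The node addresses, the `/healthz` path, and all intervals and thresholds are placeholder assumptions. When the ingress controller manages the upstream, the same settings would normally be declared through its own resources rather than written by hand.

```python
import json

# Hypothetical upstream definition with an active HTTP health check.
upstream = {
    "type": "roundrobin",
    # Placeholder pod addresses with equal weights.
    "nodes": {"10.0.0.1:8080": 1, "10.0.0.2:8080": 1},
    "checks": {
        "active": {
            "type": "http",
            "http_path": "/healthz",  # assumed probe path
            "timeout": 1,             # seconds to wait for each probe
            # Mark a node healthy again after 2 successful probes.
            "healthy": {"interval": 1, "successes": 2},
            # Mark a node unhealthy after 2 failed probes.
            "unhealthy": {"interval": 1, "http_failures": 2},
        }
    },
}

print(json.dumps(upstream, indent=2))
```

With probes like these, a pod that has been terminated during a rolling update can be taken out of rotation by APISIX itself, even if the endpoint update from the controller arrives late.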

tao12345666333 avatar Jan 05 '22 04:01 tao12345666333

https://github.com/apache/apisix-ingress-controller/pull/760#issuecomment-1005503358

This fix cannot solve the problem in some scenarios. For example, suppose there are two instances of apisix-ingress-controller. When the leader goes down for some reason and the other instance becomes the leader, the new leader's handling of resource events is blocked by "client-side throttling" during the resource list phase. Because apisix-ingress-controller has not completed the list phase at that point, the control loop is blocked.
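To get a feel for how long client-side throttling can hold things up, here is a back-of-the-envelope estimate using a token bucket model. It assumes the client starts with a full bucket and that each object costs one API request, which may not match what the controller actually does. The values 5 QPS and burst 10 are, to my knowledge, the client-go defaults when nothing is configured. The function name `time_to_issue` is made up for this sketch.

```python
def time_to_issue(requests, qps, burst):
    """Seconds until the last of `requests` calls is admitted by a token
    bucket that starts full with `burst` tokens and refills at `qps` per second."""
    if requests <= burst:
        return 0.0  # the initial burst covers everything
    return (requests - burst) / qps


# 3000 requests at 5 QPS with a burst of 10 take roughly ten minutes.
print(time_to_issue(3000, qps=5, burst=10))  # 598.0
```

Under these assumptions, any endpoint change that has to wait behind those requests could be delayed by minutes, which is consistent with the 502 errors reported in this issue.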

tao12345666333 avatar Jan 05 '22 09:01 tao12345666333

@crazyMonkey1995 I have changed the title. All of these problems can be considered related to how efficiently APISIX Ingress controller processes events.

tao12345666333 avatar Jan 05 '22 09:01 tao12345666333

We have entered the v1.5 release window. You can test and verify with the latest code, or wait until v1.5 is released.

tao12345666333 avatar Jul 28 '22 04:07 tao12345666333

UPDATE: Tested with the latest code (commit reference: dfcbaac8f2b8c9c5ece12e3454fa57a2a23dba65):

  1. Run 50 replicas behind the endpoints.
  2. Use ab to send requests with a concurrency of 100.
  3. Run rollout restart on the deployment (which triggers a rolling update of the endpoints).
  4. Across multiple runs, the endpoint update performance problem could not be reproduced.
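The test above used ab. As a rough alternative for counting 502 responses while the rollout is in progress, the steps can be sketched with a small Python probe. The helper names `fetch_status` and `tally` and the gateway URL in the usage note are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for one GET, or 'error' if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx and 5xx responses, including 502
    except (urllib.error.URLError, OSError):
        return "error"   # connection refused, timeout, DNS failure


def tally(fetch, total, concurrency):
    """Call fetch() `total` times using `concurrency` worker threads
    and count how often each result occurred."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(lambda _: fetch(), range(total)))
```

For example, `tally(lambda: fetch_status("http://gateway.example/"), 10000, 100)` run during the rollout restart would return a count per status code, so any 502 responses show up directly.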

tangzhenhuang avatar Aug 11 '22 06:08 tangzhenhuang

Thanks for the update.

So can we consider this issue fixed and close it?

tao12345666333 avatar Aug 11 '22 08:08 tao12345666333

> So can we consider this issue fixed and close it?

Okay.

tangzhenhuang avatar Aug 11 '22 11:08 tangzhenhuang