
bug: If the admin API responds too slowly, it cannot guarantee the order in which requests are processed

Open wheatt opened this issue 1 year ago • 7 comments

Issue description

In pkg/ingress/endpoint.go, the syncEndpoint method

c.controller.syncEndpoint(ctx, newestEp)

calls the dashboard API to update the upstream's IP information:

resp, err := u.cluster.updateResource(ctx, url, "upstream", bytes.NewReader(body))

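The API call has a default timeout of 5s. When it times out, an error is returned and the next sync of the endpoint resource starts immediately. For an HTTP request, however, a client-side timeout does not stop the server-side processing: the server keeps handling the request. If the next sync targets the same endpoint resource and its upstream update gets a fast response, the old IP list can overwrite the new IP list.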

Environment

  • your apisix-ingress-controller version (output of apisix-ingress-controller version --long):
  • your Kubernetes cluster version (output of kubectl version):
  • if you run apisix-ingress-controller in Bare-metal environment, also show your OS version (uname -a):

Minimal test code / Steps to reproduce

Actual result

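The first sync of endpoint-A carries the IP list ip1, ip2, ip3. If the call to the dashboard update upstream API (request 1) times out, the client returns immediately and moves on to the next sync, while the server is still processing request 1. The second sync of endpoint-A carries the IP list ip1, ip2 and calls the dashboard update upstream API again (request 2). Because the dashboard cannot guarantee that request 1 finishes before request 2, the data from request 1 may overwrite the data from request 2, leaving the upstream in a wrong state.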

Error log

none

Expected result

No response

wheatt avatar Jul 26 '22 06:07 wheatt

APISIX Ingress uses APISIX's admin API, not the dashboard.

Did you actually encounter this problem?

tao12345666333 avatar Jul 26 '22 09:07 tao12345666333

In fact, in APISIX Ingress, there is periodic resynchronization. If the APISIX admin API frequently times out in the user's environment, it is recommended to increase this frequency.

tao12345666333 avatar Jul 26 '22 09:07 tao12345666333

APISIX Ingress uses APISIX's admin API, not the dashboard.

Did you actually encounter this problem?

Yes, we ran into this problem in our production environment. I am still investigating why the admin API responds slowly.

wheatt avatar Jul 26 '22 11:07 wheatt

In fact, in APISIX Ingress, there is periodic resynchronization. If the APISIX admin API frequently times out in the user's environment, it is recommended to increase this frequency.

APISIX Ingress uses APISIX's admin API, not the dashboard.

Did you actually encounter this problem?

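I have tracked down why the admin API was slow. Every time the admin API sends a request to etcd, it calls etcd.new(), and that method always picks the first node in the configuration file to connect to (issue), so all connections ended up on a single etcd node. The nodes of our etcd cluster use the system default max open files limit of 1024, so responses slowed down once the number of connections grew. After we raised max open files, the problem went away.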

wheatt avatar Jul 28 '22 12:07 wheatt

I spent a long time trying to write this in English and could not keep going, so I am writing in Chinese after all.

wheatt avatar Jul 28 '22 12:07 wheatt

Thanks! I see that the issue has been resolved and the PR has been merged. It should be enabled in newer versions of APISIX (2.15.1

tao12345666333 avatar Jul 28 '22 16:07 tao12345666333

I saw a similar issue reported for Kong.

https://mp.weixin.qq.com/s/NpbJhqFpQkqIgCOqmCMZRA

https://github.com/Kong/kong/issues/7543

xuminwlt avatar Jul 29 '22 13:07 xuminwlt

I saw a similar issue reported for Kong.

https://mp.weixin.qq.com/s/NpbJhqFpQkqIgCOqmCMZRA

Kong/kong#7543

May I know how many services/routes you had in Kong when the problem occurred? We need some information to decide between Kong and APISIX. Appreciate the help 🙏

LL1024LL avatar Sep 29 '22 08:09 LL1024LL

This issue has been marked as stale due to 90 days of inactivity. It will be closed in 30 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 29 '22 01:12 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar Jan 28 '23 01:01 github-actions[bot]