bug: access upstream 502

Open cataglyphis opened this issue 1 year ago • 1 comments

Current Behavior

We find that APISIX sometimes returns a 502 Bad Gateway error when accessing an upstream, even though the upstream exists. Error rate:

  • 1/42458 in 15 minutes
  • 500+ times in 1 day

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

  1. Access the upstream through APISIX.

Environment

  • APISIX version (run apisix version): 3.2.2
  • Operating system (run uname -a): Linux apisix-9d7bb89b8-2cv6k 5.10.134-16.1.al8.x86_64 #1 SMP Thu Dec 7 14:11:24 UTC 2023 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.5.10
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):

cataglyphis avatar Oct 11 '24 11:10 cataglyphis

We encountered a similar issue previously.

The root of the problem was that APISIX was using HTTP/1.1 to proxy requests to the upstream server, and the keepalive_timeout was set to 60 seconds:

proxy_http_version 1.1;
keepalive_timeout 60s;

If the upstream server has its own, shorter keepalive timeout, as Gunicorn in Python does (its default keepalive timeout is 2 seconds), this can lead to issues. On the APISIX <-> upstream link, the upstream server closes an idle connection after 2 seconds. However, APISIX is unaware of this and still treats the connection as reusable; when it attempts to send the next request over this closed connection, the result is a 502 error.
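To make the failure mode concrete, here is a minimal sketch that reproduces it with plain TCP sockets on localhost. Everything in it is an assumption for illustration: the function names are made up, the timeouts are scaled down from 2 s / 60 s to fractions of a second, the "ping"/"pong" exchange stands in for an HTTP request, and the strings "200" and "502" stand in for what a proxy would report to its client.

```python
import socket
import threading
import time

UPSTREAM_IDLE_TIMEOUT = 0.2  # stand-in for the upstream's short keepalive (e.g. 2 s)
PROXY_IDLE_WAIT = 0.5        # the proxy keeps the connection pooled longer than that


def upstream(listener: socket.socket) -> None:
    """Answer requests on one connection; close it once it has been idle too long."""
    conn, _ = listener.accept()
    conn.settimeout(UPSTREAM_IDLE_TIMEOUT)
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(b"pong")
    except socket.timeout:
        pass  # idle timeout reached: the upstream gives up on the connection
    finally:
        conn.close()


def request(conn: socket.socket) -> str:
    """Send one request on a pooled connection and classify the outcome."""
    try:
        conn.sendall(b"ping")
        reply = conn.recv(1024)
    except OSError:
        return "502"  # reset or broken pipe: the peer had already closed
    return "200" if reply == b"pong" else "502"  # an empty reply means EOF


def demo() -> list:
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=upstream, args=(listener,), daemon=True).start()

    conn = socket.create_connection(listener.getsockname())
    conn.settimeout(2)
    results = [request(conn)]      # fresh connection
    time.sleep(PROXY_IDLE_WAIT)    # connection sits idle in the "pool"
    results.append(request(conn))  # reused after the upstream closed it
    conn.close()
    listener.close()
    return results


if __name__ == "__main__":
    print(demo())
```

The first request succeeds on the fresh connection; the second one is sent on a connection the upstream has already closed, which is the situation that surfaces as a 502.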

Solutions to this issue include:

  • Disabling keepalive on the upstream server.
  • Or, setting the upstream server’s keepalive timeout to be greater than 60 seconds.
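For the Gunicorn case, the second option could look roughly like the config file below. This is a sketch, not a recommendation: the 60-second figure is the one quoted above, the 15-second margin and the constant name are arbitrary assumptions, and `keepalive` only has an effect on worker types that support persistent connections (the default sync worker ignores it).

```python
# gunicorn.conf.py (sketch)

# Assumed idle timeout on the APISIX side, taken from the comment above.
APISIX_IDLE_TIMEOUT = 60

# Gunicorn's `keepalive` setting (same as --keep-alive), in seconds.
# Keeping it above the proxy's idle timeout means the proxy, not the
# upstream, is the side that closes idle connections first.
keepalive = APISIX_IDLE_TIMEOUT + 15
```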

wklken avatar Oct 15 '24 07:10 wklken

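We encountered a similar issue previously.

The root of the problem was that APISIX was using HTTP/1.1 to proxy requests to the upstream server, and the keepalive_timeout was set to 60 seconds: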

proxy_http_version 1.1;
keepalive_timeout 60s;

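If the upstream server is configured with keepalive, such as Gunicorn in Python (which has a default keepalive_timeout of 2 seconds), this can lead to issues. In the APISIX <-> upstream setup, the upstream server closes the connection after 2 seconds. However, APISIX is unaware of this, and when it attempts to use this closed connection for the next request, it results in a 502 error.

Solutions to this issue include:

  • Disabling keepalive on the upstream server.
  • Or, setting the upstream server's keepalive timeout to be greater than 60 seconds.

We can't change the business service, so we set idle_timeout to 0 in the upstream configuration. Unfortunately, the effect was not significant; we still see 502s.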

random-zhu avatar Dec 27 '24 10:12 random-zhu

@random-zhu

Some other solutions you can try:

  1. Use retries for the upstream: https://apisix.apache.org/docs/apisix/admin-api/#request-body-parameters-4
  2. Use nginx proxy_next_upstream with http_502.
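
Both suggestions amount to the same idea: when an attempt comes back as 502, try again on the next node instead of returning the error. A rough model of that behavior, with hypothetical names (`proxy_with_retries`, `send`, `nodes`) that are not part of APISIX or nginx:

```python
from typing import Callable, Sequence

RETRYABLE = {502}  # analogous to listing http_502 in proxy_next_upstream


def proxy_with_retries(
    send: Callable[[str], int], nodes: Sequence[str], retries: int
) -> int:
    """Try the request on successive nodes, making at most 1 + retries attempts."""
    status = 502
    for attempt in range(1 + retries):
        node = nodes[attempt % len(nodes)]  # move on to the next node each time
        status = send(node)
        if status not in RETRYABLE:
            return status  # success, or an error that should not be retried
    return status  # every attempt returned a retryable status
```

Keep in mind that nginx does not pass non-idempotent requests such as POST to the next server by default once they have been sent, so retries mostly help with idempotent methods unless that is explicitly enabled.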

wklken avatar Dec 28 '24 02:12 wklken