feat: dns resolution for upstream nodes should not return IPs that are unavailable/faulty
Current Behavior
Traffic is still forwarded to the faulty node.
https://github.com/apache/apisix/blob/master/apisix/utils/upstream.lua#L70
According to the code here, when an upstream node is a domain name (i.e. load balancing over a domain), DNS resolution is performed, but only a single IP is returned, chosen at random.
That randomly returned IP can therefore happen to be the faulty node.
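For illustration only, here is a minimal standalone Lua sketch of the behavior described above (the resolver below is a stub returning the A records from the example in this issue, not the actual APISIX resolver): the domain is resolved to a list of IPs and one of them is picked at random, with no knowledge of which IPs are healthy.
-- stub resolver: stands in for a real DNS lookup
local function resolve(domain)
    local records = {
        ["www.mytest.com"] = { "192.168.247.4", "192.168.247.2", "192.168.247.3" },
    }
    return records[domain]
end

-- pick one resolved IP at random, as described in this issue
local function pick_random_ip(domain)
    local ips = resolve(domain)
    if not ips or #ips == 0 then
        return nil, "no A record for " .. domain
    end
    return ips[math.random(#ips)]
end

math.randomseed(os.time())
print(pick_random_ip("www.mytest.com"))  -- may well print the faulty 192.168.247.4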
Expected Behavior
Faulty nodes should be removed and should not receive traffic.
Error Logs
2023/12/09 22:36:56 [error] 15767#89433274: *42241 [lua] balancer.lua:363: run(): failed to pick server: failed to find valid upstream server, all upstream servers tried while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /dns/test HTTP/1.1", upstream: "http://192.168.247.4:80/dns/test", host: "127.0.0.1:9080"
Steps to Reproduce
- Prepare two domains
$ dig @127.0.0.1 www.mytest.com
www.mytest.com. 0 IN A 192.168.247.4
www.mytest.com. 0 IN A 192.168.247.2
www.mytest.com. 0 IN A 192.168.247.3
$ dig @127.0.0.1 www.mytemp.com
www.mytemp.com. 0 IN A 192.168.246.3
www.mytemp.com. 0 IN A 192.168.246.4
www.mytemp.com. 0 IN A 192.168.246.2
- Both domains have a faulty node
$ curl http://192.168.247.4/
curl: (7) Failed to connect to 192.168.247.4 port 80 after 4888 ms: Couldn't connect to server
$ curl http://192.168.246.3/
curl: (7) Failed to connect to 192.168.246.3 port 80 after 4888 ms: Couldn't connect to server
- The complete configuration is as follows (one way to apply it via the Admin API is sketched after these steps)
{
"id": "490771170321239793",
"create_time": 1702052012,
"update_time": 1702132481,
"uri": "/dns/test",
"name": "dns_test",
"methods": [
"GET",
"POST",
"PUT",
"DELETE",
"PATCH",
"HEAD",
"OPTIONS",
"CONNECT",
"TRACE"
],
"upstream": {
"nodes": {
"www.mytemp.com:80": 1,
"www.mytest.com:80": 1
},
"timeout": {
"connect": 6,
"send": 6,
"read": 6
},
"type": "roundrobin",
"checks": {
"active": {
"concurrency": 10,
"healthy": {
"http_statuses": [
200,
302
],
"interval": 1,
"successes": 2
},
"http_path": "/aa",
"port": 80,
"timeout": 1,
"type": "http",
"unhealthy": {
"http_failures": 5,
"http_statuses": [
429,
404,
500,
501,
502,
503,
504,
505
],
"interval": 1,
"tcp_failures": 2,
"timeouts": 3
}
}
},
"scheme": "http",
"pass_host": "pass",
"keepalive_pool": {
"idle_timeout": 60,
"requests": 1000,
"size": 320
}
},
"status": 1
}
- Initiate a request
curl http://127.0.0.1:9080/dns/test -i
- There is a certain probability that an error will occur, as follows
HTTP/1.1 502 Bad Gateway
Date: Sat, 09 Dec 2023 14:36:21 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 154
Connection: keep-alive
Server: APISIX/3.7.0
X-APISIX-Upstream-Status: 504 :
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>openresty</center>
</body>
</html>
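For reference, a sketch of how a route like the one above can be created via the Admin API, assuming APISIX 3.x defaults (Admin API on 127.0.0.1:9180, a placeholder admin key, and a hypothetical route.json holding the fields shown above minus the server-managed id/create_time/update_time):
curl -X PUT http://127.0.0.1:9180/apisix/admin/routes/490771170321239793 \
  -H "X-API-KEY: ${APISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d @route.json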
Environment
- APISIX version (run apisix version): APISIX/3.7.0
- Operating system (run uname -a): Darwin
- OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.21.4.2
- etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run luarocks --version):
The health check API requires some number of tries before a node is marked as unhealthy and removed. If both of your nodes are unavailable then it is guaranteed that your request will fail either way (by the health check mechanism or not).
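While reproducing, the health checker's view of each node can be inspected through the control API (a sketch assuming the default control API address 127.0.0.1:9090, the same address used for server_info in the environment section above; health data only appears once the checker has been triggered by traffic to the route):
curl http://127.0.0.1:9090/v1/healthcheck
Nodes are reported as unhealthy only after the configured number of failures, which matches the behavior described above.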
I know what you mean, but for the same domain, even though other nodes are available, it cannot be guaranteed that traffic will always be forwarded to healthy nodes, because the IP address resolved from the domain is chosen at random.
@shreemaan-abhishek , I would like to debug this.
I was able to reproduce the issue, but this is not a bug. APISIX can only become aware of a failed node once it is unable to connect to the service, so ignoring bad nodes even on the first attempt should be treated as a feature rather than a bug fix.
hey, could you tell me the steps that you performed to set up the domains locally on your system? cc: @sheharyaar @zuiyangqingzhou
@zuiyangqingzhou, do these domains still contain one faulty node?
Yes, you can refer to this.
You can use dnsmasq to set up local domain name resolution @nitishfy
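For anyone else setting this up, a minimal dnsmasq sketch intended to serve several A records per name (the file path is just an example, and depending on your dnsmasq version you may prefer host-record= entries over repeated address= lines):
# e.g. /etc/dnsmasq.d/apisix-dns-test.conf
address=/www.mytest.com/192.168.247.4
address=/www.mytest.com/192.168.247.2
address=/www.mytest.com/192.168.247.3
address=/www.mytemp.com/192.168.246.3
address=/www.mytemp.com/192.168.246.4
address=/www.mytemp.com/192.168.246.2

# restart dnsmasq, then verify with: dig @127.0.0.1 www.mytest.com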
Using domain names in the upstream nodes makes it impossible to distinguish the healthy nodes.
Prometheus scrapes the metric apisix_upstream_status using IP addresses instead of domain names, leaving us unaware of the corresponding node.
Can't we just use the domain name instead of the IP in the healthcheck API? @shreemaan-abhishek @sheharyaar
If two domains have the same IP, APISIX may even use the domain name of an unhealthy node. https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L110
for _, node in ipairs(nodes) do
    if node.domain then
        -- key is "ip:port"; if two domains resolve to the same ip:port,
        -- the later node overwrites the earlier mapping
        local addr = node.host .. ":" .. node.port
        addr_to_domain[addr] = node.domain
    end
end
https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L261
-- the picked server is looked up by ip:port only, so the reported domain may belong to a different node sharing that address
local domain = server_picker.addr_to_domain[server]
res.domain = domain
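To make the collision concrete, here is a small standalone Lua sketch (with hypothetical node data, not APISIX code) showing that when two domains resolve to the same ip:port, addr_to_domain keeps only the last entry written:
local nodes = {
    { host = "192.168.247.4", port = 80, domain = "www.mytest.com" },
    { host = "192.168.247.4", port = 80, domain = "www.mytemp.com" },  -- same ip:port, different domain
}

local addr_to_domain = {}
for _, node in ipairs(nodes) do
    if node.domain then
        local addr = node.host .. ":" .. node.port
        addr_to_domain[addr] = node.domain
    end
end

-- only one domain survives per address, so the domain reported for a picked server can be wrong
print(addr_to_domain["192.168.247.4:80"])  -- prints "www.mytemp.com"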
