feat: dns resolution for upstream nodes should not return IPs that are unavailable/faulty
Current Behavior
Traffic is still forwarded to the faulty node.
https://github.com/apache/apisix/blob/master/apisix/utils/upstream.lua#L70
According to the code here, when an upstream node is a domain name (i.e. load balancing over a domain), DNS resolution is performed, but only a single IP is returned, chosen at random.
That randomly returned IP can therefore happen to be the faulty node.
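For illustration only, here is a minimal standalone Lua sketch of the behavior described above (the resolver below is a stub returning the A records from the example in this issue, not the actual APISIX resolver): the domain is resolved to a list of IPs and one of them is picked at random, with no knowledge of which IPs are healthy.
-- stub resolver: stands in for a real DNS lookup
local function resolve(domain)
    local records = {
        ["www.mytest.com"] = { "192.168.247.4", "192.168.247.2", "192.168.247.3" },
    }
    return records[domain]
end

-- pick one resolved IP at random, as described in this issue
local function pick_random_ip(domain)
    local ips = resolve(domain)
    if not ips or #ips == 0 then
        return nil, "no A record for " .. domain
    end
    return ips[math.random(#ips)]
end

math.randomseed(os.time())
print(pick_random_ip("www.mytest.com"))  -- may well print the faulty 192.168.247.4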
Expected Behavior
Faulty nodes should be removed and should not receive traffic.
Error Logs
2023/12/09 22:36:56 [error] 15767#89433274: *42241 [lua] balancer.lua:363: run(): failed to pick server: failed to find valid upstream server, all upstream servers tried while connecting to upstream, client: 127.0.0.1, server: _, request: "GET /dns/test HTTP/1.1", upstream: "http://192.168.247.4:80/dns/test", host: "127.0.0.1:9080"
Steps to Reproduce
- Prepare two domains
$ dig @127.0.0.1 www.mytest.com
www.mytest.com. 0 IN A 192.168.247.4
www.mytest.com. 0 IN A 192.168.247.2
www.mytest.com. 0 IN A 192.168.247.3
$ dig @127.0.0.1 www.mytemp.com
www.mytemp.com. 0 IN A 192.168.246.3
www.mytemp.com. 0 IN A 192.168.246.4
www.mytemp.com. 0 IN A 192.168.246.2
- Both domains have a faulty node
$ curl http://192.168.247.4/
curl: (7) Failed to connect to 192.168.247.4 port 80 after 4888 ms: Couldn't connect to server
$ curl http://192.168.246.3/
curl: (7) Failed to connect to 192.168.246.3 port 80 after 4888 ms: Couldn't connect to server
- The complete configuration is as follows (one way to apply it via the Admin API is sketched after these steps)
{
"id": "490771170321239793",
"create_time": 1702052012,
"update_time": 1702132481,
"uri": "/dns/test",
"name": "dns_test",
"methods": [
"GET",
"POST",
"PUT",
"DELETE",
"PATCH",
"HEAD",
"OPTIONS",
"CONNECT",
"TRACE"
],
"upstream": {
"nodes": {
"www.mytemp.com:80": 1,
"www.mytest.com:80": 1
},
"timeout": {
"connect": 6,
"send": 6,
"read": 6
},
"type": "roundrobin",
"checks": {
"active": {
"concurrency": 10,
"healthy": {
"http_statuses": [
200,
302
],
"interval": 1,
"successes": 2
},
"http_path": "/aa",
"port": 80,
"timeout": 1,
"type": "http",
"unhealthy": {
"http_failures": 5,
"http_statuses": [
429,
404,
500,
501,
502,
503,
504,
505
],
"interval": 1,
"tcp_failures": 2,
"timeouts": 3
}
}
},
"scheme": "http",
"pass_host": "pass",
"keepalive_pool": {
"idle_timeout": 60,
"requests": 1000,
"size": 320
}
},
"status": 1
}
- Initiate a request
curl http://127.0.0.1:9080/dns/test -i
- There is a certain probability that an error will occur, as follows
HTTP/1.1 502 Bad Gateway
Date: Sat, 09 Dec 2023 14:36:21 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 154
Connection: keep-alive
Server: APISIX/3.7.0
X-APISIX-Upstream-Status: 504 :
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>openresty</center>
</body>
</html>
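For reference, a sketch of how a route like the one above can be created via the Admin API, assuming APISIX 3.x defaults (Admin API on 127.0.0.1:9180, a placeholder admin key, and a hypothetical route.json holding the fields shown above minus the server-managed id/create_time/update_time):
curl -X PUT http://127.0.0.1:9180/apisix/admin/routes/490771170321239793 \
  -H "X-API-KEY: ${APISIX_ADMIN_KEY}" \
  -H "Content-Type: application/json" \
  -d @route.json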
Environment
- APISIX version (run apisix version): APISIX/3.7.0
- Operating system (run uname -a): Darwin
- OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.21.4.2
- etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run luarocks --version):
The health check API requires some number of tries before a node is marked as unhealthy and removed. If both of your nodes are unavailable then it is guaranteed that your request will fail either way (by the health check mechanism or not).
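While reproducing, the health checker's view of each node can be inspected through the control API (a sketch assuming the default control API address 127.0.0.1:9090, the same address used for server_info in the environment section above; health data only appears once the checker has been triggered by traffic to the route):
curl http://127.0.0.1:9090/v1/healthcheck
Nodes are reported as unhealthy only after the configured number of failures, which matches the behavior described above.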
I know what you mean, but for the same domain, even though other nodes are available, it cannot be guaranteed that traffic will always be forwarded to healthy nodes, because the IP address resolved from the domain is chosen at random.
@shreemaan-abhishek , I would like to debug this.
I was able to reproduce the issue, but this is not a bug. APISIX can only become aware of a failed node once it is unable to connect to the service, so ignoring bad nodes even on the first attempt should be treated as a feature rather than a bug fix.
hey, could you tell me the steps that you performed to set up the domains locally on your system? cc: @sheharyaar @zuiyangqingzhou
@zuiyangqingzhou, do these domains still contain one faulty node?
Yes, you can refer to this.
You can use dnsmasq to set up local domain name resolution @nitishfy
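For anyone else setting this up, a minimal dnsmasq sketch intended to serve several A records per name (the file path is just an example, and depending on your dnsmasq version you may prefer host-record= entries over repeated address= lines):
# e.g. /etc/dnsmasq.d/apisix-dns-test.conf
address=/www.mytest.com/192.168.247.4
address=/www.mytest.com/192.168.247.2
address=/www.mytest.com/192.168.247.3
address=/www.mytemp.com/192.168.246.3
address=/www.mytemp.com/192.168.246.4
address=/www.mytemp.com/192.168.246.2

# restart dnsmasq, then verify with: dig @127.0.0.1 www.mytest.com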
Using domain names in the upstream nodes makes it impossible to distinguish the healthy nodes.
Prometheus scrapes the metric apisix_upstream_status using IP addresses instead of domain names, leaving us unaware of the corresponding node.
Can't we just use the domain name instead of the IP in the healthcheck API? @shreemaan-abhishek @sheharyaar
If two domains have the same IP, APISIX may even use the domain name of an unhealthy node. https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L110
for _, node in ipairs(nodes) do
    if node.domain then
        -- key is "ip:port"; if two domains resolve to the same ip:port,
        -- the later node overwrites the earlier mapping
        local addr = node.host .. ":" .. node.port
        addr_to_domain[addr] = node.domain
    end
end
https://github.com/apache/apisix/blob/master/apisix/balancer.lua#L261
-- the picked server is looked up by ip:port only, so the reported domain may belong to a different node sharing that address
local domain = server_picker.addr_to_domain[server]
res.domain = domain
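To make the collision concrete, here is a small standalone Lua sketch (with hypothetical node data, not APISIX code) showing that when two domains resolve to the same ip:port, addr_to_domain keeps only the last entry written:
local nodes = {
    { host = "192.168.247.4", port = 80, domain = "www.mytest.com" },
    { host = "192.168.247.4", port = 80, domain = "www.mytemp.com" },  -- same ip:port, different domain
}

local addr_to_domain = {}
for _, node in ipairs(nodes) do
    if node.domain then
        local addr = node.host .. ":" .. node.port
        addr_to_domain[addr] = node.domain
    end
end

-- only one domain survives per address, so the domain reported for a picked server can be wrong
print(addr_to_domain["192.168.247.4:80"])  -- prints "www.mytemp.com"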
