Request to static upstream still returns 200 after the DNS record is deleted.
Gloo Edge Version
1.11.x (latest stable)
Kubernetes Version
1.21.x
Describe the bug
Requests through the gateway-proxy still return 200 OK after the DNS record is deleted.
Steps to reproduce the bug
- Create a DNS record gloo-test.duckdns.org and point it to 34.231.5.222, which is httpbin.org.
- Create a static upstream with host address gloo-test.duckdns.org:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: static-upstream
  namespace: gloo-system
spec:
  static:
    hosts:
    - addr: gloo-test.duckdns.org
      port: 80
- Create a VirtualService:
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: static
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - prefix: /
      routeAction:
        single:
          upstream:
            name: static-upstream
            namespace: gloo-system
      options:
        autoHostRewrite: true
        headerManipulation:
          requestHeadersToAdd:
          - header:
              key: went-thru-gloo
              value: "true"
- Run nslookup on gloo-test.duckdns.org and verify it's correct:
nslookup gloo-test.duckdns.org
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
Name: gloo-test.duckdns.org
Address: 34.231.5.222
- curl the endpoint through the gateway-proxy:
curl $(glooctl proxy url)/get -v
* Trying 35.194.94.205:80...
* Connected to 35.194.94.205 (35.194.94.205) port 80 (#0)
> GET /get HTTP/1.1
> Host: 35.194.94.205
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< date: Fri, 15 Apr 2022 18:40:59 GMT
< content-type: application/json
< content-length: 356
< server: envoy
< access-control-allow-origin: *
< access-control-allow-credentials: true
< x-envoy-upstream-service-time: 5
<
{
"args": {},
"headers": {
"Accept": "*/*",
"Host": "gloo-test.duckdns.org",
"User-Agent": "curl/7.77.0",
"Went-Thru-Gloo": "true",
"X-Amzn-Trace-Id": "Root=1-6259bc3b-2ea5c8ce3ac145af3828e1e0",
"X-Envoy-Expected-Rq-Timeout-Ms": "15000"
},
"origin": "35.188.239.109",
"url": "http://gloo-test.duckdns.org/get"
}
* Connection #0 to host 35.194.94.205 left intact
- Delete the DNS record. Verify the record is gone:
** server can't find gloo-test.duckdns.org: NXDOMAIN
- Curling the hostname directly should return an error:
* Could not resolve host: gloo-test.duckdns.org
* Closing connection 0
curl: (6) Could not resolve host: gloo-test.duckdns.org
- curl the endpoint through the gateway-proxy:
curl $(glooctl proxy url)/get -v
Even after an hour, it still returns a 200 response code.
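The last two checks can be scripted so that the DNS state and the proxy response are logged side by side. This is a minimal sketch, not part of the original report: the helper names (`dns_state`, `proxy_status`, `check_once`) are hypothetical, and it assumes `nslookup` exits non-zero when the lookup fails.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of glooctl or Gloo Edge.

# dns_state HOST: print "resolves" if nslookup succeeds, "unresolved" otherwise.
# Assumption: nslookup exits non-zero on a failed query (NXDOMAIN, timeout, ...).
dns_state() {
  if nslookup "$1" >/dev/null 2>&1; then
    echo "resolves"
  else
    echo "unresolved"
  fi
}

# proxy_status URL: print only the HTTP status code returned for URL.
proxy_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# check_once HOST URL: print both observations on a single line.
check_once() {
  echo "dns=$(dns_state "$1") status=$(proxy_status "$2")"
}
```

For example, `while true; do date; check_once gloo-test.duckdns.org "$(glooctl proxy url)/get"; sleep 60; done` prints one line per minute. In the situation described in this report, the line would keep reading `dns=unresolved status=200`.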
Expected Behavior
After deleting the DNS record and getting ** server can't find gloo-test.duckdns.org: NXDOMAIN from nslookup gloo-test.duckdns.org, I expect the call to the static upstream referencing that host to return a 503 (no healthy upstream). Access logs from the proxy show:
Once I restart the gateway proxy, I get the expected 503:
===================================================================================
Fri Apr 15 16:13:14 EDT 2022
nslookup gloo-test.duckdns.org
Server: 8.8.8.8
Address: 8.8.8.8#53
** server can't find gloo-test.duckdns.org: NXDOMAIN
* Trying 35.194.94.205:80...
* Connected to 35.194.94.205 (35.194.94.205) port 80 (#0)
> GET /get HTTP/1.1
> Host: 35.194.94.205
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-length: 19
< content-type: text/plain
< date: Fri, 15 Apr 2022 20:13:13 GMT
< server: envoy
<
* Connection #0 to host 35.194.94.205 left intact
no healthy upstream
Additional Context
No response
Zendesk ticket #2894 has been linked to this issue.
I believe the following is occurring: static upstreams do not re-resolve their IP configuration (https://github.com/envoyproxy/envoy/blob/9fc968d757339d7c476ac890a3eea873caac5ee9/source/extensions/clusters/static/static_cluster.cc#L10), unlike, say, strict DNS clusters (https://github.com/envoyproxy/envoy/blob/9fc968d757339d7c476ac890a3eea873caac5ee9/source/extensions/clusters/strict_dns/strict_dns_cluster.cc#L103).
Solution: use logical or strict DNS resolution.
@nfuden: Since this "works as expected", can we close this one as "Not a bug - won't fix"?
I would argue for that. If the UX pain persists, perhaps we need to add an insight around this or update our docs to call out the behavior more clearly.
@nfuden: I did find the documentation in Envoy on discovered upstreams: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery
I know we have documentation on static upstreams in our GE docs: https://docs.solo.io/gloo-edge/latest/guides/traffic_management/destination_types/static_upstream/
... but I couldn't find anything on "Strict DNS" or "Logical DNS" in our Gloo Edge docs and APIs. My guess is that (some of) the other UpstreamSpecs (https://docs.solo.io/gloo-edge/1.7.23/reference/api/github.com/solo-io/gloo/projects/gloo/api/v1/upstream.proto.sk/) translate into this, but I lack the insight needed to guide the docs team here (I do think we should add something about this to our docs).
We set strict_dns if hostname is not empty in static upstreams
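One way to confirm which discovery type the generated cluster actually uses is to search the Envoy config dump for it. This is a rough sketch, not a verified procedure: the helper name `uses_strict_dns` is hypothetical, and it assumes the gateway-proxy admin interface listens on port 19000 and that the dump serializes the cluster type as `"type": "STRICT_DNS"`.

```shell
#!/bin/sh
# Sketch only: uses_strict_dns is a hypothetical helper, not a Gloo Edge command.

# uses_strict_dns: read an Envoy config dump on stdin and succeed (exit 0)
# if any entry declares the STRICT_DNS discovery type.
uses_strict_dns() {
  grep -q '"type": *"STRICT_DNS"'
}

# Assumed usage (admin port 19000 is an assumption about the deployment):
#   kubectl -n gloo-system port-forward deploy/gateway-proxy 19000:19000 &
#   curl -s localhost:19000/config_dump | uses_strict_dns \
#     && echo "a STRICT_DNS cluster is present" \
#     || echo "no STRICT_DNS cluster found"
```

The check is coarse: it reports whether any cluster in the dump is STRICT_DNS, not which one, so the matching entry should still be inspected by hand.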
Can't reproduce on 1.15.14 when using strict_dns (which is created when you use a hostname in the static Upstream).
Reproducer project here: https://github.com/DuncanDoyle/ge-gloo-6331
Closing. Can't reproduce.