Request still returns 200 to static upstream when DNS record is deleted

Open pszeto opened this issue 3 years ago • 6 comments

Gloo Edge Version

1.11.x (latest stable)

Kubernetes Version

1.21.x

Describe the bug

Requests through the gateway-proxy still return 200 (success) after the DNS record is deleted.

Steps to reproduce the bug

  1. Create a DNS record gloo-test.duckdns.org and point it to 34.231.5.222, which is httpbin.org.
  2. Create a static upstream with the host address gloo-test.duckdns.org:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: static-upstream
  namespace: gloo-system
spec:
  static:
    hosts:
      - addr: gloo-test.duckdns.org
        port: 80
  3. Create a VirtualService:
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: static
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
         - prefix: /
        routeAction:
          single:
            upstream:
              name: static-upstream
              namespace: gloo-system
        options:
          autoHostRewrite: true
          headerManipulation:
            requestHeadersToAdd:
            - header:
                key: went-thru-gloo
                value: "true"
  4. Run nslookup against gloo-test.duckdns.org and verify that it resolves correctly:
nslookup gloo-test.duckdns.org
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
Name:	gloo-test.duckdns.org
Address: 34.231.5.222
  5. curl the endpoint through the gateway-proxy: curl $(glooctl proxy url)/get -v
*   Trying 35.194.94.205:80...
* Connected to 35.194.94.205 (35.194.94.205) port 80 (#0)
> GET /get HTTP/1.1
> Host: 35.194.94.205
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< date: Fri, 15 Apr 2022 18:40:59 GMT
< content-type: application/json
< content-length: 356
< server: envoy
< access-control-allow-origin: *
< access-control-allow-credentials: true
< x-envoy-upstream-service-time: 5
<
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Host": "gloo-test.duckdns.org",
    "User-Agent": "curl/7.77.0",
    "Went-Thru-Gloo": "true",
    "X-Amzn-Trace-Id": "Root=1-6259bc3b-2ea5c8ce3ac145af3828e1e0",
    "X-Envoy-Expected-Rq-Timeout-Ms": "15000"
  },
  "origin": "35.188.239.109",
  "url": "http://gloo-test.duckdns.org/get"
}
* Connection #0 to host 35.194.94.205 left intact
  6. Delete the DNS record. Verify that the record is gone; nslookup should report: ** server can't find gloo-test.duckdns.org: NXDOMAIN
  7. Curling the hostname directly should return an error:
* Could not resolve host: gloo-test.duckdns.org
* Closing connection 0
curl: (6) Could not resolve host: gloo-test.duckdns.org
  8. curl the endpoint through the gateway-proxy again: curl $(glooctl proxy url)/get -v. Even after an hour, it still returns a 200 response code.
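
The DNS check and the request through the proxy from the steps above could be scripted roughly as follows. This is only a sketch using the Python standard library; the function names (`resolves`, `proxy_status`, `stale_dns_suspected`) are hypothetical and not part of Gloo Edge or glooctl, and the proxy URL is assumed to be whatever `glooctl proxy url` prints.

```python
import socket
import urllib.error
import urllib.request


def resolves(host, lookup=socket.getaddrinfo):
    """Return True if host currently resolves, False on a resolver error such as NXDOMAIN."""
    try:
        lookup(host, 80)
        return True
    except socket.gaierror:
        return False


def proxy_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code of a GET to url, including 4xx/5xx responses."""
    try:
        with opener(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on error statuses; the code (e.g. 503) is on the exception.
        return err.code


def stale_dns_suspected(host, url, lookup=socket.getaddrinfo, opener=urllib.request.urlopen):
    """True when the name no longer resolves but the proxy still answers 200."""
    return (not resolves(host, lookup)) and proxy_status(url, opener) == 200
```

Calling `stale_dns_suspected("gloo-test.duckdns.org", proxy_url + "/get")` in a loop after deleting the record would flag the behavior reported here. Connection failures to the proxy itself (`urllib.error.URLError`) are deliberately not handled in this sketch.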

Expected Behavior

After deleting the DNS record and getting ** server can't find gloo-test.duckdns.org: NXDOMAIN from nslookup gloo-test.duckdns.org, I expect the call to the static upstream that references that host to return a 503 (no healthy upstream). Access logs from the proxy show:

Once I reboot the gateway proxy, I get the expected 503:

===================================================================================
Fri Apr 15 16:13:14 EDT 2022
nslookup gloo-test.duckdns.org
Server:		8.8.8.8
Address:	8.8.8.8#53

** server can't find gloo-test.duckdns.org: NXDOMAIN

*   Trying 35.194.94.205:80...
* Connected to 35.194.94.205 (35.194.94.205) port 80 (#0)
> GET /get HTTP/1.1
> Host: 35.194.94.205
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-length: 19
< content-type: text/plain
< date: Fri, 15 Apr 2022 20:13:13 GMT
< server: envoy
<
* Connection #0 to host 35.194.94.205 left intact
no healthy upstream

Additional Context

No response

pszeto, Apr 15 '22 20:04

Zendesk ticket #2894 has been linked to this issue.

soloio-bot, Oct 26 '23 19:10

I believe the following is occurring: static upstreams do not re-resolve their IP configuration (https://github.com/envoyproxy/envoy/blob/9fc968d757339d7c476ac890a3eea873caac5ee9/source/extensions/clusters/static/static_cluster.cc#L10), unlike, say, strict DNS (https://github.com/envoyproxy/envoy/blob/9fc968d757339d7c476ac890a3eea873caac5ee9/source/extensions/clusters/strict_dns/strict_dns_cluster.cc#L103).

Solution: use logical or strict DNS resolution.
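
To make the suspected difference concrete, here is a toy model in Python. It is not Envoy's implementation: `FakeDns`, `ResolveOnceCluster`, and `ReResolvingCluster` are hypothetical names, and the model assumes that a deleted record simply yields an empty endpoint set on the next refresh.

```python
class FakeDns:
    """In-memory stand-in for a DNS server; records can be added and deleted."""

    def __init__(self, records):
        self.records = dict(records)

    def delete(self, name):
        self.records.pop(name, None)

    def resolve(self, name):
        # Empty list models a name that no longer exists (NXDOMAIN).
        ip = self.records.get(name)
        return [ip] if ip else []


class ResolveOnceCluster:
    """Resolves the hostname once at creation and keeps that result forever."""

    def __init__(self, dns, name):
        self.endpoints = dns.resolve(name)

    def refresh(self):
        pass  # never re-resolves, so stale endpoints stay in place

    def status(self):
        return 200 if self.endpoints else 503


class ReResolvingCluster(ResolveOnceCluster):
    """Re-resolves the hostname on every refresh and replaces its endpoints."""

    def __init__(self, dns, name):
        super().__init__(dns, name)
        self.dns, self.name = dns, name

    def refresh(self):
        self.endpoints = self.dns.resolve(self.name)


dns = FakeDns({"gloo-test.duckdns.org": "34.231.5.222"})
once = ResolveOnceCluster(dns, "gloo-test.duckdns.org")
periodic = ReResolvingCluster(dns, "gloo-test.duckdns.org")

dns.delete("gloo-test.duckdns.org")
once.refresh()
periodic.refresh()
# once.status() is still 200 (stale endpoint); periodic.status() is now 503.
```

In this model the resolve-once cluster keeps answering 200 until it is recreated, which matches the report that a proxy restart produced the expected 503.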

nfuden, Feb 05 '24 14:02

@nfuden: Since this "works as expected", can we close this one as "Not a bug - won't fix"?

DuncanDoyle, Feb 20 '24 15:02

I would argue for that. If the UX pain persists, perhaps we need to add an insight around this or update our docs to call out the behavior more clearly.

nfuden, Feb 20 '24 16:02

@nfuden: I did find the documentation in Envoy on discovered upstreams: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery

I know we have documentation on static upstreams in our GE docs: https://docs.solo.io/gloo-edge/latest/guides/traffic_management/destination_types/static_upstream/

... but I couldn't find anything on "Strict DNS" and/or "Logical" in our Gloo Edge docs and APIs. My guess is that (some of) the other UpstreamSpecs (https://docs.solo.io/gloo-edge/1.7.23/reference/api/github.com/solo-io/gloo/projects/gloo/api/v1/upstream.proto.sk/) translate into this, but I lack the insight here to guide the docs team (because I do think we should add something about this in our docs).

DuncanDoyle, Feb 20 '24 17:02

We set strict_dns if the hostname is not empty in static upstreams.
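
As a rough illustration of that rule (not the actual Gloo Edge translator code), the choice could be sketched like this in Python. The function name `cluster_type_for` is hypothetical, and the sketch assumes "hostname is not empty" means that at least one configured `addr` is a DNS name rather than an IP literal. The returned strings mirror Envoy's `STATIC` and `STRICT_DNS` cluster discovery types.

```python
import ipaddress


def cluster_type_for(addrs):
    """Pick a cluster type for a static upstream's host addresses.

    Assumption: any address that is not an IP literal is treated as a hostname,
    and a single hostname is enough to require DNS-based discovery.
    """
    for addr in addrs:
        try:
            ipaddress.ip_address(addr)
        except ValueError:
            # Not an IPv4/IPv6 literal, so it must be resolved via DNS.
            return "STRICT_DNS"
    return "STATIC"
```

Under this reading, the upstream from the report (`addr: gloo-test.duckdns.org`) would map to `STRICT_DNS`, while an upstream listing only `34.231.5.222` would map to `STATIC`.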

nfuden, Feb 21 '24 21:02

Can't reproduce on 1.15.14 when using strict_dns (which is created when you use a hostname in the static Upstream).

Reproducer project here: https://github.com/DuncanDoyle/ge-gloo-6331

DuncanDoyle, Mar 11 '24 12:03

Closing. Can't reproduce.

DuncanDoyle, Mar 11 '24 15:03