
Strange DNS behavior when using External Services with CNAME records

Open jstaro opened this issue 3 years ago • 11 comments

Note: I posted my first findings about this on the forum but received no response, so I have put together a new repro.

Overview of the Issue

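DNS resolution (with recursion) behaves inconsistently for external services: a service lookup (<service>.service.consul) fails when the node's address is a hostname that resolves through a CNAME chain, while the node lookup (<node>.node.consul) for the same node succeeds.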

Reproduction Steps

Prerequisites:

  • Running Consul in K8s (installed via the official Helm chart)
  • CoreDNS settings changed according to the documentation so that *.consul names are forwarded to Consul DNS in K8s
  • ACLs disabled
  • Recursor 8.8.8.8 added

Steps:

  1. Register the following service, using the minimal config from the documentation (I believe the NodeMeta entries are only needed for consul-esm):
     {
       "Node": "google",
       "Address": "www.google.com",
       "NodeMeta": {
         "external-node": "true",
         "external-probe": "true"
       },
       "Service": {
         "Service": "search",
         "Port": 80
       }
     }
    
  2. Attach to a running pod and run ping search.service.consul. It succeeds.
  3. Now register a second service:
     {
       "Node": "nytimes",
       "Address": "www.nytimes.com",
       "NodeMeta": {
         "external-node": "true",
         "external-probe": "true"
       },
       "Service": {
         "Service": "nyt",
         "Port": 80
       }
     }
    
  4. Attach to a running pod and run ping nyt.service.consul. It fails with 'bad address'.
  5. Running ping nytimes.node.consul from the same pod succeeds.
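
For anyone who wants to script the two registrations, here is a minimal sketch. It assumes the JSON bodies are sent with a PUT to the agent's /v1/catalog/register HTTP endpoint; the steps above do not say how the services were added, so treat that as an assumption. The helper names build_registration and register, and the agent URL in the comment, are made up for illustration.

```python
import json
import urllib.request


def build_registration(node: str, address: str, service: str, port: int) -> dict:
    """Build a catalog registration body shaped like the ones in the steps."""
    return {
        "Node": node,
        "Address": address,
        # Copied from the report; believed to matter only for consul-esm.
        "NodeMeta": {"external-node": "true", "external-probe": "true"},
        "Service": {"Service": service, "Port": port},
    }


def register(consul_url: str, body: dict) -> int:
    """PUT the body to the catalog register endpoint and return the HTTP status."""
    req = urllib.request.Request(
        f"{consul_url}/v1/catalog/register",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Example call (not executed here; assumes an agent reachable at this URL,
# e.g. through a port-forward):
#   register("http://127.0.0.1:8500",
#            build_registration("nytimes", "www.nytimes.com", "nyt", 80))
```

Keeping the body construction separate from the HTTP call makes it easy to print the payload and compare it against the JSON in the steps before sending anything.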

The only difference I can see is that www.nytimes.com resolves to a CNAME, which in turn points to another CNAME (the site sits behind a CDN), whereas www.google.com returns an A record directly.
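
To make the CNAME-versus-A difference easy to see, a small sketch like the following can split the answer lines printed by a tool such as dig +short into CNAME hops and final addresses. The function name split_answers is hypothetical, and the sample input uses placeholder names and a documentation-range IP; it is not output captured from a real lookup.

```python
import ipaddress


def split_answers(lines):
    """Split `dig +short`-style answer lines into CNAME targets and addresses."""
    cnames, addresses = [], []
    for line in lines:
        value = line.strip()
        if not value:
            continue
        try:
            ipaddress.ip_address(value)
            addresses.append(value)
        except ValueError:
            # Anything that is not an IP literal is treated as a CNAME target.
            cnames.append(value.rstrip("."))
    return cnames, addresses


# Illustrative input only (placeholder names, documentation IP range).
sample = ["cdn.example.net.", "edge.example.org.", "192.0.2.10"]
cnames, addresses = split_answers(sample)
print(len(cnames), "CNAME hop(s) before", addresses)
```

A name that answers with an A record directly yields an empty CNAME list, while a CDN-fronted name yields one or more hops before the address.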

Pinging the node address <node>.node.consul always works, but that is not a usable workaround for me: my users would have to keep track of which services are external (and use node addresses) and which are internal (and use service addresses) in each environment.
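
The node-versus-service comparison can also be scripted from inside a pod instead of running ping by hand. This is only a sketch: can_resolve and compare are hypothetical helpers, and the resolver is passed in as a parameter so the logic can be exercised without a cluster.

```python
import socket


def can_resolve(name, resolver=socket.getaddrinfo):
    """Return True if the name resolves; the resolver is injectable for testing."""
    try:
        resolver(name, None)
        return True
    except socket.gaierror:
        return False


def compare(node, service, resolver=socket.getaddrinfo):
    """Check the node name and the service name that Consul DNS exposes."""
    node_name = f"{node}.node.consul"
    service_name = f"{service}.service.consul"
    return {
        node_name: can_resolve(node_name, resolver),
        service_name: can_resolve(service_name, resolver),
    }


# Example call from inside a pod (not executed here):
#   compare("nytimes", "nyt")
```

With the behavior described in this issue, the node name would be expected to map to True and the service name to False for the nytimes entry.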

It looks as if some kind of recursion limit applies when resolving service names that does not apply when resolving node names.

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0 
        check_ttls = 0     
        checks = 0
        services = 0       
build:
        prerelease =       
        revision = 27de64da
        version = 1.10.0   
consul:
        acl = disabled     
        known_servers = 3  
        server = false     
runtime:
        arch = amd64       
        cpu_count = 8      
        goroutines = 60    
        max_procs = 8        
        os = linux
        version = go1.16.5   
serf_lan:
        coordinate_resets = 0
        encrypted = false    
        event_queue = 0      
        event_time = 40      
        failed = 0       
        health_score = 0 
        intent_queue = 0 
        left = 0
        member_time = 924
        members = 6      
        query_queue = 0  
        query_time = 1   
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 27de64da
        version = 1.10.0
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = 10.3.42.196:8300
        server = true
raft:
        applied_index = 28529296      
        commit_index = 28529296       
        fsm_pending = 0
        last_contact = 0
        last_log_index = 28529297     
        last_log_term = 310
        last_snapshot_index = 28516540
        last_snapshot_term = 310
        latest_configuration = [{Suffrage:Voter ID:3a891384-8162-4a94-a9b6-58b06e340d7a Address:10.3.41.87:8300} {Suffrage:Voter ID:1025407a-2ac4-43c8-900d-76b1de854648 Address:10.3.42.196:8300} {Suffrage:Voter ID:43cbcd81-817b-0b82-29f1-b859e572587b Address:10.3.41.223:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 310
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 176
        max_procs = 8
        os = linux
        version = go1.16.5
serf_lan:
        coordinate_resets = 0
        encrypted = false    
        event_queue = 0      
        event_time = 40
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 924
        members = 6
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 392
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

Running on AKS (Azure)

jstaro, Jul 05 '21 09:07