
fix wrong control plane discovery, when serf advertised address located in network, unreachable from clients

tantra35 opened this pull request 3 years ago • 3 comments

Suppose the following configuration, where we use Nomad federation:

  1. We have 2 Nomad regions (A and B).
  2. The control planes of both regions are located in a common network (10.168.0.0/16), while the RPC and HTTP addresses are located in private networks unique to each region (192.168.102.0/24 for region A, 192.168.103.0/24 for region B).
  3. Nomad clients are located only in the private networks and can't communicate with the control plane through 10.168.0.0/16.
  4. On the control plane we advertise addresses like this:
    1. for region A on first server
      advertise
      {
          http = "192.168.102.11"
          rpc =  "192.168.102.11"
          serf = "10.168.0.11"
      } 
      
    2. for region B on first server
      advertise
      {
          http = "192.168.103.11"
          rpc =  "192.168.103.11"
          serf = "10.168.1.11"
      } 
      

In such a configuration, Nomad clients that discover the control plane via Consul will get addresses from the 10.168.0.0/16 network, and of course nothing will work, because they can't reach that network.

This patch tries to fix this situation by adding a new RPC method, RpcPeers, which returns the private RPC addresses instead of the Serf ones.
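
To illustrate the idea, here is a minimal, self-contained sketch (not the actual patch: the ServerMember struct and both helper functions are hypothetical stand-ins, not Nomad internals). Today clients effectively receive Serf-derived addresses; the proposed RpcPeers method would instead return the privately advertised RPC addresses:

package main

import (
	"fmt"
	"net"
	"strconv"
)

// Hypothetical view of what each server advertises; field names are
// illustrative only.
type ServerMember struct {
	SerfIP  string // address gossiped over the 10.168.0.0/16 network
	RPCIP   string // address advertised for RPC in the private network
	RPCPort int
}

// serfPeers mimics the current behavior: the Serf member IP is combined
// with the RPC port, yielding addresses clients can't reach.
func serfPeers(members []ServerMember) []string {
	out := make([]string, 0, len(members))
	for _, m := range members {
		out = append(out, net.JoinHostPort(m.SerfIP, strconv.Itoa(m.RPCPort)))
	}
	return out
}

// rpcPeers sketches the proposed RpcPeers method: return the privately
// advertised RPC addresses that clients can actually reach.
func rpcPeers(members []ServerMember) []string {
	out := make([]string, 0, len(members))
	for _, m := range members {
		out = append(out, net.JoinHostPort(m.RPCIP, strconv.Itoa(m.RPCPort)))
	}
	return out
}

func main() {
	regionA := []ServerMember{
		{SerfIP: "10.168.0.11", RPCIP: "192.168.102.11", RPCPort: 4647},
		{SerfIP: "10.168.0.12", RPCIP: "192.168.102.12", RPCPort: 4647},
		{SerfIP: "10.168.0.13", RPCIP: "192.168.102.13", RPCPort: 4647},
	}
	fmt.Println("current peers: ", serfPeers(regionA)) // 10.168.0.x:4647, unreachable from clients
	fmt.Println("proposed peers:", rpcPeers(regionA))  // 192.168.102.x:4647, reachable
}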

tantra35 commented Jan 21 '22

@tgross Hm, I just tested on my test stand, and as expected the peers endpoint returns the wrong addresses:

vagrant@consulnomad-11x-1:~/nomad$ curl -H "X-Nomad-Token: 5f1b4db3-8110-de8f-486e-3c8c806e1912" -s "http://localhost:4646/v1/status/peers"; echo
["10.168.0.12:4647","10.168.0.13:4647","10.168.0.11:4647"]

Our Nomad configs:

  1. server.hcl
    datacenter = "test"
    data_dir = "/var/lib/nomad/"
    
    disable_update_check = true
    
    enable_syslog = false
    log_level = "INFO"
    
    server
    {
      enabled          = true

      raft_protocol    = 3
      bootstrap_expect = 3
    }
    
  2. acl.hcl
    acl
    {
      enabled = true
    }
    
  3. consul.hcl
    consul
    {
      token = "018e6c3c-7c4a-4793-b9fd-d8cde8618269"
    }
    
  4. advertise.hcl (it differs per server)
    advertise
    {
            http = "192.168.102.11"
            rpc = "192.168.102.11"
            serf = "10.168.0.11"
    }
    

On this stand I run only the control plane, without any clients or anything else, just 3 control plane servers (Nomad + Consul). In case it matters, Consul also advertises a WAN address, like this:

/opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.11 -advertise-wan=10.168.0.11
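
To confirm which address Consul gossips on each pool, the LAN and WAN member lists can be compared via the Consul Go API (a sketch; with the flags above, LAN members should show the 192.168.102.x addresses and WAN members the 10.168.0.x ones):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig talks to the local agent on 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LAN pool: should reflect -advertise (192.168.102.x here).
	lan, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range lan {
		fmt.Printf("LAN %s %s:%d\n", m.Name, m.Addr, m.Port)
	}

	// WAN pool: should reflect -advertise-wan (10.168.0.x here).
	wan, err := client.Agent().Members(true)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range wan {
		fmt.Printf("WAN %s %s:%d\n", m.Name, m.Addr, m.Port)
	}
}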

If it would be useful, I can provide my Vagrant stand.

tantra35 commented Jan 24 '22

@tgross I have rebuilt the test stand with only Nomad installed (Consul is not installed on that stand), and I don't understand why you get the right result and I don't. The server config now holds a join stanza:

datacenter = "test"
data_dir = "/var/lib/nomad/"

disable_update_check = true

enable_syslog = false
log_level = "INFO"

server
{
  enabled          = true

  raft_protocol    = 3
  bootstrap_expect = 3

  server_join {
    retry_join = ["192.168.102.11", "192.168.102.12", "192.168.102.13"]
  }
}

Then I try to get the server members (the tokens shown are regenerated every time a new stand is started):

vagrant@consulnomad-11x-1:~/nomad$ nomad server members -token=4225fc30-0caf-6b23-7a2d-4f4961a8111a
Name                      Address      Port  Status  Leader  Protocol  Build   Datacenter  Region
consulnomad-11x-1.global  10.168.0.11  4648  alive   false   2         1.1.10  test        global
consulnomad-11x-2.global  10.168.0.12  4648  alive   true    2         1.1.10  test        global
consulnomad-11x-3.global  10.168.0.13  4648  alive   false   2         1.1.10  test        global

Then I try to call Status.Peers via the REST interface:

vagrant@consulnomad-11x-1:~/nomad$ curl -H "X-Nomad-Token: 4225fc30-0caf-6b23-7a2d-4f4961a8111a" -s "http://localhost:4646/v1/status/peers"; echo
["10.168.0.12:4647","10.168.0.11:4647","10.168.0.13:4647"]

And IMO this is absolutely expected behavior; I've never seen Nomad work any other way.
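
A sketch of why this looks expected (hedged: this mirrors the observed behavior above, not Nomad's exact internals): the peer RPC address appears to be derived from the Serf member's IP plus the RPC port learned from gossip, which is why nomad server members (port 4648) and /v1/status/peers (port 4647) show the same 10.168.0.x IPs:

package main

import (
	"fmt"
	"net"
)

// peerFromSerf models the apparent derivation: keep the Serf member's
// host, swap in the RPC port.
func peerFromSerf(serfAddr, rpcPort string) (string, error) {
	host, _, err := net.SplitHostPort(serfAddr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(host, rpcPort), nil
}

func main() {
	for _, serf := range []string{"10.168.0.11:4648", "10.168.0.12:4648", "10.168.0.13:4648"} {
		peer, err := peerFromSerf(serf, "4647")
		if err != nil {
			panic(err)
		}
		fmt.Println(serf, "->", peer) // e.g. 10.168.0.11:4648 -> 10.168.0.11:4647
	}
}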

Maybe something changed between master and the 1.1.10 version? But that's strange, because I don't see significant changes between the versions which would address these points.

tantra35 commented Jan 25 '22

CLA assistant check
All committers have signed the CLA.

hashicorp-cla commented Mar 12 '22

I'm going to close this PR in favor of https://github.com/hashicorp/nomad/pull/16217. https://github.com/hashicorp/nomad/issues/16211 was opened recently, and it gave us the extra clues we needed to reproduce this problem and ensure we had the right fix. Sorry this got left for so long!

tgross commented Feb 17 '23