
fix wrong control plane discovery, when serf advertised address located in network, unreachable from clients

tantra35 opened this pull request 3 years ago • 3 comments

Suppose the following configuration, where we use Nomad federation:

  1. We have 2 Nomad regions (A and B).
  2. The control planes of both regions are located in a common network (10.168.0.0/16), while the RPC and HTTP addresses are located in private networks unique to each region (192.168.102.0/24 for region A, 192.168.103.0/24 for region B).
  3. Nomad clients are located only in the private networks and can't communicate with the control plane through 10.168.0.0/16.
  4. On the control plane we advertise addresses like this:
    1. for region A on first server
      advertise
      {
          http = "192.168.102.11"
          rpc =  "192.168.102.11"
          serf = "10.168.0.11"
      } 
      
    2. for region B on first server
      advertise
      {
          http = "192.168.103.11"
          rpc =  "192.168.103.11"
          serf = "10.168.1.11"
      } 
      

In such a configuration, Nomad clients that discover the control plane via Consul will get addresses from the 10.168.0.0/16 network, and of course nothing will work, because they can't reach that network.

This patch tries to fix this situation by adding a new RPC method, RpcPeers, which returns the private RPC addresses instead of the Serf ones.
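
To illustrate the idea, here is a minimal, self-contained sketch (not the actual patch: the ServerMember struct and both helper functions are hypothetical stand-ins, not Nomad internals). Today clients effectively receive Serf-derived addresses; the proposed RpcPeers method would instead return the privately advertised RPC addresses:

package main

import (
	"fmt"
	"net"
	"strconv"
)

// Hypothetical view of what each server advertises; field names are
// illustrative only.
type ServerMember struct {
	SerfIP  string // address gossiped over the 10.168.0.0/16 network
	RPCIP   string // address advertised for RPC in the private network
	RPCPort int
}

// serfPeers mimics the current behavior: the Serf member IP is combined
// with the RPC port, yielding addresses clients can't reach.
func serfPeers(members []ServerMember) []string {
	out := make([]string, 0, len(members))
	for _, m := range members {
		out = append(out, net.JoinHostPort(m.SerfIP, strconv.Itoa(m.RPCPort)))
	}
	return out
}

// rpcPeers sketches the proposed RpcPeers method: return the privately
// advertised RPC addresses that clients can actually reach.
func rpcPeers(members []ServerMember) []string {
	out := make([]string, 0, len(members))
	for _, m := range members {
		out = append(out, net.JoinHostPort(m.RPCIP, strconv.Itoa(m.RPCPort)))
	}
	return out
}

func main() {
	regionA := []ServerMember{
		{SerfIP: "10.168.0.11", RPCIP: "192.168.102.11", RPCPort: 4647},
		{SerfIP: "10.168.0.12", RPCIP: "192.168.102.12", RPCPort: 4647},
		{SerfIP: "10.168.0.13", RPCIP: "192.168.102.13", RPCPort: 4647},
	}
	fmt.Println("current peers: ", serfPeers(regionA)) // 10.168.0.x:4647, unreachable from clients
	fmt.Println("proposed peers:", rpcPeers(regionA))  // 192.168.102.x:4647, reachable
}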

tantra35 commented Jan 21 '22

@tgross Hm, I just tested on my test stand, and as expected the peers endpoint returns the wrong addresses:

vagrant@consulnomad-11x-1:~/nomad$ curl -H "X-Nomad-Token: 5f1b4db3-8110-de8f-486e-3c8c806e1912" -s "http://localhost:4646/v1/status/peers"; echo
["10.168.0.12:4647","10.168.0.13:4647","10.168.0.11:4647"]

Our Nomad configs:

  1. server.hcl
    datacenter = "test"
    data_dir = "/var/lib/nomad/"
    
    disable_update_check = true
    
    enable_syslog = false
    log_level = "INFO"
    
    server
    {
      enabled          = true

      raft_protocol    = 3
      bootstrap_expect = 3
    }
    
  2. acl.hcl
    acl
    {
      enabled = true
    }
    
  3. consul.hcl
    consul
    {
      token = "018e6c3c-7c4a-4793-b9fd-d8cde8618269"
    }
    
  4. advertise.hcl (it differs per server)
    advertise
    {
            http = "192.168.102.11"
            rpc = "192.168.102.11"
            serf = "10.168.0.11"
    }
    

On this stand I run only the control plane, without any clients or anything else, just 3 control plane servers (Nomad + Consul). In case it matters, Consul also advertises a WAN address, like this:

/opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.11 -advertise-wan=10.168.0.11
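
To confirm which address Consul gossips on each pool, the LAN and WAN member lists can be compared via the Consul Go API (a sketch; with the flags above, LAN members should show the 192.168.102.x addresses and WAN members the 10.168.0.x ones):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig talks to the local agent on 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LAN pool: should reflect -advertise (192.168.102.x here).
	lan, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range lan {
		fmt.Printf("LAN %s %s:%d\n", m.Name, m.Addr, m.Port)
	}

	// WAN pool: should reflect -advertise-wan (10.168.0.x here).
	wan, err := client.Agent().Members(true)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range wan {
		fmt.Printf("WAN %s %s:%d\n", m.Name, m.Addr, m.Port)
	}
}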

If it would be useful, I can provide my Vagrant stand.

tantra35 commented Jan 24 '22

@tgross I have rebuilt the test stand with only Nomad installed (Consul is not installed on that stand), and I don't understand why you get the right result and I don't. The server config now holds a join stanza:

datacenter = "test"
data_dir = "/var/lib/nomad/"

disable_update_check = true

enable_syslog = false
log_level = "INFO"

server
{
  enabled          = true

  raft_protocol    = 3
  bootstrap_expect = 3

  server_join {
    retry_join = ["192.168.102.11", "192.168.102.12", "192.168.102.13"]
  }
}

Then I try to get the server members (the tokens shown are regenerated every time a new stand is started):

vagrant@consulnomad-11x-1:~/nomad$ nomad server members -token=4225fc30-0caf-6b23-7a2d-4f4961a8111a
Name                      Address      Port  Status  Leader  Protocol  Build   Datacenter  Region
consulnomad-11x-1.global  10.168.0.11  4648  alive   false   2         1.1.10  test        global
consulnomad-11x-2.global  10.168.0.12  4648  alive   true    2         1.1.10  test        global
consulnomad-11x-3.global  10.168.0.13  4648  alive   false   2         1.1.10  test        global

Then I try to call Status.Peers via the REST interface:

vagrant@consulnomad-11x-1:~/nomad$ curl -H "X-Nomad-Token: 4225fc30-0caf-6b23-7a2d-4f4961a8111a" -s "http://localhost:4646/v1/status/peers"; echo
["10.168.0.12:4647","10.168.0.11:4647","10.168.0.13:4647"]

And IMO this is absolutely expected behavior; I've never seen Nomad work any other way.
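
A sketch of why this looks expected (hedged: this mirrors the observed behavior above, not Nomad's exact internals): the peer RPC address appears to be derived from the Serf member's IP plus the RPC port learned from gossip, which is why nomad server members (port 4648) and /v1/status/peers (port 4647) show the same 10.168.0.x IPs:

package main

import (
	"fmt"
	"net"
)

// peerFromSerf models the apparent derivation: keep the Serf member's
// host, swap in the RPC port.
func peerFromSerf(serfAddr, rpcPort string) (string, error) {
	host, _, err := net.SplitHostPort(serfAddr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(host, rpcPort), nil
}

func main() {
	for _, serf := range []string{"10.168.0.11:4648", "10.168.0.12:4648", "10.168.0.13:4648"} {
		peer, err := peerFromSerf(serf, "4647")
		if err != nil {
			panic(err)
		}
		fmt.Println(serf, "->", peer) // e.g. 10.168.0.11:4648 -> 10.168.0.11:4647
	}
}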

Maybe something changed between master and the 1.1.10 version? But that's strange, because I don't see significant changes between the versions which would address these points.

tantra35 commented Jan 25 '22

CLA assistant check
All committers have signed the CLA.

hashicorp-cla commented Mar 12 '22

I'm going to close this PR in favor of https://github.com/hashicorp/nomad/pull/16217. https://github.com/hashicorp/nomad/issues/16211 was opened recently, and it gave us the extra clues we needed to reproduce this problem and ensure we had the right fix. Sorry this got left for so long!

tgross commented Feb 17 '23