Fix wrong control plane discovery when the Serf advertised address is located in a network unreachable from clients
Suppose the following configuration, where we use Nomad federation:

- We have 2 Nomad regions (`A` and `B`).
- The control plane of both regions is located in a common network, `10.168.0.0/16`; the RPC and HTTP addresses are located in private networks, unique for each region (`192.168.102.0/24` for region `A`, and `192.168.103.0/24` for region `B`).
- Nomad clients are located only in the private networks and can't communicate with the control plane through `10.168.0.0/16`.
- On the control plane we advertise addresses like this:
  - for region `A`, on the first server:
    ```hcl
    advertise {
      http = "192.168.102.11"
      rpc  = "192.168.102.11"
      serf = "10.168.0.11"
    }
    ```
  - for region `B`, on the first server:
    ```hcl
    advertise {
      http = "192.168.103.11"
      rpc  = "192.168.103.11"
      serf = "10.168.1.11"
    }
    ```

In such a configuration, Nomad clients that discover the control plane via Consul will get addresses from the `10.168.0.0/16` network, and of course nothing will work, because they can't reach that network.
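The reachability mismatch described above can be checked mechanically. Here is a minimal sketch (the subnets and addresses are taken from the description above; the `reachable` helper is mine, a stand-in for real routing, assuming clients can only reach their own private network):

```go
package main

import (
	"fmt"
	"net"
)

// reachable reports whether addr falls inside the client's CIDR.
// This is a simplification: it assumes clients have no route to
// any network other than their own private one.
func reachable(clientCIDR, addr string) bool {
	_, subnet, err := net.ParseCIDR(clientCIDR)
	if err != nil {
		return false
	}
	return subnet.Contains(net.ParseIP(addr))
}

func main() {
	clientNet := "192.168.102.0/24" // region A client network

	fmt.Println(reachable(clientNet, "192.168.102.11")) // advertised rpc  -> true
	fmt.Println(reachable(clientNet, "10.168.0.11"))    // advertised serf -> false
}
```

So a client handed the Serf address has no way to connect, while the advertised RPC address would have worked.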
This patch tries to fix this situation by adding a new RPC method, RpcPeers, which returns the private RPC addresses instead of the Serf ones.
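For illustration, here is a minimal sketch of the idea behind such a method: return each peer's advertised RPC address rather than its Serf address. The `serverInfo` type and `rpcPeers` function are hypothetical stand-ins, not Nomad's actual internals:

```go
package main

import "fmt"

// serverInfo is a hypothetical view of a known server, holding
// both its Serf (gossip) and RPC advertise addresses.
type serverInfo struct {
	Name     string
	SerfAddr string
	RPCAddr  string
}

// rpcPeers mimics the proposed RpcPeers method: it returns the
// private RPC addresses instead of the Serf ones, so clients in
// the private networks get addresses they can actually reach.
func rpcPeers(servers []serverInfo) []string {
	addrs := make([]string, 0, len(servers))
	for _, s := range servers {
		addrs = append(addrs, s.RPCAddr)
	}
	return addrs
}

func main() {
	servers := []serverInfo{
		{Name: "server-1", SerfAddr: "10.168.0.11:4648", RPCAddr: "192.168.102.11:4647"},
		{Name: "server-2", SerfAddr: "10.168.0.12:4648", RPCAddr: "192.168.102.12:4647"},
	}
	fmt.Println(rpcPeers(servers)) // the 192.168.102.x addresses, not 10.168.0.x
}
```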
@tgross Hm, I just tested on my test stand, and as expected, peers returns the wrong addresses:
```
vagrant@consulnomad-11x-1:~/nomad$ curl -H "X-Nomad-Token: 5f1b4db3-8110-de8f-486e-3c8c806e1912" -s "http://localhost:4646/v1/status/peers"; echo
["10.168.0.12:4647","10.168.0.13:4647","10.168.0.11:4647"]
```
our Nomad configs:

`server.hcl`
```hcl
datacenter = "test"
data_dir = "/var/lib/nomad/"
disable_update_check = true
enable_syslog = false
log_level = "INFO"

server {
  enabled = true
  raft_protocol = 3
  bootstrap_expect = 3
}
```

`acl.hcl`
```hcl
acl {
  enabled = true
}
```

`consul.hcl`
```hcl
consul {
  token = "018e6c3c-7c4a-4793-b9fd-d8cde8618269"
}
```

`advertise.hcl` (it differs per server)
```hcl
advertise {
  http = "192.168.102.11"
  rpc  = "192.168.102.11"
  serf = "10.168.0.11"
}
```
On this stand I brought up only the control plane, without any clients or anything else, just 3 control plane servers (Nomad + Consul). In case it's relevant, Consul also advertises a WAN address, like this:
```
/opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.11 -advertise-wan=10.168.0.11
```
If it would be useful, I can provide my Vagrant stand.
@tgross I have rebuilt the test stand so that only Nomad is installed (Consul is not installed on that stand), and I don't understand why you get the right result and I don't. The server config now holds the join settings:
```hcl
datacenter = "test"
data_dir = "/var/lib/nomad/"
disable_update_check = true
enable_syslog = false
log_level = "INFO"

server {
  enabled = true
  raft_protocol = 3
  bootstrap_expect = 3

  server_join {
    retry_join = ["192.168.102.11", "192.168.102.12", "192.168.102.13"]
  }
}
```
Then I try to get the server members (the tokens shown here are regenerated every time a new stand is started):
```
vagrant@consulnomad-11x-1:~/nomad$ nomad server members -token=4225fc30-0caf-6b23-7a2d-4f4961a8111a
Name                      Address      Port  Status  Leader  Protocol  Build   Datacenter  Region
consulnomad-11x-1.global  10.168.0.11  4648  alive   false   2         1.1.10  test        global
consulnomad-11x-2.global  10.168.0.12  4648  alive   true    2         1.1.10  test        global
consulnomad-11x-3.global  10.168.0.13  4648  alive   false   2         1.1.10  test        global
```
Then I try to call Status.Peers via the REST interface:
```
vagrant@consulnomad-11x-1:~/nomad$ curl -H "X-Nomad-Token: 4225fc30-0caf-6b23-7a2d-4f4961a8111a" -s "http://localhost:4646/v1/status/peers"; echo
["10.168.0.12:4647","10.168.0.11:4647","10.168.0.13:4647"]
```
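The mismatch is easy to confirm programmatically. This quick check (the response body is copied from the output above; `reachablePeers` is my own helper, not part of Nomad) filters the returned peers down to those a region-A client could actually reach:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// reachablePeers decodes the JSON body of /v1/status/peers and
// returns only the peers whose host falls inside clientCIDR.
func reachablePeers(body, clientCIDR string) ([]string, error) {
	var peers []string
	if err := json.Unmarshal([]byte(body), &peers); err != nil {
		return nil, err
	}
	_, subnet, err := net.ParseCIDR(clientCIDR)
	if err != nil {
		return nil, err
	}
	var ok []string
	for _, p := range peers {
		host, _, err := net.SplitHostPort(p)
		if err != nil {
			continue
		}
		if subnet.Contains(net.ParseIP(host)) {
			ok = append(ok, p)
		}
	}
	return ok, nil
}

func main() {
	// Raw body returned by /v1/status/peers above.
	body := `["10.168.0.12:4647","10.168.0.11:4647","10.168.0.13:4647"]`

	got, err := reachablePeers(body, "192.168.102.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println("client-reachable peers:", got) // empty: none are reachable
}
```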
And IMO this is absolutely expected behavior; I've never seen Nomad work any other way. Maybe something changed between master and version 1.1.10? But that's strange, because I don't see significant changes between the versions that would address these points.
I'm going to close this PR in favor of https://github.com/hashicorp/nomad/pull/16217. https://github.com/hashicorp/nomad/issues/16211 was opened recently, and it gave us some extra clues we needed to reproduce this problem and ensure we had the right fix. Sorry this got left for so long!