rabbitmq-server
Peer discovery with DNS records won't cluster
Describe the bug
I am upgrading from 3.11 to 3.13, and peer discovery that worked before no longer works in the new version. It discovers the A records but does nothing with them.
| 1716362235833 | {"time":"2024-05-22 07:17:15.833049+00:00","level":"info","msg":"Addresses discovered via A records of peernodes.mq-nonpci.apse2.lab.securecall.local: 10.10.17.226, 10.10.23.236","pid":"<0.253.0>","domain":"rabbitmq"}
| 1716362235834 | {"time":"2024-05-22 07:17:15.834162+00:00","level":"info","msg":"Addresses discovered via AAAA records of peernodes.mq-nonpci.apse2.lab.securecall.local: ","pid":"<0.253.0>","domain":"rabbitmq"}
| 1716362237520 | =PROGRESS REPORT==== 22-May-2024::07:17:17.511731 ===
| 1716362237520 | supervisor: {local,inet_gethost_native_sup}
| 1716362237520 | started: [{pid,<0.97.0>},{mfa,{inet_gethost_native,init,[[]]}}]
| 1716362237521 | =PROGRESS REPORT==== 22-May-2024::07:17:17.520454 ===
| 1716362237521 | supervisor: {local,kernel_safe_sup}
| 1716362237521 | started: [{pid,<0.96.0>},
| 1716362237521 | {id,inet_gethost_native_sup},
| 1716362237521 | {mfargs,{inet_gethost_native,start_link,[]}},
| 1716362237521 | {restart_type,temporary},
| 1716362237521 | {significant,false},
| 1716362237521 | {shutdown,1000},
| 1716362237521 | {child_type,worker}]
| 1716362237609 | {"time":"2024-05-22 07:17:17.608862+00:00","level":"error","msg":"Peer discovery: could not discover and join another node; proceeding as a standalone node","line":245,"pid":"<0.253.0>","file":"rabbit_peer_discovery.erl","domain":"rabbitmq.peer_discovery","mfa":["rabbit_peer_discovery","retry_sync_desired_cluster",3]}
Reproduction steps
cluster_formation.peer_discovery_backend = dns
cluster_formation.dns.hostname =
Expected behavior
The IP addresses found would be used as peers
Additional context
This was working in 3.11. I restarted the whole cluster, thinking it was an upgrade issue, but the nodes still won't find each other.
I have turned up debug logging and I think the cause is my load balancer health check, which does not bring the node up (and put it into DNS) until it can reach the traffic port. The new check of "is the node's IP address in the DNS list?" is tripping me up.
Yep, the AWS Network Load Balancer expects a port it can send TCP requests to for the health check before it will put the IP address in the DNS entry. The traffic port is not available until clustering has happened, so I am stuck in a catch-22.
Is there a port that RabbitMQ has open as soon as it starts? I tried 4369 but that doesn't seem to work either.
Otherwise we need a config item to bypass this check:
    ThisNode = node(),
    ThisNodeIsIncluded = lists:member(ThisNode, Nodes),
    case ThisNodeIsIncluded of
        true ->
            ok;
        false ->
            ?LOG_DEBUG(
               "Peer discovery: not satisfyied with discovered peers: the "
               "list does not contain this node",
               #{domain => ?RMQLOG_DOMAIN_PEER_DISC})
    end,
which is a good check, but in this instance it is neither required nor desired.
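To make the catch-22 concrete, here is a rough Python model of the situation. It is not RabbitMQ code: the function names and the node names (`rabbit@node-a` and so on) are made up for illustration, and it assumes, as described above, that the load balancer only adds a node to DNS once the traffic port answers, and that the traffic port only opens after clustering.

```python
def this_node_is_included(this_node, discovered_nodes):
    """Simplified stand-in for lists:member(ThisNode, Nodes) in the snippet above."""
    return this_node in discovered_nodes


def nlb_registers_node(traffic_port_open):
    """Assumption: the NLB puts a target into DNS only after its TCP health check passes."""
    return traffic_port_open


def simulate_boot(this_node, healthy_peers):
    """Return True if a booting node would see itself in the discovered list."""
    # Assumption: at boot the traffic port is still closed, because it only
    # opens once clustering has completed.
    traffic_port_open = False

    discovered = list(healthy_peers)
    if nlb_registers_node(traffic_port_open):
        discovered.append(this_node)

    return this_node_is_included(this_node, discovered)


if __name__ == "__main__":
    peers = ["rabbit@node-a", "rabbit@node-b"]  # hypothetical node names
    print(simulate_boot("rabbit@node-c", peers))
```

Under those assumptions the booting node never appears in its own discovery results, so the membership check cannot pass however many times discovery is retried.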