rabbitmq-server Peer Discovery with DNS record won't cluster

Describe the bug

I am upgrading from 3.11 to 3.13, and my previously working peer discovery no longer works in the new version. It discovers the A records but does nothing with them.

| 1716362235833 | {"time":"2024-05-22 07:17:15.833049+00:00","level":"info","msg":"Addresses discovered via A records of peernodes.mq-nonpci.apse2.lab.securecall.local: 10.10.17.226, 10.10.23.236","pid":"<0.253.0>","domain":"rabbitmq"}
| 1716362235834 | {"time":"2024-05-22 07:17:15.834162+00:00","level":"info","msg":"Addresses discovered via AAAA records of peernodes.mq-nonpci.apse2.lab.securecall.local: ","pid":"<0.253.0>","domain":"rabbitmq"}
| 1716362237520 | =PROGRESS REPORT==== 22-May-2024::07:17:17.511731 ===
| 1716362237520 | supervisor: {local,inet_gethost_native_sup}
| 1716362237520 | started: [{pid,<0.97.0>},{mfa,{inet_gethost_native,init,[[]]}}]
| 1716362237521 | =PROGRESS REPORT==== 22-May-2024::07:17:17.520454 ===
| 1716362237521 | supervisor: {local,kernel_safe_sup}
| 1716362237521 | started: [{pid,<0.96.0>},
| 1716362237521 | {id,inet_gethost_native_sup},
| 1716362237521 | {mfargs,{inet_gethost_native,start_link,[]}},
| 1716362237521 | {restart_type,temporary},
| 1716362237521 | {significant,false},
| 1716362237521 | {shutdown,1000},
| 1716362237521 | {child_type,worker}]
| 1716362237609 | {"time":"2024-05-22 07:17:17.608862+00:00","level":"error","msg":"Peer discovery: could not discover and join another node; proceeding as a standalone node","line":245,"pid":"<0.253.0>","file":"rabbit_peer_discovery.erl","domain":"rabbitmq.peer_discovery","mfa":["rabbit_peer_discovery","retry_sync_desired_cluster",3]}

Reproduction steps

    cluster_formation.peer_discovery_backend = dns
    cluster_formation.dns.hostname =
    cluster_formation.discovery_retry_interval = 500
    cluster_formation.discovery_retry_limit = 10
    cluster_partition_handling = autoheal

Expected behavior

The discovered IP addresses should be used as peers.

Additional context

This was working in 3.11. I restarted the whole cluster, thinking it was an upgrade issue, but the nodes still won't find each other.

May 22 '24 07:05 womblep

I have turned on debug logging, and I think the cause is my load balancer health check: it does not bring the node up (and put it into DNS) until it can reach the traffic port. The new check of "is the node's IP address in the DNS list?" is tripping me up.

May 22 '24 08:05 womblep

Yep, the AWS Network Load Balancer expects an open port to send TCP health-check requests to before it will put the IP address in the DNS entry. The traffic port is not available until clustering has happened, so I am stuck in a catch-22.
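One way to see which ports a TCP health check could reach while the node is still booting is a small probe script. The following is a minimal Python sketch, not part of RabbitMQ: the function names are made up for illustration, and the port list assumes default ports (4369 for epmd, 5672 for AMQP, 25672 for inter-node communication), which may differ in a given deployment.

```python
import socket
from typing import Dict, Iterable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host: str, ports: Iterable[int]) -> Dict[int, bool]:
    """Probe each port once and map port number to reachability."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # Assumed default ports; adjust to match the actual configuration.
    for port, is_open in probe("127.0.0.1", [4369, 5672, 25672]).items():
        print(port, "open" if is_open else "closed")
```

Running this on the node at different stages of startup would show whether any port opens before clustering completes.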

Is there a port which RabbitMQ has open when it starts? I tried 4369 but that doesn't seem to work either.

Otherwise we need a config item to bypass this check:

    ThisNode = node(),
    ThisNodeIsIncluded = lists:member(ThisNode, Nodes),
    case ThisNodeIsIncluded of
        true ->
            ok;
        false ->
            ?LOG_DEBUG(
               "Peer discovery: not satisfyied with discovered peers: the "
               "list does not contain this node",
               #{domain => ?RMQLOG_DOMAIN_PEER_DISC})
    end,

which is a good check, but in this instance it is neither required nor desired.
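To show why the check fails in this setup, here is a minimal Python sketch of the same membership test. It is an illustration only: the real code compares Erlang node names, whereas this sketch compares raw IP addresses. The function name and the address 10.10.30.5 are hypothetical; the other two addresses are the ones from the log above.

```python
from typing import Sequence


def satisfied_with_peers(this_node: str, discovered: Sequence[str]) -> bool:
    """Analogue of lists:member(ThisNode, Nodes): accept only if we are listed."""
    return this_node in discovered


# Addresses returned by the A-record lookup in the log above.
discovered = ["10.10.17.226", "10.10.23.236"]

# A node already registered in DNS passes the check.
print(satisfied_with_peers("10.10.17.226", discovered))  # True

# A booting node (hypothetical address) that the load balancer has not yet
# added to DNS fails it, even though it did discover valid peers.
print(satisfied_with_peers("10.10.30.5", discovered))  # False
```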

May 22 '24 08:05 womblep
