rabbitmq-server
Feature flags detection sometimes triggers `erpc,noconnection`
Describe the bug
- Start a RabbitMQ cluster
- Restart a node
Logs
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: on node `rabbit@rabbit2`:
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: exception error: {erpc,noconnection}
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in function erpc:call/5 (erpc.erl, line 710)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1123)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from lists:foreach_1/2 (lists.erl, line 1442)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_feature_flags:check_node_compatibility_v1/2 (rabbit_feature_flags.erl, line 1599)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_mnesia:check_rabbit_consistency/2 (rabbit_mnesia.erl, line 1017)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_mnesia:check_consistency/5 (rabbit_mnesia.erl, line 948)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_mnesia:check_cluster_consistency/2 (rabbit_mnesia.erl, line 746)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from lists:foldl/3 (lists.erl, line 1350)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0>
2023-05-24 01:39:55.243345-07:00 [error] <0.277.0> Mnesia(rabbit@rabbit3): ** ERROR ** Mnesia on rabbit@rabbit3 could not connect to node(s) [rabbit@rabbit2]
Reproduction steps
See above.
Expected behavior
No erpc error: the call is either retried, or it is not attempted until Erlang distribution (disterl) is definitely up and running.
Additional context
Observed in the following situations:
- https://pivotal-esc.atlassian.net/browse/VESC-1073
- https://github.com/rabbitmq/rabbitmq-server/issues/8114
- https://vmware.slack.com/archives/C0RDGG81Z/p1684967685447889
I think the expected behavior should be "the operation is retried N times" :)
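To make "retried N times" concrete, here is a minimal sketch of a bounded retry. It is written in Python for illustration only and is not the actual rabbit_ff_controller code; `NoConnection`, `call_with_retries`, and all parameters are hypothetical names standing in for the `{erpc,noconnection}` failure and the remote call.

```python
import time


class NoConnection(Exception):
    """Hypothetical stand-in for the {erpc, noconnection} failure in the logs."""


def call_with_retries(rpc, attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call rpc() up to `attempts` times, retrying only on NoConnection.

    The last failure is re-raised so the caller still sees the error
    when the remote node never becomes reachable.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except NoConnection:
            if attempt == attempts:
                raise
            # Give the distribution connection time to come up before retrying.
            sleep(delay_s)
```

Only the connection failure is retried here; any other exception propagates immediately, so a real incompatibility would not be hidden behind retries.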
We stumbled over this through user error in #10100. As requested, here are the step-by-step instructions to get the same error message. Bear in mind that this happened to me only because I forgot the "rabbit@" prefix when calling join_cluster:
$ docker network create test_network
1947438e01b9cced503ba3044be1afb1f5a6225fb64d265257b3547b947cad64
$ docker run -d --network test_network --name rabbit1 --privileged -v $(pwd)/cookie:/var/lib/rabbitmq/.erlang.cookie pivotalrabbitmq/rabbitmq:main-otp-max-bazel
b29a66ec3350cb7ee60975d3a1b8c0bd7918313f30833be76a113d0ea0c78590
$ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b29a66ec3350 pivotalrabbitmq/rabbitmq:main-otp-max-bazel "docker-entrypoint.s…" 38 seconds ago Up 36 seconds 1883/tcp, 4369/tcp, 5551-5552/tcp, 5671-5672/tcp, 8883/tcp, 15670-15676/tcp, 15691-15692/tcp, 25672/tcp, 61613-61614/tcp rabbit1
$ docker exec -it b2 /bin/bash
root@b29a66ec3350:/# rabbitmqctl join_cluster this_node_does_not_exist
Clustering node rabbit@b29a66ec3350 with this_node_does_not_exist
13:03:53.487 [error] Feature flags: error while running:
Feature flags: rabbit_ff_controller:running_nodes[]
Feature flags: on node `this_node_does_not_exist@b29a66ec3350`:
Feature flags: exception error: {erpc,noconnection}
Feature flags: in function erpc:call/5 (erpc.erl, line 710)
Feature flags: in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1377)
Feature flags: in call from rabbit_ff_controller:list_nodes_clustered_with/1 (rabbit_ff_controller.erl, line 477)
Feature flags: in call from rabbit_ff_controller:check_node_compatibility_task/2 (rabbit_ff_controller.erl, line 389)
Feature flags: in call from rabbit_db_cluster:can_join/1 (rabbit_db_cluster.erl, line 65)
Feature flags: in call from rabbit_db_cluster:join/2 (rabbit_db_cluster.erl, line 97)
Feature flags: in call from erpc:execute_call/4 (erpc.erl, line 589)
Error:
{:aborted_feature_flags_compat_check, {:error, {:erpc, :noconnection}}}
root@b29a66ec3350:/#
It's not clear to me from this log what exactly logs this message: the node or the shell where rabbitmqctl join_cluster this_node_does_not_exist is executed?
In any case, join_cluster should bail out early if it cannot contact the node it is asked to join.
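A rough sketch of that "bail early" idea, again in Python for illustration and not RabbitMQ's implementation: the reachability probe runs first, and the feature flags compatibility check is only attempted if the probe succeeds. `NodeUnreachable`, `join_cluster`, `ping`, `check_compat`, and `join` are all hypothetical names.

```python
class NodeUnreachable(Exception):
    """Hypothetical error raised before any compatibility check is attempted."""


def join_cluster(target, ping, check_compat, join):
    """Join `target` only if it answers a reachability probe first.

    ping(target) -> bool, check_compat(target) and join(target) are
    caller-supplied callables; none of them is a real RabbitMQ API.
    """
    if not ping(target):
        hint = ""
        if "@" not in target:
            # Mirrors the user error in the transcript: the name part
            # before the host (e.g. "rabbit@") was left out.
            hint = " (node name has no '@'; was the name prefix forgotten?)"
        raise NodeUnreachable(f"cannot contact {target}{hint}")
    check_compat(target)
    join(target)
```

With this ordering, an unreachable or mistyped node produces one clear error instead of a feature flags stack trace.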